Last checked: 5 minutes ago
Get notified about any outages, downtime or incidents for Pusher and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for Pusher.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for Pusher:
Component | Status |
---|---|
Beams | Active |
Beams dashboard | Active |
Channels Dashboard | Active |
Channels presence channels | Active |
Channels Pusher.js CDN | Active |
Channels REST API | Active |
Channels Stats Integrations | Active |
Channels Webhooks | Active |
Channels WebSocket client API | Active |
Marketing Website | Active |
Payment API | Active |
View the latest incidents for Pusher and check for official updates:
Description:
# Summary
On March 17th, at 05:48 UTC, on-call engineers were paged by our alerting system about a problem with webhooks and increased API errors on MT1. API requests and webhooks related to presence channel members and channel existence were affected for 50% of apps using these features until the incident was resolved at 06:37 UTC. The affected Redis shard cluster is only responsible for those specific features. There are two Redis shard clusters in MT1; one of them went down due to problems in the underlying infrastructure, and bringing it back up resolved the incident.
# Incident Timeline
* At 05:44 UTC, a single Redis shard cluster on MT1 went down.
* At 06:04 UTC, an engineer investigating the issue identified a possible solution and began implementing a resolution.
* At 06:31 UTC, all nodes for the affected Redis shard cluster were replaced.
* At 06:37 UTC, all affected systems were operational and the incident was resolved.
# Root Cause
A single Redis shard cluster had repeated failures during full synchronisation from the primary instance to the replicas due to insufficient disk space on the primary instance. At some point, the full-synchronisation requests from two replicas aligned in a way that caused the memory of the primary instance to grow continuously until it ran out of memory, at which point the primary instance went down.
# How will we ensure this does not happen again?
* Check the disk space provisioned for Redis nodes and make sure they have sufficient disk space to support full synchronisation.
* Add monitoring to alert us when disk space on Redis nodes is running low (a minimal monitoring sketch follows this entry).
* Investigate why partial synchronisation failed in the first place, which resulted in the need for full synchronisation.
Status: Postmortem
Impact: None | Started At: March 17, 2023, 6 a.m.
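The remediation above calls for alerting when disk space on Redis nodes runs low. As a minimal sketch of that kind of check, run on the primary itself (the redis-py client, data directory path, and headroom factor are illustrative assumptions, not details from Pusher's report), a script could compare free disk space against the memory an RDB snapshot for a full synchronisation would roughly require:

```python
# Hypothetical disk-space check for a Redis primary; run on the node itself.
# Connection details, the data directory and the headroom factor are assumptions.
import shutil
import redis

RDB_DIR = "/var/lib/redis"   # assumed Redis data directory
HEADROOM = 1.5               # assumed safety factor for a full-sync RDB dump

def check_full_sync_headroom(host: str = "localhost", port: int = 6379) -> None:
    r = redis.Redis(host=host, port=port)
    used_memory = r.info("memory")["used_memory"]      # bytes currently held in RAM
    replicas = r.info("replication").get("connected_slaves", 0)
    free_disk = shutil.disk_usage(RDB_DIR).free

    # A full synchronisation forks and writes an RDB snapshot to disk, so the
    # primary needs roughly used_memory bytes free (plus headroom) to serve it.
    required = int(used_memory * HEADROOM)
    if free_disk < required:
        print(f"WARNING: {free_disk} bytes free, ~{required} needed for a full sync "
              f"({replicas} replicas attached)")
    else:
        print("OK: enough disk headroom for a full synchronisation")

if __name__ == "__main__":
    check_full_sync_headroom()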
Description: Elevated error rate in the US2 cluster - Channels API
# Summary
Between 16:44 UTC on March 16th and 00:03 UTC on March 17th, we experienced an increased error rate in the US2 cluster. This resulted in higher than normal publish latency and a portion of the traffic receiving 5XX and timeout errors. The main cause was identified as bandwidth saturation, but a bug in the monitoring dashboard, as well as difficulty adding more resources, slowed the investigation and resolution. The issue resurfaced on March 17th, between 15:43 and 21:23, impacting only up to 1% of the US2 cluster's traffic. Note: all times mentioned in this postmortem report are in UTC unless otherwise specified.
# Timeline
On March 16th, at 16:44, we received notifications about an increased error rate in the US2 cluster. The on-call engineers were quickly paged and began investigating. They observed that publish latency had increased and that a portion of the traffic was receiving 5XX and timeout errors. Our engineers initially suspected an issue with Redis, as some containers were restarting due to connection issues, but the Redis health dashboard showed no abnormalities, and it was confirmed that the ongoing issue was not related to recent deployments. Further investigation indicated that the problem lay within our Redis clusters: every Channels cluster uses multiple Redis clusters for various responsibilities, and engineers observed that the application was reporting connectivity errors for two different Redis clusters. At 17:45, a manual failover was performed in one of the Redis clusters, resulting in slight improvement, but the issue remained unresolved. At 18:10, more engineers joined the incident response team and discovered that the problem was due to bandwidth saturation, as seen on the AWS monitoring dashboard. Our internal monitoring dashboard was underreporting bandwidth values due to a bug in byte conversion, which had prevented the incident responders from identifying the problem earlier. At 19:14, a decision was made to resize the affected Redis nodes using larger instance sizes, and all affected instances were subsequently refreshed.

While containers stopped reporting connectivity issues in their logs, problems persisted in our messaging Redis cluster. The incident response team quickly decided to add additional shards for Redis messaging. However, this was the first time that resharding was required in our AWS EKS environment, as we had recently migrated the US2 cluster from our self-managed Kubernetes infrastructure, and engineers had to introduce the necessary parameters in the ConfigMaps for our EKS cluster, which took some time. Resharding is a slow, multi-phase process, but it allows us to make changes to a production cluster without losing any writes, and we have a long grace period to drain socket connections, ensuring that similar operations would not impact a large portion of our traffic. Preparations for resharding were completed by 21:15, and the first phase of resharding started. By 00:03, resharding was completed and everything was operational.

On March 17th, while the team was still investigating the first incident, the issue resurfaced at 15:43, but on a much smaller scale, affecting up to 1% of the traffic. The team quickly responded by adding more capacity to the cluster.
# Root cause
Redis nodes in the affected cluster had been operating beyond their network baseline capacity during peak hours for some time, but we had not been proactively monitoring this. AWS provides burst network performance, and evidently we had always had enough network I/O credits to allow instances to use burst bandwidth. These instances also earn network I/O credits during off-peak hours in the US2 cluster. We never rely on I/O credits, but we believe a combination of these factors contributed to us not noticing capacity issues in the cluster until now. The incident occurred when we experienced an unusual amount of traffic. Instance burst is provided on a best-effort basis, even when the instance has credits available: because burst bandwidth is a shared resource, AWS never guarantees that it can be allocated, and we believe that is precisely what happened that night to a few of our Redis nodes.

Further investigation by our engineering team led to a new theory: one of our high-volume customers may be triggering an edge-case bug in our system due to a unique combination of factors in their use case, following a recent change they made in their application. We require additional data to confirm this theory. Our next course of action is to isolate their traffic, reduce the load on the US2 cluster, and continue with our investigation. We will provide timely updates as we make progress.
# How will we ensure this does not happen again?
To prevent similar incidents in the future, we have taken several steps. Firstly, we will implement monitoring to identify nodes that regularly exceed their capacity during peak hours. This will allow us to perform a right-sizing exercise and update instances with similar capacity issues more frequently; our long-term goal is to automate this process to ensure continuous optimization. Secondly, we will fix the bug in our Redis dashboard so that network performance is accurately reported. Additionally, we will add native AWS metrics to our Grafana dashboard to provide better visibility into our infrastructure's network performance (a sketch of pulling these metrics follows this entry).
Status: Postmortem
Impact: Minor | Started At: March 16, 2023, 5:27 p.m.
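The remediation above mentions fixing a byte-conversion bug and adding native AWS metrics to the dashboards. Purely as an illustrative sketch (the instance ID, region and period are placeholders, not Pusher's actual configuration), this is the sort of CloudWatch query and bytes-to-megabits conversion involved:

```python
# Illustrative only: pulling EC2 NetworkOut from CloudWatch and converting it
# to an average bandwidth figure. Instance ID, region and period are placeholders.
from datetime import datetime, timedelta, timezone
import boto3

PERIOD = 300  # seconds per CloudWatch datapoint

def average_network_out_mbps(instance_id: str, region: str = "us-west-2") -> list[float]:
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=1)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkOut",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=PERIOD,
        Statistics=["Sum"],
    )
    # NetworkOut is reported in bytes over the period; convert to megabits per
    # second. Mixing up bytes vs bits, or per-period vs per-second, is exactly
    # the kind of conversion bug that makes a dashboard underreport bandwidth.
    return [dp["Sum"] * 8 / PERIOD / 1_000_000 for dp in resp["Datapoints"]]

if __name__ == "__main__":
    print(average_network_out_mbps("i-0123456789abcdef0"))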
Description: Datadog have confirmed the resolution of the issue in their platform and our stats integration should now function as expected.
Status: Resolved
Impact: Minor | Started At: March 8, 2023, 9:13 a.m.
Description: During a migration to new infrastructure, we encountered a bug causing high latency and timeout errors when publishing messages to the API. Engineers were notified at 15:31 and began investigating. At 15:45, we decided to roll back the migration, which reduced latencies and restored normal service by 16:48. Although this migration was a routine task that had been completed successfully on other clusters without any issues, in the case of the mt1 and us2 clusters we observed that the sidecar proxy container we deploy alongside our application failed to report connection errors due to a bug in its readiness probe. This allowed unhealthy pods to continue operating, ultimately reducing the effective cluster capacity. We have since identified and implemented a fix to prevent this issue from recurring (a sketch of a connectivity-aware readiness check follows this entry). It is important to note that while we have corrected the issue with the readiness probe in our proxy sidecar container, we are still actively investigating the root cause of the connection issue that occurred in those clusters. We will not attempt another migration until all the issues have been resolved and confirmed via rigorous testing.
Status: Postmortem
Impact: Minor | Started At: Feb. 28, 2023, 3:50 p.m.
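The fix described above is about making the sidecar's readiness probe surface connection errors so that Kubernetes stops routing traffic to unhealthy pods. A hypothetical sketch of such a connectivity-aware readiness endpoint follows; the upstream address, probe port and /ready path are all assumed for illustration and are not Pusher's actual implementation:

```python
# Hypothetical readiness endpoint for a sidecar proxy: marks the pod unready
# when the proxy cannot reach its upstream. Host, port and path are assumptions.
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = ("127.0.0.1", 6379)   # assumed upstream the sidecar proxies to
PROBE_PORT = 8081                # assumed port the kubelet readiness probe hits

def upstream_reachable(timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection(UPSTREAM, timeout=timeout):
            return True
    except OSError:
        return False

class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ready" and upstream_reachable():
            self.send_response(200)
        else:
            # Returning 503 lets Kubernetes take the pod out of rotation instead
            # of letting an unhealthy pod keep serving and shrink effective capacity.
            self.send_response(503)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", PROBE_PORT), ReadinessHandler).serve_forever()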
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: Feb. 21, 2023, 10:40 a.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.