Last checked: 5 minutes ago
Get notified about any outages, downtime or incidents for Pusher and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for Pusher.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for Pusher:
Component | Status |
---|---|
Beams | Active |
Beams dashboard | Active |
Channels Dashboard | Active |
Channels presence channels | Active |
Channels Pusher.js CDN | Active |
Channels REST API | Active |
Channels Stats Integrations | Active |
Channels Webhooks | Active |
Channels WebSocket client API | Active |
Marketing Website | Active |
Payment API | Active |
View the latest incidents for Pusher and check for official updates:
Description:
# Summary
On March 17th, at 05:48 UTC, on-call engineers were paged by our alerting system about a problem with webhooks and increased API errors on MT1. API requests and webhooks related to presence channel members and channel existence were affected for 50% of apps using these features until the incident was resolved at 06:37 UTC. The affected Redis shard cluster is only responsible for those specific features. There are two Redis shard clusters in MT1; one of them went down due to problems in the underlying infrastructure, and bringing it back up resolved the incident.
# Incident Timeline
* At 05:44 UTC, a single Redis shard cluster on MT1 went down.
* At 06:04 UTC, an engineer investigating the issue identified a possible solution and began implementing a resolution.
* At 06:31 UTC, all nodes for the affected Redis shard cluster were replaced.
* At 06:37 UTC, all affected systems were operational and the incident was resolved.
# Root Cause
A single Redis shard cluster had repeated failures during full synchronisation from the primary instance to the replicas due to insufficient disk space on the primary instance. At some point, the full-synchronisation requests from two replicas aligned in a way that caused the memory of the primary instance to grow continuously until it ran out of memory, at which point the primary instance went down.
# How will we ensure this does not happen again?
* Check the disk space provisioned for Redis nodes and make sure they have sufficient disk space to support full synchronisation.
* Add monitoring to alert us when disk space on Redis nodes is running low (a minimal monitoring sketch follows this entry).
* Investigate why partial synchronisation failed in the first place, which resulted in the need for full synchronisation.
Status: Postmortem
Impact: None | Started At: March 17, 2023, 6 a.m.
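The remediation above calls for alerting when disk space on Redis nodes runs low. As a minimal sketch of that kind of check, run on the primary itself (the redis-py client, data directory path, and headroom factor are illustrative assumptions, not details from Pusher's report), a script could compare free disk space against the memory an RDB snapshot for a full synchronisation would roughly require:

```python
# Hypothetical disk-space check for a Redis primary; run on the node itself.
# Connection details, the data directory and the headroom factor are assumptions.
import shutil
import redis

RDB_DIR = "/var/lib/redis"   # assumed Redis data directory
HEADROOM = 1.5               # assumed safety factor for a full-sync RDB dump

def check_full_sync_headroom(host: str = "localhost", port: int = 6379) -> None:
    r = redis.Redis(host=host, port=port)
    used_memory = r.info("memory")["used_memory"]      # bytes currently held in RAM
    replicas = r.info("replication").get("connected_slaves", 0)
    free_disk = shutil.disk_usage(RDB_DIR).free

    # A full synchronisation forks and writes an RDB snapshot to disk, so the
    # primary needs roughly used_memory bytes free (plus headroom) to serve it.
    required = int(used_memory * HEADROOM)
    if free_disk < required:
        print(f"WARNING: {free_disk} bytes free, ~{required} needed for a full sync "
              f"({replicas} replicas attached)")
    else:
        print("OK: enough disk headroom for a full synchronisation")

if __name__ == "__main__":
    check_full_sync_headroom()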
Description: Elevated error rate in the US2 cluster - Channels API
# Summary
Between 16:44 UTC on March 16th and 00:03 UTC on March 17th, we experienced an increased error rate in the US2 cluster. This resulted in higher than normal publish latency and a portion of the traffic receiving 5XX and timeout errors. The main cause was identified as bandwidth saturation, but a bug in the monitoring dashboard, as well as difficulty adding more resources, slowed the investigation and resolution. The issue resurfaced on March 17th, between 15:43 and 21:23, impacting only up to 1% of the US2 cluster's traffic. Note: all times mentioned in this postmortem report are in UTC unless otherwise specified.
# Timeline
On March 16th, at 16:44, we received notifications about an increased error rate in the US2 cluster. The on-call engineers were quickly paged and began investigating. They observed that publish latency had increased and that a portion of the traffic was receiving 5XX and timeout errors. Our engineers initially suspected an issue with Redis, as some containers were restarting due to connection issues, but the Redis health dashboard showed no abnormalities, and it was confirmed that the ongoing issue was not related to recent deployments. Further investigation indicated that the problem lay within our Redis clusters: every Channels cluster uses multiple Redis clusters for various responsibilities, and engineers observed that the application was reporting connectivity errors for two different Redis clusters. At 17:45, a manual failover was performed in one of the Redis clusters, resulting in slight improvement, but the issue remained unresolved. At 18:10, more engineers joined the incident response team and discovered that the problem was due to bandwidth saturation, as seen on the AWS monitoring dashboard. Our internal monitoring dashboard was underreporting bandwidth values due to a bug in byte conversion, which had prevented the incident responders from identifying the problem earlier. At 19:14, a decision was made to resize the affected Redis nodes using larger instance sizes, and all affected instances were subsequently refreshed.

While containers stopped reporting connectivity issues in their logs, problems persisted in our messaging Redis cluster. The incident response team quickly decided to add additional shards for Redis messaging. However, this was the first time that resharding was required in our AWS EKS environment, as we had recently migrated the US2 cluster from our self-managed Kubernetes infrastructure, and engineers had to introduce the necessary parameters in the ConfigMaps for our EKS cluster, which took some time. Resharding is a slow, multi-phase process, but it allows us to make changes to a production cluster without losing any writes, and we have a long grace period to drain socket connections, ensuring that similar operations would not impact a large portion of our traffic. Preparations for resharding were completed by 21:15, and the first phase of resharding started. By 00:03, resharding was completed and everything was operational.

On March 17th, while the team was still investigating the first incident, the issue resurfaced at 15:43, but on a much smaller scale, affecting up to 1% of the traffic. The team quickly responded by adding more capacity to the cluster.
# Root cause
Redis nodes in the affected cluster had been operating beyond their network baseline capacity during peak hours for some time, but we had not been proactively monitoring this. AWS provides burst network performance, and evidently we had always had enough network I/O credits to allow instances to use burst bandwidth. These instances also earn network I/O credits during off-peak hours in the US2 cluster. We never rely on I/O credits, but we believe a combination of these factors contributed to us not noticing capacity issues in the cluster until now. The incident occurred when we experienced an unusual amount of traffic. Instance burst is provided on a best-effort basis, even when the instance has credits available: because burst bandwidth is a shared resource, AWS never guarantees that it can be allocated, and we believe that is precisely what happened that night to a few of our Redis nodes.

Further investigation by our engineering team led to a new theory: one of our high-volume customers may be triggering an edge-case bug in our system due to a unique combination of factors in their use case, following a recent change they made in their application. We require additional data to confirm this theory. Our next course of action is to isolate their traffic, reduce the load on the US2 cluster, and continue with our investigation. We will provide timely updates as we make progress.
# How will we ensure this does not happen again?
To prevent similar incidents in the future, we have taken several steps. Firstly, we will implement monitoring to identify nodes that regularly exceed their capacity during peak hours. This will allow us to perform a right-sizing exercise and update instances with similar capacity issues more frequently; our long-term goal is to automate this process to ensure continuous optimization. Secondly, we will fix the bug in our Redis dashboard so that network performance is accurately reported. Additionally, we will add native AWS metrics to our Grafana dashboard to provide better visibility into our infrastructure's network performance (a sketch of pulling these metrics follows this entry).
Status: Postmortem
Impact: Minor | Started At: March 16, 2023, 5:27 p.m.
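The remediation above mentions fixing a byte-conversion bug and adding native AWS metrics to the dashboards. Purely as an illustrative sketch (the instance ID, region and period are placeholders, not Pusher's actual configuration), this is the sort of CloudWatch query and bytes-to-megabits conversion involved:

```python
# Illustrative only: pulling EC2 NetworkOut from CloudWatch and converting it
# to an average bandwidth figure. Instance ID, region and period are placeholders.
from datetime import datetime, timedelta, timezone
import boto3

PERIOD = 300  # seconds per CloudWatch datapoint

def average_network_out_mbps(instance_id: str, region: str = "us-west-2") -> list[float]:
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=1)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkOut",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=PERIOD,
        Statistics=["Sum"],
    )
    # NetworkOut is reported in bytes over the period; convert to megabits per
    # second. Mixing up bytes vs bits, or per-period vs per-second, is exactly
    # the kind of conversion bug that makes a dashboard underreport bandwidth.
    return [dp["Sum"] * 8 / PERIOD / 1_000_000 for dp in resp["Datapoints"]]

if __name__ == "__main__":
    print(average_network_out_mbps("i-0123456789abcdef0"))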
Description: Datadog have confirmed the resolution of the issue in their platform and our stats integration should now function as expected.
Status: Resolved
Impact: Minor | Started At: March 8, 2023, 9:13 a.m.
Description: During a migration to new infrastructure, we encountered a bug causing high latency and timeout errors when publishing messages to the API. Engineers were notified at 15:31 and began investigating. At 15:45, we decided to roll back the migration, which reduced latencies and restored normal service by 16:48. Although this migration was a routine task that had been completed successfully on other clusters without any issues, in the case of the mt1 and us2 clusters we observed that the sidecar proxy container we deploy alongside our application failed to report connection errors due to a bug in its readiness probe. This allowed unhealthy pods to continue operating, ultimately reducing the effective cluster capacity. We have since identified and implemented a fix to prevent this issue from recurring (a sketch of a connectivity-aware readiness check follows this entry). It is important to note that while we have corrected the issue with the readiness probe in our proxy sidecar container, we are still actively investigating the root cause of the connection issue that occurred in those clusters. We will not attempt another migration until all the issues have been resolved and confirmed via rigorous testing.
Status: Postmortem
Impact: Minor | Started At: Feb. 28, 2023, 3:50 p.m.
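The fix described above is about making the sidecar's readiness probe surface connection errors so that Kubernetes stops routing traffic to unhealthy pods. A hypothetical sketch of such a connectivity-aware readiness endpoint follows; the upstream address, probe port and /ready path are all assumed for illustration and are not Pusher's actual implementation:

```python
# Hypothetical readiness endpoint for a sidecar proxy: marks the pod unready
# when the proxy cannot reach its upstream. Host, port and path are assumptions.
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = ("127.0.0.1", 6379)   # assumed upstream the sidecar proxies to
PROBE_PORT = 8081                # assumed port the kubelet readiness probe hits

def upstream_reachable(timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection(UPSTREAM, timeout=timeout):
            return True
    except OSError:
        return False

class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ready" and upstream_reachable():
            self.send_response(200)
        else:
            # Returning 503 lets Kubernetes take the pod out of rotation instead
            # of letting an unhealthy pod keep serving and shrink effective capacity.
            self.send_response(503)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", PROBE_PORT), ReadinessHandler).serve_forever()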
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: Feb. 21, 2023, 10:40 a.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.