Outage and incident data over the last 30 days for Rollbar.
OutLogger tracks the status of these components for Rollbar:
Component | Status |
---|---|
API Tier (api.rollbar.com) | Active |
Rollbar Docs | Active |
rollbar.min.js | Active |
SCIM and SSO | Active |
Web App (rollbar.com) | Active |
External notification services | Active |
Mailgun Outbound Delivery | Active |
Mailgun SMTP | Active |
Processing pipeline | Active |
Core Processing Pipeline | Active |
iOS Symbolication pipeline | Active |
Proguard processing pipeline | Active |
Source map symbolication pipeline | Active |
View the latest incidents for Rollbar and check for official updates:
Description:

# Incident Report: Platform API Outage at Rollbar on April 27th, 2023

## Summary of the Incident and Impact

On April 27th, 2023, between 9:23 pm and 11:08 pm PT, Rollbar experienced a platform outage affecting the Public API servers ([api.rollbar.com](http://api.rollbar.com)). This incident occurred during a system maintenance that aimed to rebalance cluster-node-pools for the analytics processing system. The platform outage rendered the Public API servers inaccessible to the load balancer, causing all requests to fail with a 502 error. The issue stemmed from a configuration error that resulted in a longer than expected resolution time. Following this incident, the trailing-12-month uptime of the API tier as measured by our external monitoring service is 99.96%.

## Detailed Account of the Incident

The incident was triggered by the removal of a configuration value from Helm, a tool used to configure Kubernetes, during the process of updating and simplifying the configuration. The deployment team was unaware that the removed value was still being utilized by the load balancer templates to determine which deployment to reference. This oversight was due to the template logic originating from a system swap-over in May of 2022.

Initially, the team believed that the configuration changes could not have impacted the Public API subsystem, and they attempted to roll back the changes. However, the rollback procedure only reverted the code version and not the configuration templates. The on-call team then investigated the issue further, as they were unsure whether the problem lay with the deployments. Upon identifying the root cause, the team deployed a revised configuration to resolve the incident.

## Follow-Up Actions

To mitigate future risks and avoid similar incidents, Rollbar will undertake the following actions:

1. Rollbar’s engineering team is planning a deep review and clean-up of stale or outdated Helm charts. This process will involve revising existing charts to ensure they are up-to-date and relevant.
2. Rollbar will implement automated rollback processes for the Public APIs using canary deployments. This approach will reduce the need for manual intervention and allow the system to revert to a "last good" state automatically in the event of an issue.
3. Additional training will be provided to the team on rollback procedures, specifically focusing on Helm-only rollbacks. Furthermore, the tooling will be updated to ensure that Rollbar can roll back code versions and Helm configurations separately, as needed.

By implementing these follow-up actions, Rollbar aims to minimize the chances of similar incidents occurring in the future and ensure a more robust and reliable platform for its users.
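
The failure mode described in the postmortem, a template that still reads a value after it has been removed from the configuration, is the kind of problem a static check can catch before a deploy. The sketch below is an illustration only and not Rollbar's tooling: it scans template text for `.Values.<path>` references and reports any path that is absent from a values dictionary. The function names and the sample key `api.activeDeployment` are hypothetical, and the regular expression deliberately ignores Helm constructs such as `with` scopes, `index`, and `default`.

```python
import re

# Matches simple Helm-style references such as `.Values.api.activeDeployment`.
# Assumption: only plain dotted paths are handled, not `index` or `with` blocks.
VALUE_REF = re.compile(r"\.Values\.([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*)")


def find_value_refs(template_text):
    """Return the set of dotted value paths referenced in a template."""
    return set(VALUE_REF.findall(template_text))


def is_defined(values, dotted_path):
    """Walk a nested dict and report whether every key in the path exists."""
    node = values
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


def missing_values(template_text, values):
    """List referenced paths that the values dictionary does not define."""
    return sorted(
        path for path in find_value_refs(template_text)
        if not is_defined(values, path)
    )


# Hypothetical load balancer template that still reads a removed key.
template = (
    "service: {{ .Values.api.activeDeployment }}-svc\n"
    "port: {{ .Values.api.port }}\n"
)
values = {"api": {"port": 8080}}  # `activeDeployment` was removed

print(missing_values(template, values))
```

Run against the sample data, the check flags `api.activeDeployment` as referenced but undefined. A check of this shape would need to run over every chart that shares the values file, since the postmortem notes the dependency lived in a different subsystem's templates than the one being changed.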
Status: Postmortem
Impact: Critical | Started At: April 28, 2023, 5:04 a.m.
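
The outage window and the uptime figure quoted in the postmortem can be related with a little arithmetic. The sketch below assumes a 365-day trailing window and treats the published 99.96% as exact, although it is presumably rounded, so the result is a rough estimate rather than Rollbar's own accounting. The helper names are hypothetical.

```python
from datetime import datetime


def outage_minutes(start, end):
    """Length of an outage in minutes."""
    return (end - start).total_seconds() / 60


def implied_downtime_minutes(uptime_percent, window_days=365):
    """Total downtime implied by an uptime percentage over a window."""
    window_minutes = window_days * 24 * 60
    return (1 - uptime_percent / 100) * window_minutes


# Times from the postmortem: 9:23 pm to 11:08 pm PT on April 27th, 2023.
start = datetime(2023, 4, 27, 21, 23)
end = datetime(2023, 4, 27, 23, 8)

this_outage = outage_minutes(start, end)        # 105 minutes
yearly_total = implied_downtime_minutes(99.96)  # about 210 minutes

print(f"outage: {this_outage:.0f} min")
print(f"implied 12-month downtime: {yearly_total:.2f} min")
print(f"share of 12-month downtime: {this_outage / yearly_total:.0%}")
```

Under these assumptions the 105-minute outage accounts for roughly half of the approximately 210 minutes of downtime that a 99.96% trailing-12-month figure implies.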
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: April 24, 2023, 8:21 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: April 22, 2023, 5:39 a.m.