Last checked: 7 minutes ago
Get notified about any outages, downtime or incidents for Rollbar and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Rollbar.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Rollbar:
Component | Status |
---|---|
API Tier (api.rollbar.com) | Active |
Rollbar Docs | Active |
rollbar.min.js | Active |
SCIM and SSO | Active |
Web App (rollbar.com) | Active |
External notification services | Active |
Mailgun Outbound Delivery | Active |
Mailgun SMTP | Active |
Processing pipeline | Active |
Core Processing Pipeline | Active |
iOS Symbolication pipeline | Active |
Proguard processing pipeline | Active |
Source map symbolication pipeline | Active |
View the latest incidents for Rollbar and check for official updates:
Description: This incident has been resolved.
Status: Resolved
Impact: None | Started At: May 15, 2023, 9 a.m.
Description: This incident has been resolved.
Status: Resolved
Impact: None | Started At: May 15, 2023, 6:34 a.m.
Description: This incident has been resolved.
Status: Resolved
Impact: None | Started At: May 15, 2023, 6:34 a.m.
Description:

# Incident Report: Web Outage at Rollbar on April 28th, 2023

## Summary of Incident and Impact

Between 3:45am and 4:19am Pacific Time, Rollbar experienced an outage in its web application due to forced upgrades by our Cloud Provider on a node-pool used exclusively by the web application. The configuration of the web application was overly specific, which, when combined with the simultaneous updating of the node-pool by the Cloud Provider, resulted in Kubernetes being unable to schedule the pods. Consequently, the web application experienced slow performance from 3:45am until 4:00am, and it was rendered completely unavailable from 4:06am to 4:19am. The issue was resolved by updating the tolerations and taints for the workload to allow it to use a more diverse set of pools within our cluster.

## Detailed Account of the Incident

At 3:45am PT, the web application began experiencing slow performance due to the Cloud Provider initiating forced upgrades on a node-pool dedicated to the web application. As a result of the overly specific configuration of the web application and the simultaneous updating of the entire node-pool by the Cloud Provider, Kubernetes was unable to schedule the necessary pods. The slow performance persisted until 4:00am, when the web application's availability began to degrade further. By 4:06am, the web application was completely unavailable. To resolve the issue, the team updated the tolerations and taints for the workload, allowing it to utilize a broader range of pools in our cluster. This action successfully resolved the problem, and the web application was restored to full functionality by 4:19am PT.

## Follow-Up Actions

To mitigate the risk of future outages and ensure the continued stability of the platform, the following actions are being implemented:

1. Removal of the dedicated node-pool for the web application: This action has already been completed, allowing the web application to utilize a more diverse range of node-pools and preventing a single point of failure.
2. Improvements to monitoring and alerting: Updates to our monitoring and alerting systems will be made to better detect and manage scheduling issues in Kubernetes, ultimately improving our response time to potential issues.
3. Enhancements to the web application's auto-scaling and alerting: Work is underway to improve the auto-scaling capabilities of the web application, with a focus on directly tying these improvements into alerting systems for better responsiveness and reliability.
Status: Postmortem
Impact: None | Started At: April 28, 2023, 11:30 a.m.
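The scheduling failure described in the April 28 report can be illustrated with a toy model. The sketch below is not Rollbar's configuration and not the Kubernetes scheduler: the pool names, the `dedicated` taint key, the `pool` selector, and the `schedulable_nodes` helper are all hypothetical, and real toleration matching (operators, effects) is richer than the exact-match check used here. It only shows why a workload pinned to one pool has nowhere to run while that whole pool is being upgraded, and why broadening its tolerations restores candidates.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Taint:
    # Hypothetical, simplified taint: a key/value pair that repels pods.
    key: str
    value: str


@dataclass
class Node:
    name: str
    pool: str
    taints: tuple = ()
    schedulable: bool = True  # False while the node is cordoned for an upgrade


@dataclass
class Pod:
    name: str
    # Assumed stand-in for an "overly specific" placement rule.
    node_selector: dict = field(default_factory=dict)
    # Taints this pod is allowed to ignore (exact match only in this model).
    tolerations: tuple = ()


def schedulable_nodes(pod, nodes):
    """Return names of nodes this simplified model would accept for the pod."""
    accepted = []
    for node in nodes:
        if not node.schedulable:
            continue  # node is being upgraded
        wanted_pool = pod.node_selector.get("pool")
        if wanted_pool is not None and node.pool != wanted_pool:
            continue  # pod is pinned to a different pool
        if not all(taint in pod.tolerations for taint in node.taints):
            continue  # pod does not tolerate one of the node's taints
        accepted.append(node.name)
    return accepted


web_taint = Taint("dedicated", "web")
general_taint = Taint("dedicated", "general")

# The entire dedicated pool is unavailable at once; another pool is healthy.
nodes = [
    Node("web-1", "web", (web_taint,), schedulable=False),
    Node("web-2", "web", (web_taint,), schedulable=False),
    Node("general-1", "general", (general_taint,)),
]

pinned = Pod("web-app", {"pool": "web"}, (web_taint,))
relaxed = Pod("web-app", {}, (web_taint, general_taint))

print(schedulable_nodes(pinned, nodes))   # []
print(schedulable_nodes(relaxed, nodes))  # ['general-1']
```

In this model the pinned pod has no candidate nodes until the upgrade finishes, while the relaxed pod can land on the healthy pool, which mirrors the remediation the report describes.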