Get notified about any outages, downtime or incidents for Rollbar and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Rollbar.
OutLogger tracks the status of these components for Rollbar:
Component | Status |
---|---|
API Tier (api.rollbar.com) | Active |
Rollbar Docs | Active |
rollbar.min.js | Active |
SCIM and SSO | Active |
Web App (rollbar.com) | Active |
External notification services | Active |
Mailgun Outbound Delivery | Active |
Mailgun SMTP | Active |
Processing pipeline | Active |
Core Processing Pipeline | Active |
iOS Symbolication pipeline | Active |
Proguard processing pipeline | Active |
Source map symbolication pipeline | Active |
View the latest incidents for Rollbar and check for official updates:
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: Feb. 26, 2024, 10:14 p.m.
Description:

# **Summary of the Incident and Impact**

On February 3rd, 2024, between 03:37 and 06:35 PST, Rollbar experienced a platform outage affecting the Web Application ([rollbar.com](http://rollbar.com)) and Platform API ([api.rollbar.com](http://api.rollbar.com)) servers. The cause of these outages can be traced to an automated update by Google Cloud Platform to Rollbar’s GKE (Google Kubernetes Engine) clusters.

The upgrade removed firewall rules needed for health checks originating from Google Cloud Application Load Balancers (ALBs); the ALBs require those health checks to pass before they send traffic to application servers. Our default network firewall security posture is very strict, and removing rules has significant consequences because we disallow all IP traffic on the relevant ports. With the firewall rules gone, the ALB health checks could no longer reach workloads on the GKE clusters, so the load balancers registered all workloads as unhealthy.

Initially, it was unclear what had happened, as Rollbar had deployed no code changes and had made no direct changes to any infrastructure. Not knowing that the firewall rules had been removed, we attempted to restart applications and create new load balancers from roughly 03:37 to 05:08. At 05:08, a support ticket was opened with Rollbar’s cloud services provider and Google to help resolve the issue. At 05:11, engineers from the cloud services provider, Google, and Rollbar joined a teleconference to discuss the issue. After 75 minutes on the support call, the cloud services provider and Google determined that the firewall rules had been removed by the GKE upgrade. Starting at 06:28, Rollbar created new firewall rules and resolved the load balancer health issues, restoring service for the Platform API and Web Application. By 06:35, all services were fully restored.

Following this incident, the trailing-12-month uptime of the API tier as measured by our external monitoring service is 99.92%.

**Timeline:**

* Feb 3 03:37 PST - Both the Platform API and Web Application stop responding
* 03:37-05:08 PST - Attempts to remedy through restarts and creating new load balancers fail
* 05:08 PST - Critical support ticket created with our cloud support provider
* 05:11 PST - Teleconference call initiated with cloud services provider, Google, & Rollbar engineers
* 06:28 PST - New firewall rules recommended and added for the Web Application’s ALB
* 06:30 PST - Web Application became available
* 06:32 PST - New firewall rules recommended and added for the Platform API’s ALB
* 06:35 PST - Platform API became available

# **Follow-up Actions**

To mitigate future risks and avoid similar incidents, we have undertaken the following actions:

* To avoid the deletion of necessary firewall rules, we have created our own firewall rules rather than relying on automatically created rules.
* We have added notifications of GKE updates to our internal application performance graphs, so that we can see when updates occur when diagnosing future issues.
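
To put the timeline and the 99.92% figure in context, the durations can be worked out directly from the times quoted in the postmortem. The sketch below is only illustrative arithmetic: the helper names are hypothetical, the 365-day window is an assumption about how "trailing 12 months" is counted, and Rollbar's external monitoring service may measure uptime differently.

```python
from datetime import datetime

# Times taken from the postmortem timeline (all PST, Feb 3, 2024).
FMT = "%Y-%m-%d %H:%M"
outage_start = datetime.strptime("2024-02-03 03:37", FMT)
ticket_opened = datetime.strptime("2024-02-03 05:08", FMT)
web_app_restored = datetime.strptime("2024-02-03 06:30", FMT)
api_restored = datetime.strptime("2024-02-03 06:35", FMT)


def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


def implied_downtime_minutes(uptime_percent, period_days=365):
    """Total downtime implied by an uptime percentage over a period.

    period_days=365 is an assumption; the source does not say how the
    trailing-12-month window is counted.
    """
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - uptime_percent) / 100


api_outage = minutes_between(outage_start, api_restored)          # 178 minutes
web_outage = minutes_between(outage_start, web_app_restored)      # 173 minutes
before_ticket = minutes_between(outage_start, ticket_opened)      # 91 minutes
yearly_downtime = implied_downtime_minutes(99.92)                 # about 420 minutes

print(f"Platform API outage: {api_outage} min")
print(f"Web Application outage: {web_outage} min")
print(f"Time before support ticket: {before_ticket} min")
print(f"Downtime implied by 99.92% over 365 days: {yearly_downtime:.1f} min")
print(f"Share attributable to this incident: {api_outage / yearly_downtime:.0%}")
```

Under these assumptions, the 178-minute API outage accounts for a bit over two fifths of the roughly seven hours of downtime that a 99.92% figure implies over a year, and about half of the outage elapsed before the support ticket was opened.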
Status: Postmortem
Impact: Critical | Started At: Feb. 3, 2024, 11:51 a.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: Jan. 25, 2024, 3:14 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Critical | Started At: Jan. 8, 2024, 8:56 p.m.