Last checked: 8 minutes ago
Get notified about any outages, downtime or incidents for CircleCI and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for CircleCI.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for CircleCI:
Component | Status |
---|---|
Artifacts | Active |
Billing & Account | Active |
CircleCI Insights | Active |
CircleCI Releases | Active |
CircleCI UI | Active |
CircleCI Webhooks | Active |
Docker Jobs | Active |
Machine Jobs | Active |
macOS Jobs | Active |
Notifications & Status Updates | Active |
Pipelines & Workflows | Active |
Runner | Active |
Windows Jobs | Active |
CircleCI Dependencies | Active |
AWS | Active |
Google Cloud Platform Google Cloud DNS | Active |
Google Cloud Platform Google Cloud Networking | Active |
Google Cloud Platform Google Cloud Storage | Active |
Google Cloud Platform Google Compute Engine | Active |
mailgun API | Active |
mailgun Outbound Delivery | Active |
mailgun SMTP | Active |
OpenAI | Active |
Upstream Services | Active |
Atlassian Bitbucket API | Active |
Atlassian Bitbucket Source downloads | Active |
Atlassian Bitbucket SSH | Active |
Atlassian Bitbucket Webhooks | Active |
Docker Authentication | Active |
Docker Hub | Active |
Docker Registry | Active |
GitHub API Requests | Active |
GitHub Git Operations | Active |
GitHub Packages | Active |
GitHub Pull Requests | Active |
GitHub Webhooks | Active |
GitLab | Active |
View the latest incidents for CircleCI and check for official updates:
Description:
## Summary:

On September 11, 2024, from 10:16 to 21:00 UTC, CircleCI customers encountered multiple issues, including delays in starting jobs, slow processing of job outputs and task status messages (such as the completion of steps and tasks), dropped workflows, and a rise in infrastructure failures. These problems collectively affected all jobs during this time frame.

To address these issues, we worked to stabilize the service responsible for ingesting and serving step output and tracking the start and end times of individual steps (**step service**) until 19:00 UTC, at which point it was determined to be in a sustainable state. Despite this progress, delays in starting Mac jobs persisted until 20:30 UTC, largely due to a backlog of jobs waiting to start and a failure to properly garbage collect (GC) old virtual machines (VMs). This combination of factors contributed to a challenging operational environment for CircleCI customers. The original status page can be found [here](https://status.circleci.com/incidents/lsv2ry3jr16c).

## What Happened (All times UTC)

On September 11, 2024, for approximately 10 hours, CircleCI experienced significant service disruptions. The incident began at 10:05, when a particularly potent configuration was executed during an internal test. By 10:14 the job had ended, but efforts to generate test results led to a spike in memory usage, causing Out of Memory (OOM) errors for several internal services. This resulted in failures in processing job submissions and dispatching tasks, which impacted all customer jobs.

By 10:16, job starts across all executors had completely failed, as the service responsible for processing and storing test results, as well as handling storage of job records (**output service**), became overwhelmed and unable to service requests. An official incident was declared at 10:20. We triggered a deployment restart at 10:23, which initially allowed for some recovery before the service was again overwhelmed at approximately 10:27. To address this, we scaled the service both horizontally and vertically, which allowed it to stabilize and customer jobs to start flowing again. Throughout the incident, machine jobs faced specific challenges due to timeouts.

By 13:00, we detected abnormal resource utilization in step service, prompting us to monitor the situation closely. We believed the ongoing issues were related to a thundering herd effect stemming from an earlier incident. Between 14:47 and 15:05, we increased memory for the service processing step output, as we did multiple times throughout this incident, in an ongoing attempt to manage the backlog and prevent OOM kills.

At 16:21, to work through the built-up load from the thundering herd, we raised memory limits in multiple locations so that work could be processed without causing further outages. This marked the beginning of a significant recovery. The existing Redis cluster was under heavy CPU load, prompting a decision at 16:30 to spin up a second Redis cluster to alleviate the pressure.

![](https://global.discourse-cdn.com/circleci/original/3X/c/f/cfbf14664563f62c4d331c0aed80b152ccdc1d5c.png)
`Redis Engine CPU Utilization Impact Timeline`

By 17:00, the job queue began to decrease significantly as the service stabilized. Throughout the afternoon, we continued to monitor and adjust resources, ultimately doubling the number of Redis shards around 18:11, which had an immediate positive effect on reducing load.

During this incident, customers experienced significantly longer response times for API calls from servers running customer workloads reporting back the output of jobs. The 95th percentile (p95) response times spiked to between 5 and 15 seconds from 14:20 to 19:50, compared to the usual expectation of around 100 milliseconds. This led to degraded step output on the jobs page, including delays in displaying the output of customer steps, missing output, and, in some cases, no output at all. These delays likely resulted in slower Task performance, as sending step output to the step receiver took longer, blocking other actions within the Tasks. While the average Task runtime increased, the specific impact varied depending on the Task's contents.

![](https://global.discourse-cdn.com/circleci/original/3X/c/6/c600d16b22a0f27de0827edca9d4c55a2e56ea25.png)
`Task Wait Time`

_Linux:_

* **12:10 - 13:05:** Wait times under 1 minute.
* **13:40 - 15:30:** Degraded wait times, generally under 5 minutes.
* **15:30 - 18:05:** Wait times increased to tens of minutes, with some recovery starting around 17:45.
* **18:05 - 20:00:** Continued degraded wait times of 2-3 minutes.
* **20:00:** Fully recovered.

_Windows:_

* **12:10 - 15:40:** Degraded wait times, typically under 5 minutes.
* **15:40 - 17:15:** Wait times reached tens of minutes.
* **17:15 - 19:35:** Returned to degraded wait times, usually under 5 minutes.
* **19:35 - 19:55:** Wait times again increased to tens of minutes.
* **19:55:** Fully recovered.

_Mac OS:_

* **12:10 - 15:30:** Degraded wait times, generally under 5 minutes.
* **15:30 - 21:00:** Wait times escalated to tens of minutes.
* **21:00:** Fully recovered.

## Future Prevention and Process Improvement:

In response to this incident, we are implementing several key improvements to enhance service reliability. First, we will improve how tasks are cleared during infrastructure failures, which will help streamline operations. We will add guardrails to the system to prevent execution of pathological workloads. Additionally, we will implement a mechanism that allows us to temporarily prevent jobs that have failed due to infrastructure issues from being retried.

We will adjust Redis health checks by moving them from liveness probes to readiness probes, and we plan to increase the number of Redis shards to better distribute load and minimize the impact of a single shard being blocked. During our investigation we identified a self-reinforcing cycle of poorly performing Redis commands (scans) that was the root cause of the Redis failure, and we're going to address this as well. To enhance stability, we will introduce a timeout for step data based on the job's maximum runtime and reduce the pressure a single job can place on the S3 connection pool. We are also looking to develop a method to pause live deployments from the CircleCI app during incidents, ensuring that delayed changes do not overwrite manual adjustments made in the interim.
Status: Postmortem
Impact: Major | Started At: Sept. 11, 2024, 2:29 p.m.
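The postmortem above attributes part of the sustained load to a thundering herd of built-up work hitting the recovering step service all at once. Purely as an illustration of the general mitigation (not CircleCI's actual code), here is a minimal Python sketch of retrying with exponential backoff and full jitter, which spreads queued-up clients' retries over time instead of letting them stampede a recovering service; `send_step_output` is a hypothetical placeholder for any call to an overloaded upstream.

```python
import random
import time

def call_with_backoff(operation, max_attempts=6, base_delay=0.5, max_delay=30.0):
    """Retry `operation` with exponential backoff and full jitter.

    Randomizing each retry over a growing window keeps a crowd of waiting
    clients from all hammering a recovering service at the same instant
    (the "thundering herd" effect described in the postmortem).
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount between 0 and the capped
            # exponential backoff for this attempt.
            window = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, window))

# Hypothetical usage, where `send_step_output` stands in for a call to the
# step-output ingestion service:
# call_with_backoff(lambda: send_step_output(job_id, chunk))
```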
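The Future Prevention section also calls out a self-reinforcing cycle of poorly performing Redis scans and proposes a timeout for step data tied to maximum job runtime. As a sketch of those general techniques rather than CircleCI's fix, the example below uses redis-py's cursor-based `scan_iter` with a small `count` hint, so each round trip does a bounded amount of work instead of tying up a shard, and applies a TTL to matching keys; the key pattern, connection details, and runtime limit are all assumptions for illustration.

```python
import redis  # redis-py, assumed available for this sketch

# Hypothetical connection details for illustration only.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def iter_step_keys(pattern="step:*", batch_hint=100):
    """Walk matching keys incrementally with SCAN.

    scan_iter advances a server-side cursor, so each Redis call stays short
    and non-blocking; `count` is a hint for batch size, not a hard limit.
    """
    for key in r.scan_iter(match=pattern, count=batch_hint):
        yield key

# Example: cap step data lifetime at a (hypothetical) maximum job runtime,
# in the spirit of the timeout the postmortem proposes.
MAX_JOB_RUNTIME_SECONDS = 5 * 60 * 60
for key in iter_step_keys():
    r.expire(key, MAX_JOB_RUNTIME_SECONDS)
```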
Description: This incident has been resolved. Some workflows impacted during this incident won't be able to finish; those will need to be rerun.
Status: Resolved
Impact: Major | Started At: Sept. 11, 2024, 10:27 a.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: Sept. 11, 2024, 8:10 a.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: Aug. 21, 2024, 5:54 p.m.
Description: Wait times for macOS jobs have returned to normal. If you have any questions or require additional assistance, please reach out to our customer support team.
Status: Resolved
Impact: Minor | Started At: Aug. 15, 2024, 5:48 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.