Get notified about any outages, downtime or incidents for CircleCI and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for CircleCI.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for CircleCI:
Component | Status |
---|---|
Artifacts | Active |
Billing & Account | Active |
CircleCI Insights | Active |
CircleCI Releases | Active |
CircleCI UI | Active |
CircleCI Webhooks | Active |
Docker Jobs | Active |
Machine Jobs | Active |
macOS Jobs | Active |
Notifications & Status Updates | Active |
Pipelines & Workflows | Active |
Runner | Active |
Windows Jobs | Active |
CircleCI Dependencies | Active |
AWS | Active |
Google Cloud Platform Google Cloud DNS | Active |
Google Cloud Platform Google Cloud Networking | Active |
Google Cloud Platform Google Cloud Storage | Active |
Google Cloud Platform Google Compute Engine | Active |
mailgun API | Active |
mailgun Outbound Delivery | Active |
mailgun SMTP | Active |
OpenAI | Active |
Upstream Services | Active |
Atlassian Bitbucket API | Active |
Atlassian Bitbucket Source downloads | Active |
Atlassian Bitbucket SSH | Active |
Atlassian Bitbucket Webhooks | Active |
Docker Authentication | Active |
Docker Hub | Active |
Docker Registry | Active |
GitHub API Requests | Active |
GitHub Git Operations | Active |
GitHub Packages | Active |
GitHub Pull Requests | Active |
GitHub Webhooks | Active |
GitLab | Active |
View the latest incidents for CircleCI and check for official updates:
Description: Job start times have recovered.
Status: Resolved
Impact: Major | Started At: Oct. 22, 2024, 6 p.m.
Description:

## Summary:

On October 22, 2024, from 14:45 to 15:52 UTC and again from 17:41 to 18:22 UTC, CircleCI customers experienced failures on new job submissions as well as failures on jobs that were in progress. A sudden increase in the number of tasks completing simultaneously, combined with requests to upload artifacts from jobs, overloaded the service responsible for managing job output. On October 28, 2024, from 13:27 to 14:13 and from 14:58 to 15:50, CircleCI customers experienced a recurrence of these effects due to a similar cause.

During these incidents, customers would have experienced their jobs failing to start with an infrastructure failure. Jobs that were already in progress also failed with an infrastructure failure. We want to thank our customers for your patience and understanding as we worked to resolve these incidents.

The original status pages for the incidents on October 22 can be found [here](https://status.circleci.com/incidents/6yjv79g764yc) and [here](https://status.circleci.com/incidents/0crxbhkflndc). The status pages for the incidents on October 28 can be found [here](https://status.circleci.com/incidents/xk37ycndxbhc) and [here](https://status.circleci.com/incidents/8ktdwlsf2lm8).

## What Happened:

(All times UTC)

On October 22, 2024, at 14:45 there was a sudden increase in customer tasks completing at the same time within CircleCI. In order to record each of these task end events, including the amount of storage the task used, the system that manages task state (distributor) made calls to our internal API gateway, which subsequently queried the system responsible for storing job output (output service). At this point, output service became overwhelmed with requests; although some requests were handled successfully, the vast majority were delayed before finally receiving a `499 Client Closed Request` error response.

![](https://global.discourse-cdn.com/circleci/original/3X/2/b/2b68322aaf27124eb5ae63a15bc0f8f2118c3f7b.png)
`Distributor task end calls to the internal API gateway`

Additionally, at 14:50, output service received an influx of artifact upload requests, further straining resources in the service. An incident was officially declared at 14:57. Output service was scaled horizontally at 15:16 to handle the additional load it was receiving. Internal health checks began to recover at 15:25, and we continued to monitor output service until incoming requests returned to normal levels. The incident was resolved at 15:52, and we kept output service horizontally scaled.

At 17:41, output service received another sharp increase in requests to upload artifacts and was unable to keep up with the additional load, causing jobs to fail again. An incident was declared at 17:57. Because output service was still horizontally scaled from the initial incident, it automatically recovered by 18:00. As a proactive measure, we further scaled output service horizontally at 18:02. We continued to monitor our systems until the incident was resolved at 18:22.

Following incident resolution, we continued our investigation and uncovered on October 25 that our internal API gateway was configured with low values for the maximum number of connections allowed to each of the services that experienced increased load on October 22. We immediately increased these values so that the gateway could handle an increased volume of task end events moving forward.

Despite these improvements, on October 28, 2024, at 13:27, customer jobs started to fail in the same way they previously did on October 22. An incident was officially declared at 13:38. By 13:48, the system had automatically recovered without any intervention, and the incident was resolved at 14:13. We continued to investigate the root cause of the delays and failures, but at 14:45 customer jobs started to fail again in the same way. We declared another incident at 14:50.

In order to reduce the load on output service, we removed the retry logic when requesting storage used per task from output service. This allowed tasks to complete even if storage used could not be retrieved (to the customer's benefit). Additionally, we scaled distributor horizontally at 15:19 in order to handle the increased load. At 15:21, our systems began to recover. We continued to monitor and resolved the incident at 15:51.

We returned to our investigation into the root cause of this recurring behavior and discovered that there was an additional client in distributor that was configured with a low value for the maximum number of connections to our internal API gateway. We increased this value at 17:33.

## Future Prevention and Process Improvement:

Following the remediation on October 28, we conducted an audit of **all** of the HTTP clients in the execution environment and proactively raised the connection limits on those that were configured similarly to the ones in the internal API gateway and distributor. Additionally, we identified a gap in observability with these HTTP clients that prevented us from identifying the root cause of these incidents sooner. We immediately added observability to all of the clients in order to enable better alerting if connection pools were to become exhausted again in the future. Illustrative sketches of these client-side changes appear below, after this incident entry.
Status: Postmortem
Impact: Major | Started At: Oct. 22, 2024, 3:02 p.m.
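The postmortem above attributes both recurrences to HTTP clients configured with low maximum-connection limits, plus a lack of visibility into connection-pool exhaustion. The sketch below is not CircleCI's implementation; it is a minimal Go illustration, assuming hypothetical helper names (`newClient`, `do`, `poolStats`), placeholder limit values, and a placeholder URL, of how a client's per-host connection limits can be set explicitly and how `net/http/httptrace` can surface whether requests are reusing pooled connections.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptrace"
	"sync/atomic"
	"time"
)

// poolStats counts how many requests had to dial a brand-new connection
// versus reusing an idle one. A rising share of new connections is a rough
// signal that the per-host limits are too low for the current request volume.
type poolStats struct {
	total    atomic.Int64
	newConns atomic.Int64
}

// newClient builds an *http.Client whose Transport has explicit per-host
// connection limits. The numbers are placeholders, not CircleCI's settings.
func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			MaxConnsPerHost:     256, // hard cap on concurrent connections per host
			MaxIdleConnsPerHost: 64,  // idle connections kept warm for reuse
			IdleConnTimeout:     90 * time.Second,
		},
		Timeout: 10 * time.Second,
	}
}

// do wraps client.Do with an httptrace hook that records whether the
// connection used for this request came from the idle pool.
func do(ctx context.Context, client *http.Client, stats *poolStats, req *http.Request) (*http.Response, error) {
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			stats.total.Add(1)
			if !info.Reused {
				stats.newConns.Add(1)
			}
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(ctx, trace))
	return client.Do(req)
}

func main() {
	stats := &poolStats{}
	client := newClient()

	req, _ := http.NewRequest(http.MethodGet, "https://example.com/", nil)
	if resp, err := do(context.Background(), client, stats, req); err == nil {
		resp.Body.Close()
	}

	fmt.Printf("requests=%d new-connections=%d\n", stats.total.Load(), stats.newConns.Load())
}
```

In Go's `net/http`, `GotConnInfo.Reused` distinguishes pooled connections from newly dialed ones, so a sustained rise in new connections can feed the kind of connection-pool alerting the postmortem describes.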
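The postmortem also notes that, to shed load, the retry logic around the per-task storage-usage lookup was removed so tasks could still complete when the lookup failed. The following sketch shows that graceful-degradation pattern in general terms only; `recordTaskEnd`, `fetchStorageUsed`, and the surrounding types are hypothetical and not CircleCI's code.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// StorageUnknown marks a task whose storage usage could not be retrieved.
// The task-end event is still recorded; usage can be reconciled later.
const StorageUnknown = int64(-1)

var errUpstream = errors.New("output service unavailable")

// fetchStorageUsed stands in for the call to the service that reports how
// much storage a task consumed. In this sketch it always fails.
func fetchStorageUsed(ctx context.Context, taskID string) (int64, error) {
	return 0, errUpstream
}

// recordTaskEnd records the task-end event. Instead of retrying the storage
// lookup (which would add load to an already saturated service), it makes a
// single bounded attempt and degrades to StorageUnknown on failure.
func recordTaskEnd(ctx context.Context, taskID string) {
	lookupCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	used, err := fetchStorageUsed(lookupCtx, taskID)
	if err != nil {
		log.Printf("task %s: storage usage unavailable (%v); completing anyway", taskID, err)
		used = StorageUnknown
	}

	// Persist the task-end event with whatever usage value we have.
	log.Printf("task %s ended, storageUsed=%d", taskID, used)
}

func main() {
	recordTaskEnd(context.Background(), "task-123")
}
```

The design choice here mirrors the remediation described above: completing the task is more valuable to the customer than capturing an exact storage figure, so the lookup failure is logged and tolerated rather than retried.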
Description: No other reports have been received, and we were unable to reproduce the issue. Investigation concluded.
Status: Resolved
Impact: None | Started At: Oct. 21, 2024, 6:19 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: Sept. 27, 2024, 8:45 p.m.
Description: The incident in which a performance problem prevented some pipelines from being created has now been resolved. We appreciate your patience and understanding as we worked through this incident. Please reach out to our support team if you have any further questions or experience any further issues.
Status: Resolved
Impact: Minor | Started At: Sept. 19, 2024, 7:05 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.