Last checked: 8 minutes ago
Get notified about any outages, downtime or incidents for CircleCI and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for CircleCI.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for CircleCI:
Component | Status |
---|---|
Artifacts | Active |
Billing & Account | Active |
CircleCI Insights | Active |
CircleCI Releases | Active |
CircleCI UI | Active |
CircleCI Webhooks | Active |
Docker Jobs | Active |
Machine Jobs | Active |
macOS Jobs | Active |
Notifications & Status Updates | Active |
Pipelines & Workflows | Active |
Runner | Active |
Windows Jobs | Active |
CircleCI Dependencies | Active |
AWS | Active |
Google Cloud Platform Google Cloud DNS | Active |
Google Cloud Platform Google Cloud Networking | Active |
Google Cloud Platform Google Cloud Storage | Active |
Google Cloud Platform Google Compute Engine | Active |
mailgun API | Active |
mailgun Outbound Delivery | Active |
mailgun SMTP | Active |
OpenAI | Active |
Upstream Services | Active |
Atlassian Bitbucket API | Active |
Atlassian Bitbucket Source downloads | Active |
Atlassian Bitbucket SSH | Active |
Atlassian Bitbucket Webhooks | Active |
Docker Authentication | Active |
Docker Hub | Active |
Docker Registry | Active |
GitHub API Requests | Active |
GitHub Git Operations | Active |
GitHub Packages | Active |
GitHub Pull Requests | Active |
GitHub Webhooks | Active |
GitLab | Active |
View the latest incidents for CircleCI and check for official updates:
Description:
## Summary:

On September 11, 2024, from 10:16 to 21:00 UTC, CircleCI customers encountered multiple issues, including delays in starting jobs, slow processing of job outputs and task status messages (such as the completion of steps and tasks), dropped workflows, and a rise in infrastructure failures. These problems collectively affected all jobs during this time frame.

To address these issues, we worked to stabilize the service responsible for ingesting and serving step output and tracking the start and end times of individual steps (**step service**) until 19:00 UTC, at which point it was determined to be in a sustainable state. Despite this progress, delays in starting Mac jobs persisted until 20:30 UTC, largely due to a backlog of jobs waiting to start and a failure to properly garbage collect (GC) old virtual machines (VMs). This combination of factors contributed to a challenging operational environment for CircleCI customers. The original status page can be found [here](https://status.circleci.com/incidents/lsv2ry3jr16c).

## What Happened (All times UTC)

On September 11, 2024, for approximately 10 hours, CircleCI experienced significant service disruptions. The incident began at 10:05, when a particularly potent configuration was executed during an internal test. By 10:14 the job had ended, but efforts to generate test results led to a spike in memory usage, causing Out of Memory (OOM) errors for several internal services. This resulted in failures in processing job submissions and dispatching tasks, which impacted all customer jobs.

By 10:16, job starts across all executors had completely failed, as the service responsible for processing and storing test results, as well as handling storage of job records (**output service**), became overwhelmed and unable to service requests. An official incident was declared at 10:20. We triggered a deployment restart at 10:23, which initially allowed for some recovery before the service was again overwhelmed at approximately 10:27. To address this, we scaled the service both horizontally and vertically, which allowed it to stabilize and customer jobs to start flowing again. Throughout the incident, machine jobs faced specific challenges due to timeouts.

By 13:00, we detected abnormal resource utilization in step service, prompting us to monitor the situation closely. We believed the ongoing issues were related to a thundering herd effect stemming from an earlier incident. Between 14:47 and 15:05, we increased memory for the service processing step output, as we did multiple times throughout this incident, in an ongoing attempt to manage the backlog and prevent OOM kills.

At 16:21, to work through the built-up load from the thundering herd, we raised memory limits in multiple locations so that work could be processed without causing further outages. This marked the beginning of a significant recovery. The existing Redis cluster was under heavy CPU load, prompting a decision at 16:30 to spin up a second Redis cluster to alleviate the pressure.

![](https://global.discourse-cdn.com/circleci/original/3X/c/f/cfbf14664563f62c4d331c0aed80b152ccdc1d5c.png)
`Redis Engine CPU Utilization Impact Timeline`

By 17:00, the job queue began to decrease significantly as the service stabilized. Throughout the afternoon, we continued to monitor and adjust resources, ultimately doubling the number of Redis shards around 18:11, which had an immediate positive effect on reducing load.

During this incident, customers experienced significantly longer response times for API calls from servers running customer workloads reporting back the output of jobs. The 95th percentile (p95) response times spiked to between 5 and 15 seconds from 14:20 to 19:50, compared to the usual expectation of around 100 milliseconds. This led to degraded step output on the jobs page, including delays in displaying the output of customer steps, missing output, and, in some cases, no output at all. These delays likely resulted in slower Task performance, as sending step output to the step receiver took longer, blocking other actions within the Tasks. While the average Task runtime increased, the specific impact varied depending on the Task's contents.

![](https://global.discourse-cdn.com/circleci/original/3X/c/6/c600d16b22a0f27de0827edca9d4c55a2e56ea25.png)
`Task Wait Time`

_Linux:_

* **12:10 - 13:05:** Wait times under 1 minute.
* **13:40 - 15:30:** Degraded wait times, generally under 5 minutes.
* **15:30 - 18:05:** Wait times increased to tens of minutes, with some recovery starting around 17:45.
* **18:05 - 20:00:** Continued degraded wait times of 2-3 minutes.
* **20:00:** Fully recovered.

_Windows:_

* **12:10 - 15:40:** Degraded wait times, typically under 5 minutes.
* **15:40 - 17:15:** Wait times reached tens of minutes.
* **17:15 - 19:35:** Returned to degraded wait times, usually under 5 minutes.
* **19:35 - 19:55:** Wait times again increased to tens of minutes.
* **19:55:** Fully recovered.

_Mac OS:_

* **12:10 - 15:30:** Degraded wait times, generally under 5 minutes.
* **15:30 - 21:00:** Wait times escalated to tens of minutes.
* **21:00:** Fully recovered.

## Future Prevention and Process Improvement:

In response to this incident, we are implementing several key improvements to enhance service reliability. First, we will improve how tasks are cleared during infrastructure failures, which will help streamline operations. We will add guardrails to the system to prevent execution of pathological workloads. Additionally, we will implement a mechanism that allows us to temporarily prevent jobs that have failed due to infrastructure issues from being retried.

We will adjust Redis health checks by moving them from liveness probes to readiness probes, and we plan to increase the number of Redis shards to better distribute load and minimize the impact of a single shard being blocked. During our investigation we identified a self-reinforcing cycle of poorly performing Redis commands (scans) that was the root cause of the Redis failure, and we're going to address this as well. To enhance stability, we will introduce a timeout for step data based on the job's maximum runtime and reduce the pressure a single job can place on the S3 connection pool. We are also looking to develop a method to pause live deployments from the CircleCI app during incidents, ensuring that delayed changes do not overwrite manual adjustments made in the interim.
Status: Postmortem
Impact: Major | Started At: Sept. 11, 2024, 2:29 p.m.
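The postmortem above attributes part of the sustained load to a thundering herd of built-up work hitting the recovering step service all at once. Purely as an illustration of the general mitigation (not CircleCI's actual code), here is a minimal Python sketch of retrying with exponential backoff and full jitter, which spreads queued-up clients' retries over time instead of letting them stampede a recovering service; `send_step_output` is a hypothetical placeholder for any call to an overloaded upstream.

```python
import random
import time

def call_with_backoff(operation, max_attempts=6, base_delay=0.5, max_delay=30.0):
    """Retry `operation` with exponential backoff and full jitter.

    Randomizing each retry over a growing window keeps a crowd of waiting
    clients from all hammering a recovering service at the same instant
    (the "thundering herd" effect described in the postmortem).
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount between 0 and the capped
            # exponential backoff for this attempt.
            window = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, window))

# Hypothetical usage, where `send_step_output` stands in for a call to the
# step-output ingestion service:
# call_with_backoff(lambda: send_step_output(job_id, chunk))
```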
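The Future Prevention section also calls out a self-reinforcing cycle of poorly performing Redis scans and proposes a timeout for step data tied to maximum job runtime. As a sketch of those general techniques rather than CircleCI's fix, the example below uses redis-py's cursor-based `scan_iter` with a small `count` hint, so each round trip does a bounded amount of work instead of tying up a shard, and applies a TTL to matching keys; the key pattern, connection details, and runtime limit are all assumptions for illustration.

```python
import redis  # redis-py, assumed available for this sketch

# Hypothetical connection details for illustration only.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def iter_step_keys(pattern="step:*", batch_hint=100):
    """Walk matching keys incrementally with SCAN.

    scan_iter advances a server-side cursor, so each Redis call stays short
    and non-blocking; `count` is a hint for batch size, not a hard limit.
    """
    for key in r.scan_iter(match=pattern, count=batch_hint):
        yield key

# Example: cap step data lifetime at a (hypothetical) maximum job runtime,
# in the spirit of the timeout the postmortem proposes.
MAX_JOB_RUNTIME_SECONDS = 5 * 60 * 60
for key in iter_step_keys():
    r.expire(key, MAX_JOB_RUNTIME_SECONDS)
```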
Description: This incident has been resolved. Some workflows impacted during this incident won't be able to finish; those will need to be rerun.
Status: Resolved
Impact: Major | Started At: Sept. 11, 2024, 10:27 a.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: Sept. 11, 2024, 8:10 a.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: Aug. 21, 2024, 5:54 p.m.
Description: Wait times for macOS jobs have returned to normal. If you have any questions or require additional assistance, please reach out to our customer support team.
Status: Resolved
Impact: Minor | Started At: Aug. 15, 2024, 5:48 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.