Last checked: 3 minutes ago
Get notified about any outages, downtime or incidents for CircleCI and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for CircleCI.
OutLogger tracks the status of these components for CircleCI:
Component | Status |
---|---|
Artifacts | Active |
Billing & Account | Active |
CircleCI Insights | Active |
CircleCI Releases | Active |
CircleCI UI | Active |
CircleCI Webhooks | Active |
Docker Jobs | Active |
Machine Jobs | Active |
macOS Jobs | Active |
Notifications & Status Updates | Active |
Pipelines & Workflows | Active |
Runner | Active |
Windows Jobs | Active |
CircleCI Dependencies | Active |
AWS | Active |
Google Cloud Platform Google Cloud DNS | Active |
Google Cloud Platform Google Cloud Networking | Active |
Google Cloud Platform Google Cloud Storage | Active |
Google Cloud Platform Google Compute Engine | Active |
mailgun API | Active |
mailgun Outbound Delivery | Active |
mailgun SMTP | Active |
OpenAI | Active |
Upstream Services | Active |
Atlassian Bitbucket API | Active |
Atlassian Bitbucket Source downloads | Active |
Atlassian Bitbucket SSH | Active |
Atlassian Bitbucket Webhooks | Active |
Docker Authentication | Active |
Docker Hub | Active |
Docker Registry | Active |
GitHub API Requests | Active |
GitHub Git Operations | Active |
GitHub Packages | Active |
GitHub Pull Requests | Active |
GitHub Webhooks | Active |
GitLab | Active |
View the latest incidents for CircleCI and check for official updates:
Description: This incident has now been resolved.
Status: Resolved
Impact: Minor | Started At: June 28, 2024, 3:07 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: June 24, 2024, 9:16 p.m.
Description:

## Summary

On June 21, 2024, code changes were deployed to two services at 16:19 UTC, causing CircleCI customers with actively running workflows to experience errors in the UI when trying to view projects or workflows. This was isolated to the UI and had no impact on builds. Both deployments were reverted by 16:27 UTC, but customers who had started workflows between 16:19 and 16:27 continued to see errors until those workflows completed or until an additional change was deployed at 18:27 UTC.

We thank our customers for their patience and understanding as we worked to resolve this incident.

## What Happened

All timestamps are UTC.

Code changes were deployed to two related services at 16:19. They were intended to allow information about actively running workflows to be processed and provided to the UI by the same service that provides information about completed workflows. These changes were made as part of an ongoing effort to improve reliability and performance. The changes involved deployments to a service that processes workflow events, and to the API service that serves that information to the UI, which had been tested via unit and integration tests. The change to process the active running workflow events was thought to have been disabled via a feature flag.

At 16:20, the API service deployment failed and was rolled back automatically, and we began to see data type errors in that service as well as in two services related to the UI. We rolled back the deployment of the service that processes the workflow events at 16:27, but the errors continued. We first ensured that the rollouts were reverted correctly and that the downstream services related to the UI were not using cached data.

At 17:09, we discovered that the feature flag to disable the new event processing had been misconfigured. This led to the processing of events between the deployment at 16:19 and the rollback at 16:27.

At 17:14, we identified that the data models had not been fully updated to handle the active workflow data in a backwards-compatible way. The system was attempting to serve data that was not compatible with the API or database spec for the actively running workflows that had been processed. While tests had been written for the initial code change, this particular change in the data had been overlooked. Due to the size of the change, it was also overlooked during code review.

A PR was created to fix the issue in the API service at 17:18, but there were some issues with several tests that needed to be addressed before merging. This took longer than expected due to test complexity. The change was deployed once all tests were passing at 18:27, resolving the issue. We continued to monitor until 19:40 and then declared the incident resolved.

## Future Prevention and Process Improvement

We have updated the service and the tests to properly account for the differences in data between actively running workflows and completed workflows. It was identified during the incident that the tests for these services were overly complex, which added to the time it took to fully resolve the incident, so we are also prioritizing improvements to the tests.

While the API service is now updated to handle the events appropriately, the team responsible for providing the workflow events will also be making changes so that the events for actively running workflows include data for the same fields that completed workflows have.

We’ve acknowledged that deploying the read and write changes simultaneously created more complexity and added to the size of the pull request for review. We are updating our process to deploy read changes separately from write changes, and to ensure that feature flag functionality is fully documented prior to deployment.

We also identified that continuing to focus on fixing the tests during the incident added significant delay to mitigating the issue. We intend to share these learnings across the organization to encourage incident commanders and responding engineers to be focused on mitigation, and potentially timeboxing solutions if other options exist.
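The two failure modes named in the postmortem (a misconfigured feature flag, and a data model that could not represent still-running workflows) can be illustrated with a minimal sketch. This is not CircleCI's code: the flag name `process-active-workflows`, the event fields (`id`, `status`, `stopped_at`, `duration_seconds`), and every function and class name below are hypothetical, chosen only to show a fail-closed flag check and a reader that tolerates fields missing from active-workflow events.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class WorkflowView:
    """Hypothetical shape an API layer might hand to the UI."""
    workflow_id: str
    status: str
    # Assumed to be absent while a workflow is still running,
    # so both fields are optional rather than required.
    stopped_at: Optional[str] = None
    duration_seconds: Optional[int] = None


def flag_enabled(flags: Mapping[str, Any], name: str) -> bool:
    """Fail closed: a missing or malformed flag value counts as disabled."""
    return flags.get(name) is True


def parse_workflow_event(event: Mapping[str, Any]) -> WorkflowView:
    """Build a view from either a completed or a still-running workflow event."""
    return WorkflowView(
        workflow_id=str(event["id"]),
        status=str(event.get("status", "unknown")),
        stopped_at=event.get("stopped_at"),
        duration_seconds=event.get("duration_seconds"),
    )


def handle_event(
    event: Mapping[str, Any], flags: Mapping[str, Any]
) -> Optional[WorkflowView]:
    """Skip running-workflow events unless the (hypothetical) flag is on."""
    is_running = event.get("status") == "running"
    if is_running and not flag_enabled(flags, "process-active-workflows"):
        return None
    return parse_workflow_event(event)
```

In this sketch a flag set to the string `"true"` rather than the boolean `True` leaves the new code path off, which is one way to keep a configuration mistake from silently enabling processing. Making the completion-only fields optional is one possible reading of "backwards-compatible" here; the postmortem does not describe the actual fix in that detail.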
Status: Postmortem
Impact: Major | Started At: June 21, 2024, 4:54 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Critical | Started At: June 20, 2024, 10:28 p.m.
Description: This incident has been resolved. Please reach out to our customer support engineering team if you require assistance.
Status: Resolved
Impact: Minor | Started At: June 12, 2024, 6:27 p.m.