Outage and incident data over the last 30 days for Harness.
OutLogger tracks the status of these components for Harness:
Component | Status |
--- | --- |
Service Reliability Management - Error Tracking FirstGen (fka OverOps) | Active |
Software Engineering Insights FirstGen (fka Propelo) | Active |
Prod 1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 2 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 3 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 4 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Prod Eu1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
View the latest incidents for Harness and check for official updates:
Description:
## Overview
Mac builds started failing during the initialize step due to a timeout when connecting to the Mac cloud provider.
## Timeline (PST)
| **Time** | **Event** |
| --- | --- |
| 12:08 PM IST, Oct 16, 2023 | Engineering received a report from a customer via Zendesk that Mac builds were failing. |
| 12:35 PM IST, Oct 16, 2023 | Internal FireHydrant incident created to check the failed pipelines for customers. |
| 12:40 PM IST, Oct 16, 2023 | Bounced the dlite deployment with config changes to remove the Mac pool in the Prod2 environment. |
## Resolution
Bounced the dlite deployment with config changes to remove the Mac pool in the impacted environment.
## Affected Users
Customers in the Prod2 environment using hosted Mac builds.
## RCA
As part of a prior incident we added a set of additional NAT IPs. These NAT IPs were not whitelisted in another project that is used to initialize the Mac VMs. The whitelisting was only required for the project used to initialize Mac builds, because we use a different cloud provider for Mac than for Linux/Windows builds. (A hedged allowlist-check sketch follows this entry.)
Status: Postmortem
Impact: Minor | Started At: Oct. 16, 2023, 4:48 p.m.
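The RCA above reduces to a set-membership check: every NAT egress IP has to be present in the allowlist of the project that initializes the Mac VMs. Below is a minimal TypeScript sketch of a preventive check along those lines; the function name, variable names, and IP addresses are illustrative assumptions, not Harness's actual tooling.

```typescript
// Hypothetical preventive check (not Harness tooling): verify that every NAT
// egress IP appears in the allowlist of the project that initializes Mac VMs.
function findMissingAllowlistEntries(natIps: string[], allowlist: string[]): string[] {
  const allowed = new Set(allowlist);
  return natIps.filter((ip) => !allowed.has(ip));
}

// Illustrative inputs; real values would come from cloud provider APIs or config.
const natIps = ["203.0.113.10", "203.0.113.11"];   // NAT IPs added during the prior incident
const macProjectAllowlist = ["203.0.113.10"];      // allowlist of the Mac provider project
const missing = findMissingAllowlistEntries(natIps, macProjectAllowlist);
if (missing.length > 0) {
  throw new Error(`NAT IPs missing from the Mac provider allowlist: ${missing.join(", ")}`);
}
```

Wired into the change process for NAT IPs, a check like this could surface the mismatch before Mac builds start timing out.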
Description:
## Overview
A few customers using the production environment (Prod2) reported a "401 - Failed to fetch" error when attempting to access the Harness User Interface (UI). Notably, these customers could successfully log in and access the Harness platform from an incognito window.
## Timeline (PST)
| **Time** | **Event** |
| --- | --- |
| 7:02 AM | Incident reported by customers |
| 7:10 AM | The team rolled back the recent deployment in the Prod2 environment, which resolved the incident |
| 7:11 AM | Monitoring |
| 7:41 AM | The issue was confirmed as resolved |
## Resolution
We initiated a rollback, reverting the deployment from 810xx to 809xx in the Prod2 environment.
## Affected Users
Users in Prod2 whose session tokens had expired over the weekend.
## RCA
Users encountered the "Failed to fetch: 401" error because their session tokens had expired, leading to a 401 Unauthorized response from the Gateway. Normally this would redirect users to the login page, but they remained on the same page because the 401 response was not handled by the UI after the recent deployment. We mitigated the incident by rolling back to the previously deployed version in the Prod2 environment.
## Action Items
* Ensure that reverts are isolated and not combined with additional changes in the same Pull Request (PR), to prevent similar issues.
* Enhance our UI automation with a critical test case confirming that all 401 errors consistently redirect users to the login page. (A generic redirect-on-401 sketch follows this entry.)
Status: Postmortem
Impact: Minor | Started At: Oct. 16, 2023, 2:02 p.m.
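The second action item is easy to picture in code: any API response with status 401 should push the user to the login page rather than leave them on a broken view. The TypeScript sketch below is a generic illustration of that behaviour, not Harness's UI code; the login route and the session-storage key are assumptions.

```typescript
// Generic sketch: treat a 401 from any API call as an expired session and send
// the user to the login page. The "/auth/#/signin" route and the "token" key in
// sessionStorage are hypothetical placeholders.
async function apiFetch(input: RequestInfo, init?: RequestInit): Promise<Response> {
  const response = await fetch(input, init);
  if (response.status === 401) {
    sessionStorage.removeItem("token");        // drop the expired session token
    window.location.href = "/auth/#/signin";   // hypothetical login route
    throw new Error("Session expired; redirecting to login");
  }
  return response;
}
```

The incognito-window workaround reported by customers fits this picture: a fresh browser profile carries no expired token, so no 401 is returned in the first place.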
Description:
## Overview
The TI service received bursts of cleanup calls at two different times, and the cleanup procedure created a heavy load on the TimescaleDB instance used by the TI service. Timescale CPU was running at 100%, so pings to Timescale (the TI service readiness probe) were failing. Since the readiness probe was failing, Kubernetes marked the service pods unavailable.
## Impact
CI pipeline executions that upload test reports and were started during the windows below may have been impacted.
Times: 2:15 AM - 2:50 AM and 4:45 AM - 5:15 AM PT
## Resolution
Cleanup requests that were queued completed, and the system returned to an operational state.
## Timeline
| **Time** | **Event** | **Notes** |
| --- | --- | --- |
| 10/16/2023 02:16 AM | TI service received burst cleanup requests | |
| 4:44 AM | TI service received burst cleanup requests | |
| 6:28 AM | TI service 503 errors reported | |
| 7:45 AM | Discovered that Timescale was running at 100% CPU | |
| 8:00 AM | Suspected that cleanup jobs were hogging the Timescale CPU; the Platform team confirmed that the data-deletion job had run due to a bug | Still going through TI service logs for other suspects |
| 8:30 AM | Correlated and confirmed that the cleanup led to high Timescale CPU | Both readiness-probe failure windows coincided with the cleanup job requests |
| 9:00 AM | Incident was resolved | Action items were discussed for Engineering |
## RCA
Timescale was not responding to pings, which is the readiness probe for the TI service. Since the readiness probe was unresponsive, Kubernetes marked the service pods unavailable. TimescaleDB was running at 100% CPU due to the high number of procedure calls made from the TI service for periodic cleanup. The DB was already under high utilization from production workload, since all test reports are stored in it. The CG Manager code, which runs multiple threads in parallel for periodic cleanup, ran on a weekday (it is supposed to run on weekends) due to a bug that sends cleanup events to all services. CI Manager picked up these events and sent burst cleanup API calls to the TI service. (A minimal readiness-probe sketch follows this entry.)
Status: Postmortem
Impact: Minor | Started At: Oct. 16, 2023, 9:15 a.m.
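The failure chain here is worth making concrete: the readiness probe was effectively a database ping, so when Timescale hit 100% CPU the ping timed out and Kubernetes pulled the pods out of rotation. The sketch below shows that style of readiness endpoint using Express and node-postgres purely for illustration; the real TI service's stack, the `/ready` path, and the `TIMESCALE_URL` variable are assumptions.

```typescript
import express from "express";
import { Pool } from "pg";

// Hypothetical connection string; the real TI service configuration is not known here.
const pool = new Pool({ connectionString: process.env.TIMESCALE_URL });
const app = express();

// Readiness endpoint: report ready only if the database answers a trivial query
// within a short deadline. When the database is pinned at 100% CPU the query
// times out, Kubernetes marks the pod NotReady, and it stops receiving traffic,
// which is the chain of events described in this incident.
app.get("/ready", async (_req, res) => {
  try {
    await Promise.race([
      pool.query("SELECT 1"),
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error("db ping timed out")), 2000),
      ),
    ]);
    res.status(200).send("ok");
  } catch {
    res.status(503).send("database unreachable");
  }
});

app.listen(3000);
```

One takeaway from the RCA is that coupling readiness to a shared, load-sensitive dependency turns database pressure into an availability outage; keeping the probe shallow and reporting dependency health separately is a common alternative.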
Description: This is the second occurrence of the incident tracked at [https://status.harness.io/incidents/jbvd0pd0qd2m](https://status.harness.io/incidents/jbvd0pd0qd2m); the RCA is available at that link.
Status: Postmortem
Impact: Minor | Started At: Oct. 12, 2023, 7:11 p.m.
Description:
## Overview
We received a report from one of our customers about issues with their pipeline executions in our Prod-2 cluster. Details about the incident are below.
## Timeline (PST)
| **Time** | **Event** |
| --- | --- |
| 7:49 AM | A customer reported issues with their pipeline executions |
| 8:45 AM | Engineering got engaged in the investigation |
| 9:40 AM | Issue identified and mitigated (details below) |
## Resolution
We deleted the expired Delegate tasks from the database to unblock the customers.
## Affected accounts
A total of seven customers were impacted by this incident. We will record 60 minutes of partial downtime for our CD, CDNG, STO and CIE - Self-Hosted Runners components.
## RCA
We identified a problem with a background job that periodically cleans up expired tasks. The job hit a latent bug where it kept iterating over the same set of tasks and could not delete them. The issue surfaced because of increased database query latency following a scheduled database upgrade. We mitigated the incident by manually cleaning up the expired tasks from the database.
## Action Items
* Enhance our alerting for this scenario so we can catch such issues early.
* Improve the background cleanup job to be resilient to DB latencies. (A hedged batching sketch follows this entry.)
Status: Postmortem
Impact: Minor | Started At: Oct. 11, 2023, 3:08 p.m.
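The latent bug described above, a cleanup job that keeps iterating over the same set of expired tasks without deleting them, is a classic batching pitfall. The TypeScript sketch below shows one way such a job can be made to guarantee forward progress and tolerate slow queries; the `delegate_tasks` table, `expires_at` column, and connection setup are made up for illustration and are not the Harness schema.

```typescript
import { Pool } from "pg";

// Hypothetical connection; not the Harness production database.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Delete expired tasks in bounded batches. Each iteration deletes rows rather
// than re-reading the same candidates, so the job always makes forward progress
// even when individual statements are slow after a database upgrade.
async function cleanupExpiredTasks(batchSize = 500): Promise<number> {
  let totalDeleted = 0;
  for (;;) {
    const result = await pool.query(
      `DELETE FROM delegate_tasks
         WHERE id IN (
           SELECT id FROM delegate_tasks
            WHERE expires_at < now()
            ORDER BY expires_at
            LIMIT $1
            FOR UPDATE SKIP LOCKED
         )`,
      [batchSize],
    );
    const deleted = result.rowCount ?? 0;
    totalDeleted += deleted;
    if (deleted < batchSize) break;              // nothing (or little) left to clean up
    await new Promise((r) => setTimeout(r, 100)); // brief pause to limit database pressure
  }
  return totalDeleted;
}
```

The bounded batch size and the `FOR UPDATE SKIP LOCKED` subquery are the pieces that keep the job from re-reading rows another worker already holds or from piling load onto a database that is already slow.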