Last checked: 4 minutes ago
Get notified about any outages, downtime, or incidents for Harness and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for Harness.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for Harness:
Component | Status |
---|---|
Service Reliability Management - Error Tracking FirstGen (fka OverOps) | Active |
Software Engineering Insights FirstGen (fka Propelo) | Active |
Prod 1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 2 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 3 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 4 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Prod Eu1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
View the latest incidents for Harness and check for official updates:
Description: **Summary:**
* On December 22nd, a load test triggered by the Harness Performance Team caused a full outage of [app.harness.io](http://app.harness.io/) for approximately 9 minutes (12:36 PM UTC to 12:45 PM UTC).
* This impacted customers in the Prod1 and Prod2 clusters.
* The Prod3 cluster was unaffected because it runs a separate instance of the affected component.

**Impact:**
* Customers in the Prod1 and Prod2 clusters were unable to access [app.harness.io](http://app.harness.io/) for 9 minutes.

**Root Cause:**
* During the high volume of traffic from the load test, the component ("Kubernetes Ingress Controller") responsible for managing incoming requests and routing them to the correct internal services became overloaded.
* This caused the ingress controller to become unhealthy, leading to the outage.

**Resolution:**
* The system recovered automatically without manual intervention.

**Action Items:**
* **Resource Scaling:** We are exploring options to automatically scale the ingress controller based on demand so that high traffic volumes are handled more effectively.

We understand the importance of a reliable platform for your operations and sincerely apologize for any inconvenience caused by this incident. Our team is dedicated to the continued improvement of the Harness platform's performance and reliability. We appreciate your trust and remain committed to providing you with a seamless experience.
Status: Postmortem
Impact: Major | Started At: Dec. 22, 2023, 12:30 p.m.
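The action item in this postmortem points at demand-based autoscaling for the ingress controller. As a rough illustration only (not Harness's actual configuration), the sketch below uses the official Kubernetes Python client and the autoscaling/v2 API to attach a HorizontalPodAutoscaler to a hypothetical ingress-controller Deployment; the deployment name, namespace, replica bounds, and CPU threshold are all assumptions.

```python
# Sketch: demand-based scaling for an ingress controller, assuming the
# "kubernetes" Python client with autoscaling/v2 support.
# Deployment name, namespace, and thresholds are illustrative, not Harness's.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(
        name="ingress-controller-hpa", namespace="ingress"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment",
            name="ingress-controller"),          # hypothetical Deployment name
        min_replicas=3,
        max_replicas=12,
        metrics=[client.V2MetricSpec(
            type="Resource",
            resource=client.V2ResourceMetricSource(
                name="cpu",
                target=client.V2MetricTarget(
                    type="Utilization", average_utilization=60)))]))

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="ingress", body=hpa)
```

With a rule like this in place, a traffic spike comparable to the load test would add replicas instead of saturating a fixed-size controller.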
Description: # Summary
On December 21st at 1:10 PM PST, we received a report from two of our customers about issues with their pipeline executions in our Prod2 cluster for their CI pipelines. A FireHydrant incident was triggered for the same shortly afterwards.

## Timeline (PST)
| **Time** | **Event** |
| --- | --- |
| 1:24 PM | Confirmed that no pipeline-service deployment had been done and that the issue was observed only for a few Prod2 and Prod1 CI customers. |
| 1:28 PM | Verified that CI automation was running fine, but we were able to reproduce the issue. |
| 2:04 PM | The Prod2 CI service was rolled back to the previous version, and customers confirmed the issue was mitigated. |

# Resolution
We rolled back the CI build in the Prod2 cluster to unblock the customers.

# Total Downtime
* Downtime taken: none
* Resolution time*: 1 hour 46 minutes
* Resolution time = time reported to time restored, either through rollback or hotfix

# RCA
There was a change in a common deserializer: we added handling so that if a value is a string containing a JSON list (for example, `"[1,2]"`), it is converted to a list of strings regardless of whether the field expects a plain string, which caused an exception during execution. This was mainly observed for a customer who had such a value set in their envVariables in a RunStep in the CI stage.

# Action Items
1. Update our customer setup automation to include this setup, as well as any others, so that our suite stays up to date and existing customer setups are not impacted by new feature development.
2. Add failover to code paths when making changes to existing flows, to minimize the impact of new feature/bug development on existing running setups.
Status: Postmortem
Impact: Major | Started At: Dec. 21, 2023, 10:05 p.m.
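The RCA above describes a deserializer that converted any JSON-list-shaped string regardless of the declared field type. Below is a minimal Python sketch of that failure mode and of a type-aware fix; the function names are illustrative, not Harness code.

```python
import json

def deserialize_permissive(raw: str, expected_type: type):
    """Buggy behaviour from the RCA: any string that looks like a JSON list
    is converted, even when the field is declared as a plain string."""
    if raw.startswith("[") and raw.endswith("]"):
        return json.loads(raw)   # "[1,2]" -> a list, wrong type for str fields
    return raw

def deserialize_type_aware(raw: str, expected_type: type):
    """Fix: only convert when the target field actually expects a list."""
    if expected_type is list and raw.startswith("[") and raw.endswith("]"):
        return json.loads(raw)
    return raw

# A RunStep env variable declared as a string but holding "[1,2]":
assert deserialize_permissive("[1,2]", str) == [1, 2]      # type mismatch -> failure downstream
assert deserialize_type_aware("[1,2]", str) == "[1,2]"     # declared type respected
assert deserialize_type_aware("[1,2]", list) == [1, 2]
```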
Description: The problem was primarily with the image being used; we have notified customers to use a stable version of DinD (Docker-in-Docker) to prevent this issue: https://github.com/docker-library/docker/commit/4c2674df4f40c965cdb8ccc77b8ce9dbc247a6c9
Status: Resolved
Impact: Minor | Started At: Dec. 18, 2023, 6:22 p.m.
Description: # Incident
On December 11th, starting at 4:25 PM (all times UTC), Harness had a service outage that affected pipelines in our Prod2 environment. Specifically, CI and CD pipeline executions in NextGen which used secrets failed. The incident was resolved on December 11th at 4:50 PM. This incident is related to the [incident](https://status.harness.io/incidents/w2w7btby70xs) from last week.

# Timeline
| **Time** | **Event** |
| --- | --- |
| Dec 11, 4:25 PM | Harness detected that pipelines were failing to resolve secrets. |
| Dec 11, 4:28 PM | The incident was acknowledged and a P0 incident was called. |
| Dec 11, 4:35 PM | Root cause identified. |
| Dec 11, 4:50 PM | Incident resolved. |

# Root Cause
## Background
Harness uses connectors to external secret managers (e.g. Google Secret Manager or HashiCorp Vault) to resolve/store secrets used by pipelines and elsewhere in the Harness platform. External secret manager connectors require configuration, including a means to authenticate to the external secret manager. On 2023-12-07, there was an incident where a bad secret manager configuration was leading to thread exhaustion. To mitigate that incident, we updated the faulty configuration in the database and restarted the affected services. The present incident was a downstream effect of that earlier incident.

# Mitigation and Remediation
* In the prior incident, we manually updated the config that controlled the broken secret manager connector. In this cleanup, we unintentionally left a dangling database entry. Had we updated the connector via the API, this entry would have been cleaned up correctly.
* After discovery, we deleted the secret through the API and restarted the affected services.

# Followup/Action Items
* On Friday, we rolled out a hotfix to prevent the creation of such faulty configurations. However, it did not help in this case since this was an existing configuration.
* Additional runtime validation was already in the works, which detects the self-reference when a secret is used in pipeline execution. It has since been rolled out.
Status: Postmortem
Impact: Major | Started At: Dec. 11, 2023, 4:42 p.m.
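The follow-up above mentions runtime validation that detects a self-reference when a secret is resolved during pipeline execution. Here is a minimal sketch of that idea, with hypothetical names rather than the Harness implementation: track which secret managers are already on the resolution path and fail fast instead of recursing.

```python
# Sketch: fail fast on circular secret-manager references during resolution,
# instead of recursing until the resolution thread pool is exhausted.
# All names and data structures are hypothetical, not Harness's actual code.

class CircularSecretReference(Exception):
    pass

def resolve_secret(secret_ref: str,
                   secret_to_manager: dict[str, str],
                   manager_auth_secret: dict[str, str | None],
                   path: tuple[str, ...] = ()) -> str:
    """Resolve secret_ref, resolving the owning manager's auth secret first."""
    manager = secret_to_manager[secret_ref]
    if manager in path:
        raise CircularSecretReference(
            f"secret {secret_ref!r} requires manager {manager!r}, "
            f"which is already being resolved (path: {' -> '.join(path)})")
    auth_ref = manager_auth_secret.get(manager)
    if auth_ref is not None:
        resolve_secret(auth_ref, secret_to_manager, manager_auth_secret,
                       path + (manager,))
    return f"<value of {secret_ref} from {manager}>"

# The configuration from the incident: the manager's own auth secret is stored in itself.
try:
    resolve_secret("db-password",
                   {"db-password": "vault-prod", "vault-token": "vault-prod"},
                   {"vault-prod": "vault-token"})
except CircularSecretReference as err:
    print(err)
```

Detecting the cycle at resolution time keeps a single bad configuration from consuming the shared thread pool that all tenants depend on.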
Description: # Incident
On December 7th, starting around 9 PM (all times UTC), Harness experienced an outage that affected pipelines in our Prod2 environment. Specifically, CI and CD pipelines in NextGen which used secrets were failing during execution. There was also intermittent downtime for FirstGen pipelines during the service restart events. The incident was resolved on December 8th at 3:01 AM.

# Timeline
| **Time** | **Event** |
| --- | --- |
| Dec 7, 9:13 PM | First customer reported the issue. Triaged as likely a result of a separate ongoing incident. |
| Dec 7, 10:34 PM | Incident acknowledged as independent of the separate incident, and the incident was called. |
| Dec 8, 2:13 AM | Root cause identified. |
| Dec 8, 3:01 AM | Incident resolved. |

# Response
Performance degradation and execution failure issues were reported across Continuous Integration (CI) and Continuous Deployment (CD) pipelines starting at 9:13 PM on December 7th. A high-severity incident was declared at 10:30 PM.

# Root Cause
## Background
Harness uses connectors to external secret managers (e.g. Google Secret Manager or HashiCorp Vault) to resolve/store secrets used by pipelines and elsewhere in the Harness platform. External secret manager connectors require configuration, including a means to authenticate to the external secret manager.

## Sequence of Events
* A customer configured their secret manager connector to use a secret stored in the same secret manager. The issue was not apparent at the time of the change to either Harness or the user, because validation rules did not catch it.
* Several hours later, a pipeline was run that referenced a secret contained in that secret manager.
* The pipeline execution tried to resolve the secret. Secret resolution created a recursive loop, filling the thread pool devoted to secret resolution.
* Thread pool exhaustion stalled secret resolution across the environment. End users experienced this stall as pipeline failures, because failed secret resolution fails a pipeline.

# Mitigation and Remediation
Mitigation consisted of:
1. Updating the faulty configuration to break the self-dependency.
2. Aborting the affected in-flight pipeline executions.
3. Scaling all replicas of the service which manages secret resolution to zero to stop the job from being picked back up by the scheduler. Note that redeploying or restarting the service did not fix the issue, because any surviving replica would instantly poison the others.

A hotfix has been released to ensure configuration validation includes checking for self-reference.

# Followup/Action Items
* Improve fault isolation and layering between services in a way that makes causal issues easier to detect.
* Our observability systems were operational and functioning normally; however, they were not configured to alert on this type of issue. We will be implementing two classes of fixes across the platform:
  1. Log-volume-based alerting. Although this would not have identified the specific issue sooner, it would have decreased time to detection.
  2. Closing the loop between observability metrics and alerting thresholds. As metrics are added, alerting thresholds need to be configured at the same time and adjusted as needed, rather than creating metrics and configuring alerting in a separate workstream. An alert on thread pool size would have greatly reduced the incident resolution time.
* Our incident response playbooks include triage steps for individual modules and steps for fault isolation at the platform level, but they did not fully cover the scope of actions needed to isolate this issue. We will enhance our playbooks to provide additional depth for platform-level triage.

We understand that the Harness platform is mission critical for our customers. We are committed to living up to our promise of reliability and availability. We are determined to learn from this incident and make the necessary improvements to meet our shared world-class standards. Your trust is of utmost importance, and we appreciate your understanding.
Status: Postmortem
Impact: Major | Started At: Dec. 7, 2023, 10:34 p.m.
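The hotfix described in this postmortem adds configuration-time validation so a secret manager connector cannot authenticate with a secret stored in itself, or in a chain of connectors that loops back to it. A small Python sketch of such a check follows; the data model and names are hypothetical, not Harness's.

```python
# Sketch of configuration-time validation for the self-reference described in
# this postmortem. auth_parent maps each secret-manager connector to the
# connector that stores its authentication secret (None = built-in manager).
# Names are illustrative, not Harness's actual data model.

def validate_no_self_reference(connector_id: str,
                               auth_parent: dict[str, str | None]) -> None:
    seen = {connector_id}
    current = auth_parent.get(connector_id)
    while current is not None:
        if current in seen:
            raise ValueError(
                f"connector {connector_id!r}: its authentication secret "
                f"resolution loops back through {current!r}")
        seen.add(current)
        current = auth_parent.get(current)

validate_no_self_reference("gcp-sm", {"gcp-sm": None})   # ok: built-in manager
try:
    validate_no_self_reference("vault-prod", {"vault-prod": "vault-prod"})
except ValueError as err:
    print(err)                                            # direct self-reference rejected
```

Rejecting the configuration at save time complements the runtime check from the related December 11th incident: the invalid connector never reaches pipeline execution in the first place.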
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.