Last checked: 4 minutes ago
Get notified about any outages, downtime, or incidents for Harness and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for Harness.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for Harness:
Component | Status |
---|---|
Service Reliability Management - Error Tracking FirstGen (fka OverOps) | Active |
Software Engineering Insights FirstGen (fka Propelo) | Active |
Prod 1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 2 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 3 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 4 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Prod Eu1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
View the latest incidents for Harness and check for official updates:
Description: **Summary:**
* On December 22nd, a load test triggered by the Harness Performance Team caused a full outage of [app.harness.io](http://app.harness.io/) for approximately 9 minutes (12:36 PM UTC to 12:45 PM UTC).
* This impacted customers in the Prod1 and Prod2 clusters.
* The Prod3 cluster was unaffected because it runs a separate instance of the affected component.

**Impact:**
* Customers in the Prod1 and Prod2 clusters were unable to access [app.harness.io](http://app.harness.io/) for 9 minutes.

**Root Cause:**
* During the high volume of traffic from the load test, the component ("Kubernetes Ingress Controller") responsible for managing incoming requests and routing them to the correct internal services became overloaded.
* This caused the ingress controller to become unhealthy, leading to the outage.

**Resolution:**
* The system recovered automatically without manual intervention.

**Action Items:**
* **Resource Scaling:** We are exploring options to automatically scale the ingress controller based on demand so that high traffic volumes are handled more effectively.

We understand the importance of a reliable platform for your operations and sincerely apologize for any inconvenience caused by this incident. Our team is dedicated to the continued improvement of the Harness platform's performance and reliability. We appreciate your trust and remain committed to providing you with a seamless experience.
Status: Postmortem
Impact: Major | Started At: Dec. 22, 2023, 12:30 p.m.
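The action item in this postmortem points at demand-based autoscaling for the ingress controller. As a rough illustration only (not Harness's actual configuration), the sketch below uses the official Kubernetes Python client and the autoscaling/v2 API to attach a HorizontalPodAutoscaler to a hypothetical ingress-controller Deployment; the deployment name, namespace, replica bounds, and CPU threshold are all assumptions.

```python
# Sketch: demand-based scaling for an ingress controller, assuming the
# "kubernetes" Python client with autoscaling/v2 support.
# Deployment name, namespace, and thresholds are illustrative, not Harness's.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(
        name="ingress-controller-hpa", namespace="ingress"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment",
            name="ingress-controller"),          # hypothetical Deployment name
        min_replicas=3,
        max_replicas=12,
        metrics=[client.V2MetricSpec(
            type="Resource",
            resource=client.V2ResourceMetricSource(
                name="cpu",
                target=client.V2MetricTarget(
                    type="Utilization", average_utilization=60)))]))

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="ingress", body=hpa)
```

With a rule like this in place, a traffic spike comparable to the load test would add replicas instead of saturating a fixed-size controller.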
Description: # Summary
On December 21st at 1:10 PM PST, we received a report from two of our customers about issues with their pipeline executions in our Prod2 cluster for their CI pipelines. A FireHydrant incident was triggered for the same shortly afterwards.

## Timeline (PST)
| **Time** | **Event** |
| --- | --- |
| 1:24 PM | Confirmed that no pipeline-service deployment had been done and that the issue was observed only for a few Prod2 and Prod1 CI customers. |
| 1:28 PM | Verified that CI automation was running fine, but we were able to reproduce the issue. |
| 2:04 PM | The Prod2 CI service was rolled back to the previous version, and customers confirmed the issue was mitigated. |

# Resolution
We rolled back the CI build in the Prod2 cluster to unblock the customers.

# Total Downtime
* Downtime taken: none
* Resolution time*: 1 hour 46 minutes
* Resolution time = time reported to time restored, either through rollback or hotfix

# RCA
There was a change in a common deserializer: we added handling so that if a value is a string containing a JSON list (for example, `"[1,2]"`), it is converted to a list of strings regardless of whether the field expects a plain string, which caused an exception during execution. This was mainly observed for a customer who had such a value set in their envVariables in a RunStep in the CI stage.

# Action Items
1. Update our customer setup automation to include this setup, as well as any others, so that our suite stays up to date and existing customer setups are not impacted by new feature development.
2. Add failover to code paths when making changes to existing flows, to minimize the impact of new feature/bug development on existing running setups.
Status: Postmortem
Impact: Major | Started At: Dec. 21, 2023, 10:05 p.m.
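The RCA above describes a deserializer that converted any JSON-list-shaped string regardless of the declared field type. Below is a minimal Python sketch of that failure mode and of a type-aware fix; the function names are illustrative, not Harness code.

```python
import json

def deserialize_permissive(raw: str, expected_type: type):
    """Buggy behaviour from the RCA: any string that looks like a JSON list
    is converted, even when the field is declared as a plain string."""
    if raw.startswith("[") and raw.endswith("]"):
        return json.loads(raw)   # "[1,2]" -> a list, wrong type for str fields
    return raw

def deserialize_type_aware(raw: str, expected_type: type):
    """Fix: only convert when the target field actually expects a list."""
    if expected_type is list and raw.startswith("[") and raw.endswith("]"):
        return json.loads(raw)
    return raw

# A RunStep env variable declared as a string but holding "[1,2]":
assert deserialize_permissive("[1,2]", str) == [1, 2]      # type mismatch -> failure downstream
assert deserialize_type_aware("[1,2]", str) == "[1,2]"     # declared type respected
assert deserialize_type_aware("[1,2]", list) == [1, 2]
```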
Description: The problem was primarily with the image being used; we have notified customers to use a stable version of DinD (Docker-in-Docker) to prevent this issue: https://github.com/docker-library/docker/commit/4c2674df4f40c965cdb8ccc77b8ce9dbc247a6c9
Status: Resolved
Impact: Minor | Started At: Dec. 18, 2023, 6:22 p.m.
Description: # Incident
On December 11th, starting at 4:25 PM (all times UTC), Harness had a service outage that affected pipelines in our Prod2 environment. Specifically, CI and CD pipeline executions in NextGen which used secrets failed. The incident was resolved on December 11th at 4:50 PM. This incident is related to the [incident](https://status.harness.io/incidents/w2w7btby70xs) from last week.

# Timeline
| **Time** | **Event** |
| --- | --- |
| Dec 11, 4:25 PM | Harness detected that pipelines were failing to resolve secrets. |
| Dec 11, 4:28 PM | The incident was acknowledged and a P0 incident was called. |
| Dec 11, 4:35 PM | Root cause identified. |
| Dec 11, 4:50 PM | Incident resolved. |

# Root Cause
## Background
Harness uses connectors to external secret managers (e.g. Google Secret Manager or HashiCorp Vault) to resolve/store secrets used by pipelines and elsewhere in the Harness platform. External secret manager connectors require configuration, including a means to authenticate to the external secret manager. On 2023-12-07, there was an incident where a bad secret manager configuration was leading to thread exhaustion. To mitigate that incident, we updated the faulty configuration in the database and restarted the affected services. The present incident was a downstream effect of that earlier incident.

# Mitigation and Remediation
* In the prior incident, we manually updated the config that controlled the broken secret manager connector. In this cleanup, we unintentionally left a dangling database entry. Had we updated the connector via the API, this entry would have been cleaned up correctly.
* After discovery, we deleted the secret through the API and restarted the affected services.

# Followup/Action Items
* On Friday, we rolled out a hotfix to prevent the creation of such faulty configurations. However, it did not help in this case since this was an existing configuration.
* Additional runtime validation was already in the works, which detects the self-reference when a secret is used in pipeline execution. It has since been rolled out.
Status: Postmortem
Impact: Major | Started At: Dec. 11, 2023, 4:42 p.m.
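The follow-up above mentions runtime validation that detects a self-reference when a secret is resolved during pipeline execution. Here is a minimal sketch of that idea, with hypothetical names rather than the Harness implementation: track which secret managers are already on the resolution path and fail fast instead of recursing.

```python
# Sketch: fail fast on circular secret-manager references during resolution,
# instead of recursing until the resolution thread pool is exhausted.
# All names and data structures are hypothetical, not Harness's actual code.

class CircularSecretReference(Exception):
    pass

def resolve_secret(secret_ref: str,
                   secret_to_manager: dict[str, str],
                   manager_auth_secret: dict[str, str | None],
                   path: tuple[str, ...] = ()) -> str:
    """Resolve secret_ref, resolving the owning manager's auth secret first."""
    manager = secret_to_manager[secret_ref]
    if manager in path:
        raise CircularSecretReference(
            f"secret {secret_ref!r} requires manager {manager!r}, "
            f"which is already being resolved (path: {' -> '.join(path)})")
    auth_ref = manager_auth_secret.get(manager)
    if auth_ref is not None:
        resolve_secret(auth_ref, secret_to_manager, manager_auth_secret,
                       path + (manager,))
    return f"<value of {secret_ref} from {manager}>"

# The configuration from the incident: the manager's own auth secret is stored in itself.
try:
    resolve_secret("db-password",
                   {"db-password": "vault-prod", "vault-token": "vault-prod"},
                   {"vault-prod": "vault-token"})
except CircularSecretReference as err:
    print(err)
```

Detecting the cycle at resolution time keeps a single bad configuration from consuming the shared thread pool that all tenants depend on.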
Description: # Incident
On December 7th, starting around 9 PM (all times UTC), Harness experienced an outage that affected pipelines in our Prod2 environment. Specifically, CI and CD pipelines in NextGen which used secrets were failing during execution. There was also intermittent downtime for FirstGen pipelines during the service restart events. The incident was resolved on December 8th at 3:01 AM.

# Timeline
| **Time** | **Event** |
| --- | --- |
| Dec 7, 9:13 PM | First customer reported the issue. Triaged as likely a result of a separate ongoing incident. |
| Dec 7, 10:34 PM | Incident acknowledged as independent of the separate incident, and the incident was called. |
| Dec 8, 2:13 AM | Root cause identified. |
| Dec 8, 3:01 AM | Incident resolved. |

# Response
Performance degradation and execution failure issues were reported across Continuous Integration (CI) and Continuous Deployment (CD) pipelines starting at 9:13 PM on December 7th. A high-severity incident was declared at 10:30 PM.

# Root Cause
## Background
Harness uses connectors to external secret managers (e.g. Google Secret Manager or HashiCorp Vault) to resolve/store secrets used by pipelines and elsewhere in the Harness platform. External secret manager connectors require configuration, including a means to authenticate to the external secret manager.

## Sequence of Events
* A customer configured their secret manager connector to use a secret stored in the same secret manager. The issue was not apparent at the time of the change to either Harness or the user, because validation rules did not catch it.
* Several hours later, a pipeline was run that referenced a secret contained in that secret manager.
* The pipeline execution tried to resolve the secret. Secret resolution created a recursive loop, filling the thread pool devoted to secret resolution.
* Thread pool exhaustion stalled secret resolution across the environment. End users experienced this stall as pipeline failures, because failed secret resolution fails a pipeline.

# Mitigation and Remediation
Mitigation consisted of:
1. Updating the faulty configuration to break the self-dependency.
2. Aborting the affected in-flight pipeline executions.
3. Scaling all replicas of the service which manages secret resolution to zero to stop the job from being picked back up by the scheduler. Note that redeploying or restarting the service did not fix the issue, because any surviving replica would instantly poison the others.

A hotfix has been released to ensure configuration validation includes checking for self-reference.

# Followup/Action Items
* Improve fault isolation and layering between services in a way that makes causal issues easier to detect.
* Our observability systems were operational and functioning normally; however, they were not configured to alert on this type of issue. We will be implementing two classes of fixes across the platform:
  1. Log-volume-based alerting. Although this would not have identified the specific issue sooner, it would have decreased time to detection.
  2. Closing the loop between observability metrics and alerting thresholds. As metrics are added, alerting thresholds need to be configured at the same time and adjusted as needed, rather than creating metrics and configuring alerting in a separate workstream. An alert on thread pool size would have greatly reduced the incident resolution time.
* Our incident response playbooks include triage steps for individual modules and steps for fault isolation at the platform level, but they did not fully cover the scope of actions needed to isolate this issue. We will enhance our playbooks to provide additional depth for platform-level triage.

We understand that the Harness platform is mission critical for our customers. We are committed to living up to our promise of reliability and availability. We are determined to learn from this incident and make the necessary improvements to meet our shared world-class standards. Your trust is of utmost importance, and we appreciate your understanding.
Status: Postmortem
Impact: Major | Started At: Dec. 7, 2023, 10:34 p.m.
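The hotfix described in this postmortem adds configuration-time validation so a secret manager connector cannot authenticate with a secret stored in itself, or in a chain of connectors that loops back to it. A small Python sketch of such a check follows; the data model and names are hypothetical, not Harness's.

```python
# Sketch of configuration-time validation for the self-reference described in
# this postmortem. auth_parent maps each secret-manager connector to the
# connector that stores its authentication secret (None = built-in manager).
# Names are illustrative, not Harness's actual data model.

def validate_no_self_reference(connector_id: str,
                               auth_parent: dict[str, str | None]) -> None:
    seen = {connector_id}
    current = auth_parent.get(connector_id)
    while current is not None:
        if current in seen:
            raise ValueError(
                f"connector {connector_id!r}: its authentication secret "
                f"resolution loops back through {current!r}")
        seen.add(current)
        current = auth_parent.get(current)

validate_no_self_reference("gcp-sm", {"gcp-sm": None})   # ok: built-in manager
try:
    validate_no_self_reference("vault-prod", {"vault-prod": "vault-prod"})
except ValueError as err:
    print(err)                                            # direct self-reference rejected
```

Rejecting the configuration at save time complements the runtime check from the related December 11th incident: the invalid connector never reaches pipeline execution in the first place.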
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.