Harness Status: Check if Harness is down or having an outage.

Harness outages and incidents

Outage and incident data over the last 30 days for Harness.

There have been 3 outages or incidents for Harness in the last 30 days.

Severity Breakdown:

None: 0

Minor: 3

Major: 0

Critical: 0

Tired of searching for status updates?

Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!

Components and Services Monitored for Harness

OutLogger tracks the status of these components for Harness:

Service Reliability Management - Error Tracking FirstGen (fka OverOps) Active

Software Engineering Insights FirstGen (fka Propelo) Active

Prod 1

Chaos Engineering Active

Cloud Cost Management (CCM) Active

Continuous Delivery (CD) - FirstGen - EOS Active

Continuous Delivery - Next Generation (CDNG) Active

Continuous Error Tracking (CET) Active

Continuous Integration Enterprise(CIE) - Cloud Builds Active

Continuous Integration Enterprise(CIE) - Linux Cloud Builds Active

Continuous Integration Enterprise(CIE) - Self Hosted Runners Active

Continuous Integration Enterprise(CIE) - Windows Cloud Builds Active

Custom Dashboards Active

Feature Flags (FF) Active

Infrastructure as Code Management (IaCM) Active

Internal Developer Portal (IDP) Active

Security Testing Orchestration (STO) Active

Service Reliability Management (SRM) Active

Software Engineering Insights (SEI) Active

Software Supply Chain Assurance (SSCA) Active

Prod 2

Chaos Engineering Active

Cloud Cost Management (CCM) Active

Continuous Delivery (CD) - FirstGen - EOS Active

Continuous Delivery - Next Generation (CDNG) Active

Continuous Error Tracking (CET) Active

Continuous Integration Enterprise(CIE) - Cloud Builds Active

Continuous Integration Enterprise(CIE) - Linux Cloud Builds Active

Continuous Integration Enterprise(CIE) - Self Hosted Runners Active

Continuous Integration Enterprise(CIE) - Windows Cloud Builds Active

Custom Dashboards Active

Feature Flags (FF) Active

Infrastructure as Code Management (IaCM) Active

Internal Developer Portal (IDP) Active

Security Testing Orchestration (STO) Active

Service Reliability Management (SRM) Active

Software Engineering Insights (SEI) Active

Software Supply Chain Assurance (SSCA) Active

Prod 3

Chaos Engineering Active

Cloud Cost Management (CCM) Active

Continuous Delivery (CD) - FirstGen - EOS Active

Continuous Delivery - Next Generation (CDNG) Active

Continuous Error Tracking (CET) Active

Continuous Integration Enterprise(CIE) - Cloud Builds Active

Continuous Integration Enterprise(CIE) - Linux Cloud Builds Active

Continuous Integration Enterprise(CIE) - Self Hosted Runners Active

Continuous Integration Enterprise(CIE) - Windows Cloud Builds Active

Custom Dashboards Active

Feature Flags (FF) Active

Infrastructure as Code Management (IaCM) Active

Internal Developer Portal (IDP) Active

Security Testing Orchestration (STO) Active

Service Reliability Management (SRM) Active

Software Supply Chain Assurance (SSCA) Active

Prod 4

Chaos Engineering Active

Cloud Cost Management (CCM) Active

Continuous Delivery - Next Generation (CDNG) Active

Continuous Error Tracking (CET) Active

Continuous Integration Enterprise(CIE) - Cloud Builds Active

Continuous Integration Enterprise(CIE) - Linux Cloud Builds Active

Continuous Integration Enterprise(CIE) - Self Hosted Runners Active

Continuous Integration Enterprise(CIE) - Windows Cloud Builds Active

Custom Dashboards Active

Feature Flags (FF) Active

Infrastructure as Code Management (IaCM) Active

Internal Developer Portal (IDP) Active

Security Testing Orchestration (STO) Active

Service Reliability Management (SRM) Active

Prod Eu1

Chaos Engineering Active

Cloud Cost Management (CCM) Active

Continuous Delivery - Next Generation (CDNG) Active

Continuous Error Tracking (CET) Active

Continuous Integration Enterprise(CIE) - Cloud Builds Active

Continuous Integration Enterprise(CIE) - Linux Cloud Builds Active

Continuous Integration Enterprise(CIE) - Self Hosted Runners Active

Continuous Integration Enterprise(CIE) - Windows Cloud Builds Active

Custom Dashboards Active

Feature Flags (FF) Active

Infrastructure as Code Management (IaCM) Active

Internal Developer Portal (IDP) Active

Security Testing Orchestration (STO) Active

Service Reliability Management (SRM) Active

Latest Harness outages and incidents.

View the latest incidents for Harness and check for official updates:

Some customers on Prod1 may be experiencing degraded performance

Description: This incident has been resolved. We will provide an RCA after findings are complete.

Status: Resolved

Impact: Minor | Started At: Oct. 16, 2024, 5:09 p.m.

Updates:

Time: Oct. 16, 2024, 6:03 p.m.

Status: Resolved

Update: This incident has been resolved. We will provide an RCA after findings are complete.
Time: Oct. 16, 2024, 5:49 p.m.

Status: Monitoring

Update: The issue has been mitigated. We are still monitoring the system to ensure healthy operation of the cluster.
Time: Oct. 16, 2024, 5:37 p.m.

Status: Identified

Update: We have identified the service that is causing the degradation. We have scaled up the DB resource for that service. We are still working to mitigate the issue.
Time: Oct. 16, 2024, 5:09 p.m.

Status: Investigating

Update: We have internally found an issue that is impacting the optimal performance for Prod1 customers. We are actively investigating this.

Pipeline Steps Timing out for a subset of customers in Prod2

Description:

## **Summary:**

Pipeline executions were failing with a time-out error on Prod2. This affected ~3% of pipeline executions.

## **What was the issue?**

Tasks are execution units that run on a delegate as part of a pipeline execution. As a pipeline runs, its tasks are broadcast to delegates, and one eligible delegate picks up the task for execution. If no delegate acquires the task within the stipulated time, it is rebroadcast. During this incident, rebroadcast functionality was affected, causing pipeline executions to time out.

## **Resolution:**

We rolled back the service to resolve the issue.

## **RCA**

An incompatible change was rolled out in one of our microservices, causing deserialization failures for a subset of task types. The rebroadcast threads went into an error state due to this deserialization error, resulting in the failure of pipelines that required task rebroadcasts. The system recovered upon the service's rollback.

**Action Item**

1. Added a critical alert for rebroadcast events.
2. Rebroadcast logic is made resilient to task deserialization errors.
3. Unit test added to catch incompatible contract changes for task data.
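The second action item in the postmortem above (making rebroadcast logic resilient to deserialization errors) can be illustrated with a minimal Python sketch. This is not Harness's implementation: the `Task` shape, the JSON payload format, and the `rebroadcast_pending` and `broadcast` names are all assumptions made for illustration. The point is only that a failure to decode one task is recorded and skipped, rather than putting the whole rebroadcast loop into an error state.

```python
import json
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical task shape; the real task contract is not described in the source.
    task_id: str
    task_type: str


def deserialize(raw: str) -> Task:
    """Decode one stored task; raises ValueError on an incompatible payload."""
    try:
        data = json.loads(raw)
        return Task(task_id=data["task_id"], task_type=data["task_type"])
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"cannot deserialize task: {exc}") from exc


def rebroadcast_pending(raw_tasks, broadcast):
    """Rebroadcast every decodable task and collect failures instead of aborting."""
    failed = []
    for raw in raw_tasks:
        try:
            task = deserialize(raw)
        except ValueError as exc:
            # One bad payload must not stop the loop for the remaining tasks.
            failed.append((raw, str(exc)))
            continue
        broadcast(task)
    return failed
```

In a sketch like this, the returned `failed` list is the natural place to hook an alert, in the spirit of the first action item.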

Status: Postmortem

Impact: Minor | Started At: Oct. 14, 2024, 3:08 p.m.

Updates:

Time: Oct. 31, 2024, 4:28 a.m.

Status: Postmortem

Update: Postmortem published; the text is identical to the incident description above.
Time: Oct. 14, 2024, 5:25 p.m.

Status: Resolved

Update: The incident has been resolved. We will be sharing an RCA with improvements in monitoring and other steps.
Time: Oct. 14, 2024, 5:08 p.m.

Status: Monitoring

Update: The issue has been fixed and we are monitoring the system.
Time: Oct. 14, 2024, 4:01 p.m.

Status: Identified

Update: The issue has been identified and we are still working on a fix.
Time: Oct. 14, 2024, 3:08 p.m.

Status: Investigating

Update: We are currently investigating an issue where the clone codebase step is failing for a subset of customers in Prod2.

Harness cloud builds failing at initialise step for MAC users

Description:

### **Summary**

CI-hosted MacOS pipelines were failing during the initialisation step, impacting specific customers using our MacOS-hosted service.

### What was the issue?

We tightened a firewall rule for our Mac VM registry that was previously too permissive. As a result, another component couldn’t access the registry, leading to pipeline failures.

### **Resolution**

| **Time** | **Event** |
| --- | --- |
| Sept 1st, 17:00 UTC | Restricted the firewall rule. |
| Sept 04, 06:03 UTC | Issue reported by the customer. |
| Sept 04, 08:39 UTC | We re-created the firewall rule and validated that the issue was fixed. |

### RCA

Our MacOS production setup includes several components. When we restricted the permissive firewall rule, the new rule did not account for the NAT IP address of one of these components. After the change, we ran a full sanity pipeline on the Mac machines, which passed successfully. The issue didn’t surface immediately as the affected component maintains a persistent socket connection, unaffected by the firewall until the connection is re-established or restarted. This explains why the failure didn’t occur immediately after we removed the permissive rule on September 1st. We restored the rule, and the issue was resolved.

### Action Items

1. Restrict the firewall rule again, ensuring that necessary NAT IPs are included.
2. Restart all relevant services when applying firewall rule restrictions.
3. Ensure that all connections are properly drained and re-established when the change is implemented.
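The first action item above (ensuring that the necessary NAT IPs are included in the tightened rule) lends itself to a pre-change check. The sketch below is only an illustration using Python's standard `ipaddress` module; the function name `uncovered_sources` and all addresses (taken from the reserved documentation ranges) are assumptions, not details from the incident.

```python
import ipaddress


def uncovered_sources(allowed_cidrs, required_ips):
    """Return the required source IPs that no allowed CIDR range contains."""
    networks = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    missing = []
    for ip in required_ips:
        address = ipaddress.ip_address(ip)
        if not any(address in network for network in networks):
            missing.append(ip)
    return missing


# Hypothetical proposed rule and the NAT IPs of components that must reach the registry.
proposed_rule = ["192.0.2.0/24", "198.51.100.7/32"]
component_nat_ips = ["192.0.2.15", "198.51.100.7", "203.0.113.10"]

# A non-empty result means the rule would lock out at least one component.
print(uncovered_sources(proposed_rule, component_nat_ips))
```

A static check of this kind would not have exposed the delayed failure described in the RCA, because the affected component kept working over its existing connection. That is why the remaining action items call for restarting services and draining connections when the rule changes.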

Status: Postmortem

Impact: Major | Started At: Sept. 4, 2024, 6:33 a.m.

Updates:

Time: Sept. 17, 2024, 10:29 a.m.

Status: Postmortem

Update: Postmortem published; the text is identical to the incident description above.
Time: Sept. 4, 2024, 6:47 a.m.

Status: Resolved

Update: We apologise for the inconvenience caused by this outage. We will make sure to provide the root cause analysis soon.
Time: Sept. 4, 2024, 6:39 a.m.

Status: Monitoring

Update: The issue is resolved now. We will share the RCA for the problem as soon as possible.
Time: Sept. 4, 2024, 6:33 a.m.

Status: Investigating

Update: We are currently investigating this issue.

Login issues on Prod4

Description:

## **Summary:**

Logged-in users started getting redirected to the enrollment screen with an “Email verified successfully” message and were forced to enter their user details again. Pipeline executions and backend tasks were not impacted. The impact was limited to accounts in the Prod 4 cluster.

## **What was the issue?**

We released an incompatible version of the NextGen UI service, which sent existing users through the new sign-up flow. This was a human error.

## **Timeline:**

| **Time** | **Event** |
| --- | --- |
| September 03 7:45 PM UTC | Customer reported login redirection to the sign-up page |
| September 03 8:15 PM UTC | New deployment happened around the same time. Decided to roll back |
| September 03 8:20 PM UTC | Started the partial rollback of FF Proxy changes |
| September 03 8:30 PM UTC | Partial rollback didn’t fix the issue. Initiated full rollback |
| September 03 9:00 PM UTC | Full rollback completed and issue resolved |

## **Resolution:**

Rollback resolved the issue.

## **RCA**

There was a human error in picking the version of the NextGen UI service. Post-deployment sanity checks did not catch this issue. Rolling back took longer than expected because multiple services were deployed together.

**Action Item**

1. Remove the manual process for picking service versions. Automate the promotion process from lower environments.
2. Improve the sanity tests to catch the above UI flow.
3. Make the rollback process atomic, based on the previous known good state.
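The third action item above (an atomic rollback based on the previous known good state) can be pictured with a small Python sketch. It is a simplified model rather than a description of Harness's deployment tooling: the `Release` class, its methods, and the service names and version strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Release:
    """Tracks deployed service versions and the last snapshot that passed checks."""

    deployed: dict = field(default_factory=dict)
    known_good: dict = field(default_factory=dict)

    def deploy(self, versions: dict) -> None:
        # Apply a batch of service versions on top of what is running.
        self.deployed = {**self.deployed, **versions}

    def mark_good(self) -> None:
        # Record the full current set once post-deployment checks pass.
        self.known_good = dict(self.deployed)

    def rollback(self) -> dict:
        # Restore every service in one step rather than service by service.
        self.deployed = dict(self.known_good)
        return self.deployed


release = Release()
release.deploy({"ui": "1.0", "ff-proxy": "2.0"})   # hypothetical versions
release.mark_good()
release.deploy({"ui": "1.1", "ff-proxy": "2.1"})   # the problematic batch
print(release.rollback())
```

Restoring the whole snapshot at once avoids the partial-then-full rollback sequence recorded in the timeline, where rolling back a single component did not fix the issue.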

Status: Postmortem

Impact: Major | Started At: Sept. 3, 2024, 7:45 p.m.

Updates:

Time: Sept. 11, 2024, 11 p.m.

Status: Postmortem

Update: Postmortem published; the text is identical to the incident description above.
Time: Sept. 11, 2024, 10:57 p.m.

Status: Resolved

Update: We can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
Time: Sept. 11, 2024, 10:55 p.m.

Status: Investigating

Update: Logged-in users started getting redirected to the enrollment screen. We are currently investigating.

Customers unable to access Harness on Prod4 Cluster

Description:

## **Summary:**

Customers experienced login failures with 5xx errors on the Prod4 cluster.

## **What was the issue?**

The Harness platform internally uses a managed memStore, which experienced a “Host error”. This triggered a master switchover within seconds. Backend microservices that connect to memStore were not able to reconnect quickly. The issue affected Java-based services; Go services reconnected properly.

## **Timeline:**

| **Time** | **Event** |
| --- | --- |
| 21 August 4:06:41 PM UTC | Primary memStore went down |
| 21 August 4:07:00 PM UTC | Secondary memStore promoted to Primary |
| 21 August 4:06:41 PM UTC | Harness services experience RedisResponseTimeoutException |
| 21 August 4:14:30 PM UTC | Harness services restore connectivity to new Primary |
| 21 August 4:14:53 PM UTC | New instance of memStore added and promoted as Secondary |

## **Resolution:**

After about 8 minutes, the services reconnected to the new primary memStore on their own and the system recovered.

## **RCA**

Java services use the Redisson library to connect to memStore. The established connection pool doesn’t detect the endpoint going away, and these connections eventually time out. This issue does not occur during a graceful failover; we encounter it only in the case of a catastrophic failure.

**Action Item**

* Detect this catastrophic failure and have services reconnect more quickly
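The action item above (detecting the failure and reconnecting more quickly) can be sketched in a library-agnostic way. The Python below does not use or model the Redisson API. `StoreTimeout`, `FailoverAwareClient`, and the `resolve_primary` and `connect` callables are hypothetical names, and treating a single timeout as a signal to reconnect is an assumption chosen to keep the example short.

```python
class StoreTimeout(Exception):
    """Raised by the hypothetical connection when a request gets no reply in time."""


class FailoverAwareClient:
    def __init__(self, resolve_primary, connect, max_attempts=2):
        self._resolve_primary = resolve_primary  # returns the current primary address
        self._connect = connect                  # opens a connection to an address
        self._max_attempts = max_attempts
        self._conn = None

    def _connection(self):
        # Reuse the cached connection, or open one to whatever is primary now.
        if self._conn is None:
            self._conn = self._connect(self._resolve_primary())
        return self._conn

    def get(self, key):
        last_error = None
        for _ in range(self._max_attempts):
            try:
                return self._connection().get(key)
            except StoreTimeout as exc:
                # Treat a timeout as a possibly dead endpoint: drop the cached
                # connection so the next attempt re-resolves the primary.
                last_error = exc
                self._conn = None
        raise last_error
```

The design choice here is to discard the cached connection on the first timeout instead of waiting for pooled connections to expire on their own, which is the behaviour the RCA identifies as the cause of the roughly eight-minute recovery.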

Status: Postmortem

Impact: Critical | Started At: Aug. 21, 2024, 4:06 p.m.

Updates:

Time: Sept. 4, 2024, 5:31 p.m.

Status: Postmortem

Update: Postmortem published; the text is identical to the incident description above.
Time: Aug. 21, 2024, 6:47 p.m.

Status: Resolved

Update: We can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
Time: Aug. 21, 2024, 6:46 p.m.

Status: Investigating

Update: We are currently investigating this issue.

Check the status of similar companies and alternatives to Harness

UiPath

Systems Active

Scale AI

Systems Active

Notion

Systems Active

Brandwatch

Systems Active

Olive AI

Systems Active

Sisense

Systems Active

HeyJobs

Systems Active

Joveo

Systems Active

Seamless AI

Systems Active

EdCast by Cornerstone

Systems Active

hireEZ

Systems Active

Alchemy

Systems Active

Frequently Asked Questions - Harness

Is there a Harness outage?

The current status of Harness is: Systems Active

Where can I find the official status page of Harness?

The official status page for Harness is here

How can I get notified if Harness is down or experiencing an outage?

To get notified of any status changes to Harness, simply sign up to OutLogger's free monitoring service. OutLogger checks the official status of Harness every few minutes and will notify you of any changes. You can view the status of all your cloud vendors in one dashboard. Sign up here

What does Harness do?

Harness is a software delivery platform that enables engineers and DevOps to build, test, deploy, and verify software as needed.
