Outage and incident data over the last 30 days for Harness.
OutLogger tracks the status of these components for Harness:
Component | Status |
---|---|
Service Reliability Management - Error Tracking FirstGen (fka OverOps) | Active |
Software Engineering Insights FirstGen (fka Propelo) | Active |
Prod 1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 2 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 3 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 4 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Prod Eu1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
View the latest incidents for Harness and check for official updates:
Description:
# Summary
Customers with no target groups configured were returned `null` instead of `[]` for the /target-segments request when their SDKs started up. This could lead to null pointer exceptions and a failure to initialise for some SDKs.

## SDK Customer Impact
The issue affected a number of SDK versions, so we tested those and the latest versions to ascertain impact.

Java
* 1.3.1:
  * Behaviour: the `waitForInitialization` call never unblocks once the exception is thrown and caught in the polling thread. This would have caused user code to "freeze" while the SDK was initialising.
  * Impact: critical; the SDK blocks user code from executing.
* 1.6.0 - latest version:
  * Behaviour: an error is logged that the group was null and that the size of the group could not be calculated. Flags are loaded correctly, the `waitForInitialization` call unblocks, and evaluation calls return the correct variation.
  * Impact: no functional impact outside of error logs; the correct evaluation is returned.

Node.js
* 1.3.1:
  * Behaviour: an `UnhandledPromiseRejection` causes the SDK and application to crash. If the SDK client and `waitForInitialization` were used in a try-catch block, an error would be logged and the SDK would serve the correct evaluations.
  * Impact:
    * If no exception handling was used on the client: critical; the user's application would crash.
    * If exception handling was used: no functional impact outside of error logs; the correct evaluation is returned.
* 1.8.1 - latest version: same behaviour and impact as 1.3.1.

Other SDK impact
The remaining server SDKs were tested on their latest versions to ascertain impact. While there were no direct customer reports of issues, this helps to understand the scope of the issue.
* Erlang 3.0.0: critical impact; the exception thrown could prevent an application from starting, depending on how the SDK has been integrated.
* Python 1.6.2: high impact; the SDK fails to initialise and serves default variations.
* .NET 1.7.0: no impact, but an error is logged.
* Go v0.1.23: no impact.
* Ruby 1.3.0: no impact.

## RCA
### Why did some customers experience null pointer exceptions?
When a customer had zero target groups, the /client/target-segments endpoint returned the value `null` instead of an empty array `[]`.

### Why was null being returned instead of an empty array?
A change was made in the DB layer of the backend to return an empty array rather than a not-found error when no target groups exist for a customer. This caused a different codepath to be exercised: it copies all groups into a new array and changes some data before marshalling and returning the JSON response. Because no groups existed, this copy mistakenly produced a nil object instead of an empty array, which was then marshalled into the `null` JSON response. A minimal illustration of this class of bug is sketched after this incident's details.

### Why was that change made to begin with?
Because of our high request rates we use many layers of caching. A side effect of returning errors from the DB layer when no target groups exist was that the response was not cached. With some high-volume customers having no target groups, this led to tens of millions of unnecessary requests hitting the database per week when flags were evaluated, which we were attempting to avoid.

### Why was this scenario not caught by tests?
Unit tests, end-to-end tests, and SDK-specific tests exist for this endpoint; however, the case where target groups are empty wasn't fully covered.
This change was primarily meant to improve performance for the /client/evaluations endpoint, which uses this code path and which was manually tested and confirmed to work correctly. The /client/target-segments code path experiencing side effects from this change wasn't anticipated or caught by automated testing.

## Follow up actions
Follow-up actions cover the issues faced, with Jira IDs linked for tracking completion; these follow-up items must also be linked in the RCA ticket.
* Test enhancements: add unit tests for target groups, covering none, one, and multiple groups.
* Add new validation and logging to ensure valid JSON is returned by the endpoints.
Status: Postmortem
Impact: Minor | Started At: June 25, 2024, 4:15 p.m.
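The nil-versus-empty behaviour described in this RCA matches a well-known property of JSON marshalling in Go (an assumption here; the postmortem does not name the backend language): a nil slice serialises to `null`, while an initialised empty slice serialises to `[]`. The sketch below is a hypothetical illustration, not Harness's actual code; the type and function names are invented.

```go
// Hypothetical illustration of how a nil slice can serialise as `null`
// instead of `[]` in a JSON response. Names are invented for the sketch.
package main

import (
	"encoding/json"
	"fmt"
)

type TargetSegment struct {
	Identifier string `json:"identifier"`
}

// transformSegments mimics the "copy and modify" codepath described in the RCA:
// the output slice is only created while iterating, so zero input groups leave
// it as its nil zero value.
func transformSegments(in []TargetSegment) []TargetSegment {
	var out []TargetSegment // stays nil until the first append
	for _, s := range in {
		out = append(out, s)
	}
	return out
}

func main() {
	empty, _ := json.Marshal(transformSegments([]TargetSegment{}))
	fmt.Println(string(empty)) // prints "null" — what affected SDKs received

	// Initialising the slice explicitly yields the expected empty array.
	fixed, _ := json.Marshal(make([]TargetSegment, 0))
	fmt.Println(string(fixed)) // prints "[]"
}
```

A typical remedy is to return an explicitly initialised empty slice from the copy path, or to validate the marshalled payload before responding, which mirrors the validation-and-logging follow-up action above.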
Description:
## **Summary:**
Pipeline executions across CI/CD were not advancing as expected, with some getting stuck in the prod2 cluster.

## **What was the issue?**
One of the microservices was running low on resources because a pipeline execution consumed excessive resources while attempting to resolve expressions. This caused all pipelines to run roughly 80% slower for the duration of the incident, with a few executions getting into an unresponsive state.

## **Timeline:**

## **Resolution:**
The pipeline execution that was consuming excessive resources was aborted, and the service pods were restarted to recover the system.

## **RCA**
A pipeline contained circular references in its files. The runtime resolution of these references caused an excessive number of threads to enter the Waiting state. Although the issue was automatically detected (and the system auto-protected itself by breaking the circuit), the excess threads were still consumed because the loop-detection threshold was set too high. We have since reduced the loop threshold and automated the blocking of such runaway pipeline executions. A sketch of this style of loop detection follows this incident's details.
Status: Postmortem
Impact: Minor | Started At: June 3, 2024, 11:30 p.m.
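To illustrate the loop-detection idea in the RCA above: combining a visited set with a hard depth cap lets a resolver fail fast on circular references instead of tying up threads. Everything below (the `ref:` syntax, the threshold value, the function names) is an invented sketch, not Harness's expression engine.

```go
// Hypothetical sketch of runtime loop detection when resolving references,
// with a hard threshold that aborts resolution rather than letting threads
// pile up in the Waiting state.
package main

import (
	"errors"
	"fmt"
	"strings"
)

const maxResolutionDepth = 64 // illustrative cap; the RCA describes lowering this threshold

// resolve follows "ref:<name>" values through vars, detecting cycles via the
// seen set and failing fast once the depth threshold is exceeded.
func resolve(name string, vars map[string]string, seen map[string]bool, depth int) (string, error) {
	if depth > maxResolutionDepth {
		return "", errors.New("resolution depth exceeded: aborting runaway expression")
	}
	if seen[name] {
		return "", fmt.Errorf("circular reference detected at %q", name)
	}
	seen[name] = true
	val := vars[name]
	if ref, ok := strings.CutPrefix(val, "ref:"); ok {
		return resolve(ref, vars, seen, depth+1)
	}
	return val, nil
}

func main() {
	vars := map[string]string{"a": "ref:b", "b": "ref:a"} // circular reference
	if _, err := resolve("a", vars, map[string]bool{}, 0); err != nil {
		fmt.Println("blocked:", err) // blocked: circular reference detected at "a"
	}
}
```

Turning this error into an automatic abort of the offending execution corresponds to the "automated blocking of runaway pipeline executions" described above.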
Description:
**What was the issue?**
Pipeline executions were stuck and failing due to a Redis instance in the primary region (us-west1) becoming unresponsive. Impact was limited to pipeline executions in NextGen.

**Timeline**

| **Time** | **Event** |
| --- | --- |
| 28 May - 8:56 PM PDT | We received an alert on high memory utilization on our Redis instance. The team identified a particularly large cache key and attempted to manually delete it, which resulted in the Redis instance becoming unstable. Pipeline executions began failing or hanging. We are still working with our Redis service provider to understand why the key deletion made the instance unresponsive; such deletions have been done previously without issue, consistent with vendor advice. |
| 28 May - 9:15 PM PDT | We engaged the Redis support team and executed a mitigation plan. |
| 28 May - 9:40 PM PDT | We failed over our application to the secondary Redis instance and the system started to recover. Unfortunately, the secondary region also went into a bad state as we migrated load. |
| 28 May - 10:02 PM PDT | The primary Redis instance became healthy. |
| 28 May - 10:25 PM PDT | Application traffic was migrated back to the primary Redis region, which restored functionality. |

**RCA & Action Items:**
Pipelines started failing because the Redis instance in the primary region became unresponsive after the manual deletion of a large key. We are working with the vendor to determine why this action caused instability and will share more details when they become available. As an immediate action item, we are implementing an upper bound on cache key size (see the sketch after this incident) so that large cache keys can no longer drive the instance into high memory usage. In the longer term, we are revisiting our architecture to eliminate large keys in Redis altogether.
Status: Postmortem
Impact: Critical | Started At: May 29, 2024, 4:06 a.m.
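As a sketch of the "upper bound on cache key size" action item: a write path can refuse oversized payloads before they ever reach Redis. This assumes a Go service using the go-redis client; the size limit, key, and function names are illustrative, not Harness's implementation.

```go
// Hypothetical sketch: enforce a size cap on cached values so a single huge
// entry cannot push the Redis instance into high memory usage.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

const maxCacheValueBytes = 1 << 20 // 1 MiB cap per cached entry (illustrative)

var ErrValueTooLarge = fmt.Errorf("cache value exceeds %d bytes", maxCacheValueBytes)

// setBounded writes to Redis only when the payload is under the cap.
func setBounded(ctx context.Context, rdb *redis.Client, key string, value []byte, ttl time.Duration) error {
	if len(value) > maxCacheValueBytes {
		return ErrValueTooLarge // caller can log and fall back to an uncached path
	}
	return rdb.Set(ctx, key, value, ttl).Err()
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	err := setBounded(context.Background(), rdb, "pipeline:cache:example", make([]byte, 2<<20), 10*time.Minute)
	fmt.Println(err) // cache value exceeds 1048576 bytes
}
```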
Description:
## What was the issue?
Pipelines in Prod1 were experiencing intermittent failures caused by gRPC connection issues between Harness services. The majority of the failed gRPC requests occurred between the CI Manager and Harness Manager (CG), resulting in a primary impact on CI pipelines.

## Timeline

| **Time** | **Event** |
| --- | --- |
| May 28, 9:50 AM PDT | A customer reported intermittent pipeline failures. The team initiated an investigation but did not identify any issues with the infrastructure; the failures appeared isolated to one specific customer. Teams were promptly alerted to monitor this customer's pipelines. |
| May 28, 3:05 PM PDT | The engineering team observed more occurrences of the issue across other pipelines. |
| May 28, 3:30 PM PDT | The team decided to roll back a recent deployment in order to investigate any potential correlation. |
| May 28, 5:15 PM PDT | The status of the Prod1 environment was updated to "degraded performance" due to the intermittent issues. |
| May 28, 6:20 PM PDT | The issue was suspected to be related to kube-dns resolution, causing some gRPC requests to fail randomly. GCP support was engaged for further investigation, and service thread dumps were captured for internal debugging; the thread dumps revealed nothing suspicious. |
| May 28, 7:00 PM PDT | Status changed to "monitoring". |

## RCA and Action Items:
Pipelines experienced failures because internal service communication failed despite multiple attempts. The engineering team initially suspected kube-dns problems; however, after consulting with GCP support, this was ruled out. It was noted that certain service pod replicas were receiving an uneven distribution of requests. To address this, the following corrective actions are underway (a sketch of the load-balancing approach follows this incident):
1. Enhancing load balancing for gRPC calls among service pods.
2. Incorporating traceId in delegate task submissions.
Furthermore, we have set up gRPC-related alerts to catch similar situations in the future.
Status: Postmortem
Impact: Minor | Started At: May 29, 2024, 12:15 a.m.
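The postmortem does not detail how gRPC load balancing was enhanced; a common remedy for uneven request distribution across pods (caused by long-lived HTTP/2 connections pinning to one backend) is client-side round-robin balancing over DNS-resolved addresses. The grpc-go sketch below shows that pattern; the target address and port are placeholders, not Harness's real endpoints.

```go
// Hypothetical sketch: spread gRPC calls across service pods with DNS
// resolution plus client-side round_robin balancing, instead of pinning
// every request to a single long-lived connection.
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	conn, err := grpc.Dial(
		"dns:///harness-manager.example.svc.cluster.local:9879", // placeholder headless-service target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Pick a new resolved backend per RPC rather than the first address only.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
	// conn would now be used to construct service clients; each RPC is
	// balanced across the pod IPs returned by DNS.
}
```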
Description:
## Summary
A few trial Current Gen (CG) CD customers had difficulty logging in and accessing Harness Secret Manager in our Prod-2 environment. There was no impact on Next Gen (NG) customers.

## Timeline

| **Time (UTC)** | **Event** |
| --- | --- |
| 04:22 am | We received internal alerts for an increased error rate related to Secret Manager. |
| 06:09 am | An incident was raised when a trial customer reached out. |
| 06:28 am | The root cause was identified. |
| 07:34 am | The incident was resolved. |

## Resolution
Upon reviewing the code base, we identified a key configuration missing from our database records. The required data was restored from periodic snapshots.

## RCA
As part of the official EOL for CG CD, a cleanup activity was performed in the backend database for all internal accounts. In that cleanup, a legacy configuration used by the **Harness Secret Manager** was deleted.

## Action Items
We will perform stringent checks on data before cleanup, followed by sanity checks to ensure no functionality is impacted.
Status: Postmortem
Impact: Major | Started At: May 3, 2024, 6:19 a.m.