
Is there a Harness outage?

Harness status: Systems Active

Last checked: 6 minutes ago

Get notified about any outages, downtime, or incidents for Harness and 1800+ other cloud vendors. Monitor 10 companies for free.

Subscribe for updates

Harness outages and incidents

Outage and incident data over the last 30 days for Harness.

There have been 3 outages or incidents for Harness in the last 30 days.


Tired of searching for status updates?

Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!

Sign Up Now

Components and Services Monitored for Harness

OutLogger tracks the status of these components for Harness:

Service Reliability Management - Error Tracking FirstGen (fka OverOps) Active
Software Engineering Insights FirstGen (fka Propelo) Active
Chaos Engineering Active
Cloud Cost Management (CCM) Active
Continuous Delivery (CD) - FirstGen - EOS Active
Continuous Delivery - Next Generation (CDNG) Active
Continuous Error Tracking (CET) Active
Continuous Integration Enterprise(CIE) - Cloud Builds Active
Continuous Integration Enterprise(CIE) - Linux Cloud Builds Active
Continuous Integration Enterprise(CIE) - Self Hosted Runners Active
Continuous Integration Enterprise(CIE) - Windows Cloud Builds Active
Custom Dashboards Active
Feature Flags (FF) Active
Infrastructure as Code Management (IaCM) Active
Internal Developer Portal (IDP) Active
Security Testing Orchestration (STO) Active
Service Reliability Management (SRM) Active
Software Engineering Insights (SEI) Active
Software Supply Chain Assurance (SSCA) Active

Latest Harness outages and incidents

View the latest incidents for Harness and check for official updates:

Updates:

  • Time: June 27, 2024, 2:18 p.m.
    Status: Postmortem
    Update:
    # Summary
    Customers with no target groups configured were being returned `null` instead of `[]` for the /target-segments request when their SDKs started up. This could lead to null pointer exceptions and a failure to initialise for some SDKs.

    ## SDK Customer Impact
    The issue affected a number of SDK versions, so we tested those and the latest versions to ascertain impact.

    Java
    * 1.3.1:
      * Behaviour: the `waitForInitialization` call never unblocks once the exception is thrown and caught in the polling thread. This would have caused user code to "freeze" while the SDK was initialising.
      * Impact: critical; the SDK blocks user code from executing.
    * 1.6.0 - latest version:
      * Behaviour: an error is logged that the group was null and that the size of the group could not be calculated. Flags are loaded correctly, the `waitForInitialization` call unblocks, and evaluation calls return the correct variation.
      * Impact: no functional impact outside of error logs; the correct evaluation is returned.

    Node.js
    * 1.3.1:
      * Behaviour: an `UnhandledPromiseRejection` causes the SDK and application to crash. If the SDK client and `waitForInitialization` were used in a try-catch block, an error would be logged and the SDK would serve the correct evaluations.
      * Impact: if no exception handling was used on the client, critical; the user's application would crash. If exception handling was used, no functional impact outside of error logs; the correct evaluation is returned.
    * 1.8.1 - latest version: same behaviour and impact as 1.3.1.

    Other SDK impact: the remaining server SDKs were tested on their latest versions to ascertain impact. While there were no direct customer reports of issues, this is useful for understanding the scope of the issue.
    * Erlang 3.0.0: critical impact; the exception thrown could prevent an application from starting, depending on how the SDK has been integrated.
    * Python 1.6.2: high impact; the SDK fails to initialise and serves default variations.
    * .NET 1.7.0: no impact, but an error is logged.
    * Go v0.1.23: no impact.
    * Ruby 1.3.0: no impact.

    ## RCA
    ### Why did some customers experience null pointer exceptions?
    In the scenario where a customer had 0 target groups, the /client/target-segments endpoint returned the value `null` instead of an empty array `[]`.

    ### Why was null being returned instead of an empty array?
    A change was made in the db layer of the backend to return an empty array rather than a not-found error when no target groups exist for a customer. This caused a different codepath to be hit: one that copies all groups into a new array and changes some data before marshalling and returning the JSON response. Because no groups existed, this copy mistakenly produced a nil object instead of an empty array, which was then marshalled into the `null` JSON response (see the sketch after this update list).

    ### Why was that change made to begin with?
    Because of our high request rates we use many layers of caching. A side effect of returning errors from the db layer when no target groups exist was that we wouldn't cache that response. With some high-volume customers having no target groups, this led to tens of millions of unnecessary requests hitting the database per week when flags were evaluated, which we were attempting to avoid.

    ### Why was this scenario not caught by tests?
    Unit tests, end-to-end tests, and SDK-specific tests exist for this endpoint; however, the case where target groups are empty wasn't fully covered. The change was primarily meant to improve performance for the /client/evaluations endpoint, which uses this code path and which was manually tested and confirmed to work correctly. The /client/target-segments code path experiencing side effects from this change wasn't anticipated or caught by automated testing.

    ## Follow-up actions
    Follow-up items, with Jira IDs linked for tracking completion:
    * Test enhancements: add unit tests for target groups, covering none, one, and multiple groups.
    * Add new validation and logging to ensure valid JSON is returned by the endpoints.
  • Time: June 25, 2024, 5:16 p.m.
    Status: Resolved
    Update: Between 09:18 and 16:08 UTC, customers with no evaluation groups were seeing `NullPointerException` errors in the SDKs when pulling evaluation rules. In the scenario where a customer had 0 target groups, the /client/target-segments endpoint returned the value `null` instead of an empty array `[]`. A change was made in the db layer of the backend to return an empty array rather than a not-found error when no target groups exist for a customer. This caused a different codepath to be hit: one that copies all groups into a new array and changes some data before marshalling and returning the JSON response. Because no groups existed, this copy mistakenly produced a nil object instead of an empty array, which was then marshalled into the `null` JSON response. We can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
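The root cause above is a classic JSON marshalling pitfall. A minimal sketch, assuming a Go backend (the postmortem's "nil object" wording suggests Go; the types here are illustrative, not Harness's actual code): a copy loop that never appends leaves the result slice nil, and a nil slice marshals to `null`, while an explicitly initialised empty slice marshals to `[]`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type targetGroup struct {
	ID string `json:"id"`
}

// copyGroups mimics the buggy codepath: with zero input groups the
// result slice is never appended to, so it stays nil.
func copyGroups(in []targetGroup) []targetGroup {
	var out []targetGroup // nil until the first append
	for _, g := range in {
		out = append(out, g)
	}
	return out
}

func main() {
	buggy, _ := json.Marshal(copyGroups(nil))
	fmt.Println(string(buggy)) // null  <- what SDKs received

	fixed, _ := json.Marshal(make([]targetGroup, 0)) // explicit empty slice
	fmt.Println(string(fixed)) // []    <- what SDKs expected
}
```

The fix is to initialise the slice up front (e.g. `out := make([]targetGroup, 0, len(in))`) or normalise nil to an empty slice before marshalling, which is also the failure mode the "ensure valid JSON is returned" follow-up action guards against.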

Updates:

  • Time: June 5, 2024, 6:35 p.m.
    Status: Postmortem
    Update:
    ## Summary
    Pipeline executions across CI/CD were not advancing as anticipated, with some even getting stuck in the prod2 cluster.

    ## What was the issue?
    One of the microservices was running low on resources because a pipeline execution consumed excessive resources while attempting to resolve expressions. This caused all pipelines to run roughly 80% slower for the duration of the incident, with a few executions getting into an unresponsive state.

    ## Resolution
    The pipeline execution that was consuming excessive resources was aborted, and the service pods were restarted to recover the system.

    ## RCA
    A pipeline contained circular references in its files. The runtime resolution of these references caused excessive threads to enter the Waiting state. Although the issue was automatically detected (and the system auto-protected itself by breaking the circuit), the excess threads were still consumed because the loop-detection logic had a high threshold. We have since reduced the loop threshold and automated the blocking of such runaway pipeline executions (see the sketch after this update list).
  • Time: June 4, 2024, 12:32 a.m.
    Status: Resolved
    Update: We can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
  • Time: June 4, 2024, 12:27 a.m.
    Status: Monitoring
    Update: We are continuing to monitor for any further issues.
  • Time: June 4, 2024, 12:17 a.m.
    Status: Monitoring
    Update: Harness service issues have been addressed and normal operations have been resumed. We are monitoring the service to ensure normal performance continues.
  • Time: June 3, 2024, 11:49 p.m.
    Status: Identified
    Update: We have identified a potential cause of the pipeline service issues and are working hard to address it. Please continue to monitor this page for updates.
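The RCA above hinges on bounding recursive expression resolution. Here is a minimal sketch in Go, under stated assumptions: Harness has not published its resolver, and `maxDepth` is a hypothetical stand-in for the loop-detection threshold the postmortem mentions. The idea is that a depth cap turns a circular reference into a fast failure instead of a pile of waiting threads.

```go
package main

import (
	"fmt"
	"strings"
)

// resolve expands ${ref} placeholders in the expression stored under key.
// maxDepth stands in for the loop-detection threshold: once recursion
// exceeds it, we fail fast instead of leaving threads stuck in Waiting.
func resolve(exprs map[string]string, key string, depth, maxDepth int) (string, error) {
	if depth > maxDepth {
		return "", fmt.Errorf("possible circular reference while resolving %q", key)
	}
	val := exprs[key]
	for {
		start := strings.Index(val, "${")
		if start < 0 {
			return val, nil // nothing left to expand
		}
		end := strings.Index(val[start:], "}")
		if end < 0 {
			return val, nil // unterminated placeholder; leave as-is
		}
		ref := val[start+2 : start+end]
		inner, err := resolve(exprs, ref, depth+1, maxDepth)
		if err != nil {
			return "", err
		}
		val = val[:start] + inner + val[start+end+1:]
	}
}

func main() {
	exprs := map[string]string{"a": "prefix-${b}", "b": "${a}"} // a and b reference each other
	if _, err := resolve(exprs, "a", 0, 16); err != nil {
		fmt.Println("aborted:", err) // circuit breaks instead of hanging
	}
}
```

Lowering the threshold, as the postmortem describes, shortens how long a runaway resolution can consume resources before the circuit breaks.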

Updates:

  • Time: May 29, 2024, 6:20 p.m.
    Status: Postmortem
    Update:
    **What was the issue?**
    Pipeline executions were stuck and failing due to a Redis instance in the primary region (us-west1) becoming unresponsive. Impact was limited to pipeline executions in NextGen.

    **Timeline**

    | Time | Event |
    | --- | --- |
    | 28 May - 8:56 PM PDT | We received an alert on high memory utilization on our Redis instance. The team identified a particularly large cache key and attempted to manually delete it. This resulted in the Redis instance becoming unstable, and pipeline executions began failing or hanging. We are still working with our Redis service provider to understand why the key deletion made the instance unresponsive; such deletions have been done previously without issue, consistent with vendor advice. |
    | 28 May - 9:15 PM PDT | We engaged the Redis support team and executed a mitigation plan. |
    | 28 May - 9:40 PM PDT | We failed over our application to the secondary Redis instance. The system started to recover. Unfortunately, the secondary region also went into a bad state as we migrated load. |
    | 28 May - 10:02 PM PDT | The primary Redis instance became healthy. |
    | 28 May - 10:25 PM PDT | Application traffic was migrated back to the primary Redis region. This restored functionality. |

    **RCA & Action Items:**
    Pipelines started failing because the Redis instance in the primary region became unresponsive after the manual deletion of a large key. We are working with the vendor to determine why this action caused instability and will share more details when they become available.

    As an immediate action item, we are implementing an upper bound on cache key size, which will prevent high memory usage from large cache keys (see the sketch after this update list). In the longer term, we are revisiting our architecture to eliminate large keys in Redis altogether.
  • Time: May 29, 2024, 7:21 a.m.
    Status: Resolved
    Update: Harness services are now stable, and our internal sanity check has passed. We will publish more details as soon as our vendor partner, Redis, shares the RCA with us. We apologise for the disruption of service.
  • Time: May 29, 2024, 5:37 a.m.
    Status: Monitoring
    Update: Service issues have been addressed and normal operations have resumed. We are monitoring the service to ensure normal performance continues. Thank you for your patience!
  • Time: May 29, 2024, 5:29 a.m.
    Status: Identified
    Update: We have identified the issue to be with the Redis cache. We are working with the vendor to get this fixed, and the team is treating it with the utmost urgency.
  • Time: May 29, 2024, 4:36 a.m.
    Status: Identified
    Update: Pipeline executions are failing in Prod-1/2 due to a dependency failure.
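The immediate action item in the postmortem above (an upper bound on cache key size) can be illustrated with a minimal sketch, assuming a Go service. The cap value and the `setter` callback are hypothetical stand-ins for whatever limit and Redis client Harness actually uses:

```go
package main

import (
	"errors"
	"fmt"
)

// maxCacheBytes is a hypothetical cap standing in for the "upper bound
// on cache key size" named in the action items.
const maxCacheBytes = 1 << 20 // 1 MiB

var errValueTooLarge = errors.New("cache: value exceeds size cap, skipping cache")

// cacheSet refuses oversized values instead of letting them accumulate
// in Redis; setter stands in for the real Redis client's SET call.
func cacheSet(setter func(key string, val []byte) error, key string, val []byte) error {
	if len(val) > maxCacheBytes {
		return errValueTooLarge // caller falls back to the source of truth
	}
	return setter(key, val)
}

func main() {
	inMem := map[string][]byte{}
	set := func(k string, v []byte) error { inMem[k] = v; return nil }

	if err := cacheSet(set, "small", []byte("ok")); err != nil {
		fmt.Println(err)
	}
	if err := cacheSet(set, "huge", make([]byte, 2<<20)); err != nil {
		fmt.Println(err) // cache: value exceeds size cap, skipping cache
	}
}
```

Rejecting oversized entries at write time means the occasional cache miss, but the source of truth still answers, and no single key can balloon Redis memory to the point of instability.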

Updates:

  • Time: June 4, 2024, 11:28 p.m.
    Status: Postmortem
    Update:
    ## What was the issue?
    Pipelines in Prod1 were experiencing intermittent failures caused by gRPC connection issues between Harness services. The majority of the failed gRPC requests occurred between the CI Manager and Harness Manager (CG), resulting in a primary impact on CI pipelines.

    ## Timeline

    | Time | Event |
    | --- | --- |
    | May 28, 9:50 AM PDT | A customer reported intermittent pipeline failures. The team initiated an investigation but did not identify any issues with the infrastructure. The failure appeared isolated to one specific customer, and teams were promptly alerted to monitor that customer's pipelines. |
    | May 28, 3:05 PM PDT | The engineering team observed a few more occurrences of the issue across other pipelines. |
    | May 28, 3:30 PM PDT | The team decided to roll back a recent deployment in order to investigate any potential correlations. |
    | May 28, 5:15 PM PDT | The status for the Prod1 environment was updated to "degraded performance" due to the intermittent issues. |
    | May 28, 6:20 PM PDT | The issue was suspected to be related to kube-DNS resolution, resulting in some gRPC requests failing randomly. GCP support was engaged for further investigation. Service thread dumps were captured for internal debugging and revealed no suspicious findings. |
    | May 28, 7:00 PM PDT | Status moved to "monitoring". |

    ## RCA and Action Items
    Pipelines experienced failures as a result of internal service communication issues, despite multiple attempts. Initially, the engineering team suspected kube-DNS problems; however, after consulting with GCP support, this was ruled out. It was noted that certain service pod replicas were receiving an uneven distribution of requests. To tackle this, the following corrective actions are underway:
    1. Enhancing load balancing for gRPC calls among service pods (see the sketch after this update list).
    2. Incorporating traceId in delegate task submissions.
    Furthermore, we have set up gRPC-related alerts to catch similar situations in the future.
  • Time: May 29, 2024, 6:06 a.m.
    Status: Resolved
    Update: We can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
  • Time: May 29, 2024, 2:01 a.m.
    Status: Monitoring
    Update: Harness service issues have been addressed and normal operations have been resumed. We are monitoring the service to ensure normal performance continues.
  • Time: May 29, 2024, 1:21 a.m.
    Status: Identified
    Update: We have identified that the intermittent service disruption is limited to a small number of customers. Our team is actively engaged in investigating the root cause of the issue and is working diligently to restore full functionality as quickly as possible. We understand the importance of service reliability and the inconvenience this may cause for our affected customers. We apologize for any disruption this may have caused and appreciate your patience during this time. We will provide further updates as soon as we have more information or when the issue has been fully resolved.
  • Time: May 29, 2024, 12:15 a.m.
    Status: Investigating
    Update:
    Current Status: The team is actively investigating intermittent timeouts occurring in our pipeline services. Engineers are working to identify the root cause and implement a fix.
    User Impact: Some users may experience delays or failures when trying to access or use pipeline-related functionality. Not all users are impacted.
    We apologize for any inconvenience caused by these pipeline service timeouts. The team is treating this as a high-priority issue and is working diligently to restore full performance and stability as quickly as possible. Your patience is appreciated as we work through this matter.
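Corrective action 1 in the postmortem above (better load balancing for gRPC calls among service pods) is commonly implemented with client-side round-robin balancing, since a single long-lived HTTP/2 connection through a Kubernetes Service pins all RPCs to one replica. A minimal sketch in Go follows; the target address is hypothetical, and the source does not confirm this is the exact mechanism Harness chose:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// dns:/// makes the gRPC client resolve every pod IP behind a headless
	// Service, and round_robin spreads RPCs across them, instead of pinning
	// one long-lived HTTP/2 connection to a single replica.
	// The target below is illustrative; substitute the real service address.
	conn, err := grpc.Dial(
		"dns:///harness-manager.harness.svc.cluster.local:9879",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// ... create service stubs from conn as usual.
}
```

With per-pod resolution and round-robin picking, request load evens out across replicas, addressing the "uneven distribution of requests" the RCA describes.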

Updates:

  • Time: May 6, 2024, 6:47 a.m.
    Status: Postmortem
    Update:
    ## Summary
    A few trial Current Gen (CG) CD customers had difficulty logging in and accessing the Harness Secret Manager in our Prod-2 environment. There was no impact on Next Gen (NG) customers.

    ## Timeline

    | Time (UTC) | Event |
    | --- | --- |
    | 04:22 am | We received internal alerts for an increased error rate related to the Secret Manager. |
    | 06:09 am | An incident was raised when a trial customer reached out. |
    | 06:28 am | The root cause was identified. |
    | 07:34 am | The incident was resolved. |

    ## Resolution
    Upon referencing the code base, we identified a key configuration missing from our database records. The required data was restored from periodic snapshots.

    ## RCA
    As part of the official EOL for CG CD, a cleanup activity was performed in the backend database for all internal accounts. In the cleanup process, a legacy configuration used by the **Harness Secret Manager** was deleted.

    ## Action Items
    We will perform stringent checks on data before cleanup, followed by sanity testing to ensure no functionality is impacted.
  • Time: May 3, 2024, 7:22 a.m.
    Status: Resolved
    Update: The issue with the CG Secret Manager has been resolved for all customers as of May 03, 2024 - 00:10 PDT. Thank you for your patience while we worked to resolve the issue.
  • Time: May 3, 2024, 7:10 a.m.
    Status: Monitoring
    Update: Our engineering team has resolved the issue, and we are actively monitoring it.
  • Time: May 3, 2024, 6:19 a.m.
    Status: Identified
    Update: The issue has been identified and a fix is being implemented.

Check the status of similar companies and alternatives to Harness

UiPath

Systems Active

Scale AI

Systems Active

Notion

Systems Active

Brandwatch

Systems Active

Olive AI

Systems Active

Sisense

Systems Active

HeyJobs

Systems Active

Joveo

Systems Active

Seamless AI

Systems Active

hireEZ

Systems Active

Alchemy

Systems Active

Frequently Asked Questions - Harness

Is there a Harness outage?
The current status of Harness is: Systems Active
Where can I find the official status page of Harness?
The official status page for Harness is here
How can I get notified if Harness is down or experiencing an outage?
To get notified of any status changes to Harness, simply sign up for OutLogger's free monitoring service. OutLogger checks the official status of Harness every few minutes and will notify you of any changes. You can view the status of all your cloud vendors in one dashboard. Sign up here
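For reference, status pages like Harness's are typically hosted on Atlassian Statuspage, which exposes a machine-readable summary endpoint. A minimal polling sketch in Go, assuming the conventional /api/v2/status.json route on status.harness.io (verify the URL against the official page before relying on it):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// statusResponse matches the Statuspage v2 status.json shape.
type statusResponse struct {
	Status struct {
		Indicator   string `json:"indicator"` // none, minor, major, critical
		Description string `json:"description"`
	} `json:"status"`
}

func main() {
	// Assumed URL: Harness's status page is believed to be a Statuspage instance.
	const url = "https://status.harness.io/api/v2/status.json"
	for range time.Tick(5 * time.Minute) { // poll every few minutes
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("poll failed:", err)
			continue
		}
		var s statusResponse
		if err := json.NewDecoder(resp.Body).Decode(&s); err == nil {
			fmt.Printf("%s: %s\n", s.Status.Indicator, s.Status.Description)
		}
		resp.Body.Close()
	}
}
```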
What does Harness do?
Harness is a software delivery platform that enables engineers and DevOps to build, test, deploy, and verify software as needed.