Outage and incident data over the last 30 days for Harness.
OutLogger tracks the status of these components for Harness:
Component | Status |
---|---|
Service Reliability Management - Error Tracking FirstGen (fka OverOps) | Active |
Software Engineering Insights FirstGen (fka Propelo) | Active |
Prod 1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 2 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 3 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 4 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Prod Eu1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
View the latest incidents for Harness and check for official updates:
Description:
# Summary
CI pipelines using git connectors with a `Delegate` as the mode of connection did not update the status back to SCM/Git providers. Pipelines whose git connectors connect via the `Harness Platform` were not impacted.

# Mitigation
Steps taken to resolve the issue immediately:
1. Rolled back the delegate for the rings to which affected customers belong.
2. Rolled back the delegate for all rings.

# Detailed Timeline (PST)
| **Time** | **Event** |
| --- | --- |
| 10/05/2023 | |
| 3:30 AM | Delegate deployment |
| 10:53 AM | First customer ticket for status checks not reporting back |
| 11 AM - 12 PM | Two more customer tickets for the same issue |
| 12:20 PM | Incident channel created |
| 2:20 PM | Reproduced the issue: PR check reporting not working via delegate |
| 2:40 PM | Rolled back delegate in rings |
| 3:30 PM | Informed customers about the rollback |
| 3:40 PM | Customer confirmed restoration |

## RCA
### Why didn't certain CI pipelines update SCM/Git status?
* The delegate task created to update the status on Git failed, so the status was not reflected on the Git provider.

### Why did the delegate task fail?
* A change made to support the new Harness Code module introduced a missing dependency that failed at run time.

### Why didn't it impact all CI pipelines?
* Harness provides two ways to connect to Git providers: via the Harness Platform or via a Harness Delegate. Only CI pipelines using git connectors with a Delegate as the mode of connection failed to update the status back to SCM/Git providers, because the missing dependency was in the Delegate. Pipelines whose git connectors connect via the Harness Platform were unaffected.

### Why was the missing dependency not caught in the testing phase?
* We have automated tests that run pipelines via the Harness Delegate, but tests that check Git status updates in this mode of connectivity were missing.

## Steps taken
Harness CI engineers tried to reproduce the issue in-house with various infrastructure combinations (Kubernetes, Harness Cloud, Virtual Machines, etc.), but it took some time to realize it happens only when the git connector is set up to connect via the Harness Delegate instead of the Harness Platform. As soon as we realized this, we engaged the delegate engineering team, who helped revert the delegate to a previous version that did not include this code.

## Follow-up actions
Add automation to catch this case, and set up internal alerts so the issue can be handled proactively if it recurs.
Status: Postmortem
Impact: Minor | Started At: Oct. 4, 2023, 7:20 p.m.
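For context on what "updating the status back to SCM/Git providers" involves, below is a minimal, illustrative sketch of a commit-status update against GitHub's public REST API. This is not Harness's delegate implementation; the owner, repo, SHA, context, and token are placeholders.

```python
# Illustrative only: the kind of commit-status update a CI system reports back to a
# Git provider (GitHub shown here). Repo, SHA, and token values are placeholders.
import os
import requests


def report_commit_status(owner: str, repo: str, sha: str, state: str, context: str) -> None:
    """POST a commit status to GitHub (valid states: pending, success, failure, error)."""
    url = f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}"
    resp = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"state": state, "context": context, "description": "CI pipeline result"},
        timeout=10,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    report_commit_status("my-org", "my-repo", "abc123", "success", "ci/pipeline")
```

In the incident above, this kind of call was issued as a delegate task when the connector was configured to connect through the Delegate, which is why only that connectivity mode was affected.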
Description:
# Summary
CI/CD pipeline executions slowed down or timed out because of high latency on Redis calls from the log service.

# Mitigation
1. Engaged Redis support.
2. Increased the number of shards for the log service.
3. Monitored the primary shard until CPU usage returned to the expected range.

# Detailed Timeline (PST)
| **Time** | **Event** | **Notes** |
| --- | --- | --- |
| 8:42 AM | FireHydrant triggered for CI pipeline performance degradation | |
| 8:45 AM | Checked Redis memory - it was under the limit | |
| 9:05 AM | Determined that stream write calls were taking a very long time, resulting in longer execution times | |
| 9:25 AM | Created a P0 with Redis support | |
| 9:50 AM | Increased log-service memory | Since writes were taking longer, API payloads were still held in the log service, increasing memory usage |
| 10:30 AM | Redis support joined the call and requested shard logs to understand what was causing the high latency for Redis operations | Explained the chain of events, including the Redis memory increase from the previous week |
| 10:51 AM | Deployed a change to decrease the number of lines in a log stream | Temporary fix to decrease the size per stream |
| 11:15 AM | Discovered 100% CPU utilization on a Redis shard; performed a failover in an attempt to decrease CPU utilization - did not help | CPU had been at 100% since Friday (including the weekend, when load is low) |
| 12 PM - 1:30 PM | Gradually increased the number of shards; received logs for a 30-second window on the hot shard (requested at 10:30 AM) | Keys were distributed evenly across shards, but the hot shard's CPU utilization did not come down (hot shard 100%, other shards 30-40%); saw CRDT operation logs in the shard logs |
| 2:30 PM | Redis team still investigating; requested all shard logs and CRDT sync logs | |
| 3:34 PM | Received logs for a 30-second window for all shards | |
| 4:01 PM | Redis team pointed out that even though keys were distributed evenly, the hot shard was consuming more than twice the memory of the newly provisioned shards | Harness team pointed out that replication was out of sync and the hot shard's logs contained many CRDT.MERGE entries that were missing from the other shards' logs |
| 4:08 PM | Failover and primary got in sync | |
| 4:12 PM | CPU utilization started dropping, along with the high memory usage on the hot shard | Incident was marked as resolved |

# RCA
### Why were the pipelines running slow?
The Redis shard could not handle the log streaming load; its CPU was running at 100%, causing higher latency.

### Why was the Redis shard CPU running at 100%?
Observations from the call with the Redis support team:
1. On 09/29, we noticed the Redis instance was running close to capacity. The CloudOps team increased the size of the Redis instance to accommodate the increased load, but did not make the corresponding change on the secondary cluster.
2. Per the [log-service DB alerts](https://docs.google.com/spreadsheets/d/1uw9U-bqlaXZUv44jEW64tMHd2mU2s5oVXVIftQ3QaQ0/edit?usp=sharing) in Redis, sync had been failing since 9/29 and did not recover until Monday 10/02 (OOM and connection errors).
3. We suspect the sync process was in a bad state, keeping the CPU at 100% through the weekend as well. We are awaiting a detailed RCA from Redis support.

# Follow-up actions
1. Updated the failover's memory to match the primary.
2. Increased the number of shards.
3. Working on enhancing monitoring and alerts around latency spikes (a minimal probe is sketched after this incident record).
Status: Postmortem
Impact: Minor | Started At: Oct. 2, 2023, 5:06 p.m.
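As a companion to the follow-up action on latency monitoring, here is a minimal latency probe for Redis stream writes using the redis-py client. The stream name, endpoint, retention limit, and alert threshold are illustrative assumptions, not the log service's actual configuration.

```python
# A minimal sketch of timing Redis stream writes (XADD) so latency spikes surface early.
# Host, stream name, maxlen, and threshold are placeholders.
import time
import redis

client = redis.Redis(host="localhost", port=6379)

SLOW_THRESHOLD_MS = 50  # hypothetical alerting threshold


def timed_stream_write(stream: str, line: str) -> float:
    """Append one log line to a Redis stream and return the call latency in milliseconds."""
    start = time.monotonic()
    client.xadd(stream, {"line": line}, maxlen=5000, approximate=True)
    elapsed_ms = (time.monotonic() - start) * 1000
    if elapsed_ms > SLOW_THRESHOLD_MS:
        # Hook a real alert (metrics counter, pager, etc.) here instead of printing.
        print(f"slow XADD on {stream}: {elapsed_ms:.1f} ms")
    return elapsed_ms


if __name__ == "__main__":
    timed_stream_write("log-service:pipeline-123", "step started")
```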
Description:
## Impact
Users were unable to log in to [app.propelo.ai](http://app.propelo.ai/) and [app.levelops.io](http://app.levelops.io/) through the [https://app.propelo.ai/signin](https://app.propelo.ai/signin) flow. Data ingestion, data processing, propels, and any user already logged in in the US region were unaffected. The EMEA and APAC regions were not affected.

**Workaround**: Users could still log into the system via the [SEI](https://app.propelo.ai/auth/login-page) flow, as this issue impacted only the **/signin** flow of the application.

## Root Cause
One of the login flows is affected by tenant deletion tasks, because the login flow looks at all tenants to determine which tenants the user trying to log in has access to.
* If the users table of an available tenant doesn't exist, the flow fails entirely.
* During tenant deletions, several tenants were removed, but the entries marking those tenants as 'available' were not deleted.
* The connection to the DB was severed due to a VPN disconnect.

## Timeline
| **Time** | **Event** |
| --- | --- |
| 2023-10-02 05:45 AM PDT | The issue was resolved |
| 2023-10-02 05:40 AM PDT | The tenants marked for deletion were removed from the global list of tenants; operations returned to normal after the deletion of these unused tenant IDs |
| 2023-10-02 05:10 AM PDT | The issue was identified via internal testing and an incident was triggered |

## Action Items
* Institute a downtime window and an alerting mechanism to stakeholders for this maintenance activity.
* Perform verification / run sanity tests across tenants in the respective regions to ensure the app is up and running.
* Review the logic for /login and add guardrails around the check on the global tenants list (a sketch follows this incident record).
* Institute a process/tool for tenant de-provisioning on the legacy SEI module.
Status: Postmortem
Impact: Major | Started At: Oct. 2, 2023, 12:10 p.m.
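The guardrail action item could look roughly like the sketch below: skip tenants whose users table is missing instead of failing the entire /signin flow. The per-tenant schema layout, table names, and psycopg2-based access are hypothetical, not SEI's actual code.

```python
# A hypothetical guardrail: tolerate tenants that are mid-deletion (users table gone)
# rather than failing the whole sign-in flow. Schema/table names are assumptions.
import psycopg2


def users_table_exists(conn, tenant_schema: str) -> bool:
    """Return True if the tenant's users table is present."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM information_schema.tables "
            "WHERE table_schema = %s AND table_name = 'users'",
            (tenant_schema,),
        )
        return cur.fetchone() is not None


def tenants_for_login(conn, all_tenants: list[str]) -> list[str]:
    """Keep only tenants whose users table exists; log and skip the rest."""
    usable = []
    for tenant in all_tenants:
        if users_table_exists(conn, tenant):
            usable.append(tenant)
        else:
            print(f"skipping tenant {tenant}: users table missing (possibly mid-deletion)")
    return usable


if __name__ == "__main__":
    conn = psycopg2.connect("dbname=sei user=app")
    print(tenants_for_login(conn, ["tenant_a", "tenant_b"]))
```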
Description: The root cause was a DockerHub incident: [https://www.dockerstatus.com/pages/533c6539221ae15e3f000031](https://www.dockerstatus.com/pages/533c6539221ae15e3f000031)
Status: Postmortem
Impact: None | Started At: Sept. 28, 2023, 9:16 p.m.
Description:
# Overview
All customers experienced disruptions with Dashboards and Perspectives. However, users could still access CCM and use the other sections of the system.

# Timeline (PST)
| **Time** | **Event** |
| --- | --- |
| 10:17 AM Sept 28 2023 | Users reported slowness in Dashboards and Perspectives. The engineering team started the investigation. |
| 11:46 AM Sept 28 2023 | Found that GCP BigQuery was having autoscaling issues in the multi-region US location. |
| 12:52 PM Sept 28 2023 | BigQuery usage was redirected to a different service account from another GCP project that uses on-demand pricing. |
| ~2:30 PM Sept 28 2023 | Perspectives were restored and operational. Dashboards were functioning to some extent, though a few customers were still experiencing issues. |
| 5:12 PM Sept 28 2023 | Google reported the BigQuery issue resolved. |
| 5:12 PM Sept 28 2023 | Dashboards were fully restored and operational for all customers. |

# Resolution
Redirecting BigQuery usage to a service account in a different GCP project with on-demand billing resolved the problem.

# Affected Users
Users in Prod1 and Prod2 were impacted. Only the Perspectives and Dashboards features were affected; the rest of CCM operated without issues.

# RCA
Google BigQuery uses slot AutoScaling to increase slot availability for better performance. An incident with BigQuery hindered the slot AutoScaling functionality. Since CCM's Perspectives and Dashboards rely on BigQuery, the incident impacted their query response times.
Status: Postmortem
Impact: Critical | Started At: Sept. 28, 2023, 8:23 p.m.
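The mitigation of redirecting BigQuery usage to an on-demand project can be illustrated with the google-cloud-bigquery client, as in the sketch below. The project ID, key file path, dataset, and table are placeholders, not Harness's actual configuration.

```python
# A minimal sketch of billing BigQuery queries to a fallback project that uses on-demand
# pricing, via an alternate service account. All identifiers here are placeholders.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/secrets/on-demand-project-sa.json"
)
# The Client's project determines which GCP project is billed for the queries.
client = bigquery.Client(project="ccm-on-demand-project", credentials=credentials)

query = "SELECT COUNT(*) AS row_count FROM `ccm-on-demand-project.billing.cost_data`"
for row in client.query(query).result():
    print(row["row_count"])
```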