Outage and incident data over the last 30 days for Harness.
OutLogger tracks the status of these components for Harness:
Component | Status |
---|---|
Service Reliability Management - Error Tracking FirstGen (fka OverOps) | Active |
Software Engineering Insights FirstGen (fka Propelo) | Active |
Prod 1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 2 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 3 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 4 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Prod Eu1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
View the latest incidents for Harness and check for official updates:
Description:
# Summary
CI pipelines using git connectors with a `Delegate` as the mode of connection did not update the status back to SCM/Git providers. Pipelines whose git connectors connect via the `Harness Platform` were not impacted.

# Mitigation
Steps taken to resolve the issue immediately:
1. Rolled back the delegate for the rings to which affected customers belong.
2. Rolled back the delegate for all rings.

# Detailed Timeline (PST)
| **Time** | **Event** |
| --- | --- |
| 10/05/2023 | |
| 3:30 AM | Delegate deployment |
| 10:53 AM | First customer ticket for status checks not reporting back |
| 11 AM - 12 PM | Two more customer tickets for the same issue |
| 12:20 PM | Incident channel created |
| 2:20 PM | Reproduced the issue: PR check reporting not working via delegate |
| 2:40 PM | Rolled back delegate in rings |
| 3:30 PM | Informed customers about the rollback |
| 3:40 PM | Customer confirmed restoration |

## RCA
### Why didn't certain CI pipelines update SCM/Git status?
* The delegate task created to update the status on Git failed, so the status was not reflected on the Git provider.

### Why did the delegate task fail?
* A change made to support the new Harness Code module introduced a missing dependency that failed at run time.

### Why didn't it impact all CI pipelines?
* Harness provides two ways to connect to Git providers: via the Harness Platform or via a Harness Delegate. Only CI pipelines using git connectors with a Delegate as the mode of connection failed to update the status back to SCM/Git providers, because the missing dependency was in the Delegate. Pipelines whose git connectors connect via the Harness Platform were unaffected.

### Why was the missing dependency not caught in the testing phase?
* We have automated tests that run pipelines via the Harness Delegate, but tests that check Git status updates in this mode of connectivity were missing.

## Steps taken
Harness CI engineers tried to reproduce the issue in-house with various infrastructure combinations (Kubernetes, Harness Cloud, Virtual Machines, etc.), but it took some time to realize it happens only when the git connector is set up to connect via the Harness Delegate instead of the Harness Platform. As soon as we realized this, we engaged the delegate engineering team, who helped revert the delegate to a previous version that did not include this code.

## Follow-up actions
Add automation to catch this case, and set up internal alerts so the issue can be handled proactively if it recurs.
Status: Postmortem
Impact: Minor | Started At: Oct. 4, 2023, 7:20 p.m.
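For context on what "updating the status back to SCM/Git providers" involves, below is a minimal, illustrative sketch of a commit-status update against GitHub's public REST API. This is not Harness's delegate implementation; the owner, repo, SHA, context, and token are placeholders.

```python
# Illustrative only: the kind of commit-status update a CI system reports back to a
# Git provider (GitHub shown here). Repo, SHA, and token values are placeholders.
import os
import requests


def report_commit_status(owner: str, repo: str, sha: str, state: str, context: str) -> None:
    """POST a commit status to GitHub (valid states: pending, success, failure, error)."""
    url = f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}"
    resp = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"state": state, "context": context, "description": "CI pipeline result"},
        timeout=10,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    report_commit_status("my-org", "my-repo", "abc123", "success", "ci/pipeline")
```

In the incident above, this kind of call was issued as a delegate task when the connector was configured to connect through the Delegate, which is why only that connectivity mode was affected.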
Description:
# Summary
CI/CD pipeline executions slowed down or timed out because of high latency on Redis calls from the log service.

# Mitigation
1. Engaged Redis support.
2. Increased the number of shards for the log service.
3. Monitored the primary shard until CPU usage returned to the expected range.

# Detailed Timeline (PST)
| **Time** | **Event** | **Notes** |
| --- | --- | --- |
| 8:42 AM | FireHydrant triggered for CI pipeline performance degradation | |
| 8:45 AM | Checked Redis memory - it was under the limit | |
| 9:05 AM | Determined that stream write calls were taking a very long time, resulting in longer execution times | |
| 9:25 AM | Created a P0 with Redis support | |
| 9:50 AM | Increased log-service memory | Since writes were taking longer, API payloads were still held in the log service, increasing memory usage |
| 10:30 AM | Redis support joined the call and requested shard logs to understand what was causing the high latency for Redis operations | Explained the chain of events, including the Redis memory increase from the previous week |
| 10:51 AM | Deployed a change to decrease the number of lines in a log stream | Temporary fix to decrease the size per stream |
| 11:15 AM | Discovered 100% CPU utilization on a Redis shard; performed a failover in an attempt to decrease CPU utilization - did not help | CPU had been at 100% since Friday (including the weekend, when load is low) |
| 12 PM - 1:30 PM | Gradually increased the number of shards; received logs for a 30-second window on the hot shard (requested at 10:30 AM) | Keys were distributed evenly across shards, but the hot shard's CPU utilization did not come down (hot shard 100%, other shards 30-40%); saw CRDT operation logs in the shard logs |
| 2:30 PM | Redis team still investigating; requested all shard logs and CRDT sync logs | |
| 3:34 PM | Received logs for a 30-second window for all shards | |
| 4:01 PM | Redis team pointed out that even though keys were distributed evenly, the hot shard was consuming more than twice the memory of the newly provisioned shards | Harness team pointed out that replication was out of sync and the hot shard's logs contained many CRDT.MERGE entries that were missing from the other shards' logs |
| 4:08 PM | Failover and primary got in sync | |
| 4:12 PM | CPU utilization started dropping, along with the high memory usage on the hot shard | Incident was marked as resolved |

# RCA
### Why were the pipelines running slow?
The Redis shard could not handle the log streaming load; its CPU was running at 100%, causing higher latency.

### Why was the Redis shard CPU running at 100%?
Observations from the call with the Redis support team:
1. On 09/29, we noticed the Redis instance was running close to capacity. The CloudOps team increased the size of the Redis instance to accommodate the increased load, but did not make the corresponding change on the secondary cluster.
2. Per the [log-service DB alerts](https://docs.google.com/spreadsheets/d/1uw9U-bqlaXZUv44jEW64tMHd2mU2s5oVXVIftQ3QaQ0/edit?usp=sharing) in Redis, sync had been failing since 9/29 and did not recover until Monday 10/02 (OOM and connection errors).
3. We suspect the sync process was in a bad state, keeping the CPU at 100% through the weekend as well. We are awaiting a detailed RCA from Redis support.

# Follow-up actions
1. Updated the failover's memory to match the primary.
2. Increased the number of shards.
3. Working on enhancing monitoring and alerts around latency spikes (a minimal probe is sketched after this incident record).
Status: Postmortem
Impact: Minor | Started At: Oct. 2, 2023, 5:06 p.m.
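As a companion to the follow-up action on latency monitoring, here is a minimal latency probe for Redis stream writes using the redis-py client. The stream name, endpoint, retention limit, and alert threshold are illustrative assumptions, not the log service's actual configuration.

```python
# A minimal sketch of timing Redis stream writes (XADD) so latency spikes surface early.
# Host, stream name, maxlen, and threshold are placeholders.
import time
import redis

client = redis.Redis(host="localhost", port=6379)

SLOW_THRESHOLD_MS = 50  # hypothetical alerting threshold


def timed_stream_write(stream: str, line: str) -> float:
    """Append one log line to a Redis stream and return the call latency in milliseconds."""
    start = time.monotonic()
    client.xadd(stream, {"line": line}, maxlen=5000, approximate=True)
    elapsed_ms = (time.monotonic() - start) * 1000
    if elapsed_ms > SLOW_THRESHOLD_MS:
        # Hook a real alert (metrics counter, pager, etc.) here instead of printing.
        print(f"slow XADD on {stream}: {elapsed_ms:.1f} ms")
    return elapsed_ms


if __name__ == "__main__":
    timed_stream_write("log-service:pipeline-123", "step started")
```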
Description:
## Impact
Users were unable to log in to [app.propelo.ai](http://app.propelo.ai/) and [app.levelops.io](http://app.levelops.io/) through the [https://app.propelo.ai/signin](https://app.propelo.ai/signin) flow. Data ingestion, data processing, propels, and any user already logged in in the US region were unaffected. The EMEA and APAC regions were not affected.

**Workaround**: Users could still log into the system via the [SEI](https://app.propelo.ai/auth/login-page) flow, as this issue impacted only the **/signin** flow of the application.

## Root Cause
One of the login flows is affected by tenant deletion tasks, because the login flow looks at all tenants to determine which tenants the user trying to log in has access to.
* If the users table of an available tenant doesn't exist, the flow fails entirely.
* During tenant deletions, several tenants were removed, but the entries marking those tenants as 'available' were not deleted.
* The connection to the DB was severed due to a VPN disconnect.

## Timeline
| **Time** | **Event** |
| --- | --- |
| 2023-10-02 05:45 AM PDT | The issue was resolved |
| 2023-10-02 05:40 AM PDT | The tenants marked for deletion were removed from the global list of tenants; operations returned to normal after the deletion of these unused tenant IDs |
| 2023-10-02 05:10 AM PDT | The issue was identified via internal testing and an incident was triggered |

## Action Items
* Institute a downtime window and an alerting mechanism to stakeholders for this maintenance activity.
* Perform verification / run sanity tests across tenants in the respective regions to ensure the app is up and running.
* Review the logic for /login and add guardrails around the check on the global tenants list (a sketch follows this incident record).
* Institute a process/tool for tenant de-provisioning on the legacy SEI module.
Status: Postmortem
Impact: Major | Started At: Oct. 2, 2023, 12:10 p.m.
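The guardrail action item could look roughly like the sketch below: skip tenants whose users table is missing instead of failing the entire /signin flow. The per-tenant schema layout, table names, and psycopg2-based access are hypothetical, not SEI's actual code.

```python
# A hypothetical guardrail: tolerate tenants that are mid-deletion (users table gone)
# rather than failing the whole sign-in flow. Schema/table names are assumptions.
import psycopg2


def users_table_exists(conn, tenant_schema: str) -> bool:
    """Return True if the tenant's users table is present."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM information_schema.tables "
            "WHERE table_schema = %s AND table_name = 'users'",
            (tenant_schema,),
        )
        return cur.fetchone() is not None


def tenants_for_login(conn, all_tenants: list[str]) -> list[str]:
    """Keep only tenants whose users table exists; log and skip the rest."""
    usable = []
    for tenant in all_tenants:
        if users_table_exists(conn, tenant):
            usable.append(tenant)
        else:
            print(f"skipping tenant {tenant}: users table missing (possibly mid-deletion)")
    return usable


if __name__ == "__main__":
    conn = psycopg2.connect("dbname=sei user=app")
    print(tenants_for_login(conn, ["tenant_a", "tenant_b"]))
```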
Description: The root cause was a DockerHub incident: [https://www.dockerstatus.com/pages/533c6539221ae15e3f000031](https://www.dockerstatus.com/pages/533c6539221ae15e3f000031)
Status: Postmortem
Impact: None | Started At: Sept. 28, 2023, 9:16 p.m.
Description:
# Overview
All customers experienced disruptions with Dashboards and Perspectives. However, users could still access CCM and use the other sections of the system.

# Timeline (PST)
| **Time** | **Event** |
| --- | --- |
| 10:17 AM Sept 28 2023 | Users reported slowness in Dashboards and Perspectives. The engineering team started the investigation. |
| 11:46 AM Sept 28 2023 | Found that GCP BigQuery was having autoscaling issues in the multi-region US location. |
| 12:52 PM Sept 28 2023 | BigQuery usage was redirected to a different service account from another GCP project that uses on-demand pricing. |
| ~2:30 PM Sept 28 2023 | Perspectives were restored and operational. Dashboards were functioning to some extent, though a few customers were still experiencing issues. |
| 5:12 PM Sept 28 2023 | Google reported the BigQuery issue resolved. |
| 5:12 PM Sept 28 2023 | Dashboards were fully restored and operational for all customers. |

# Resolution
Redirecting BigQuery usage to a service account in a different GCP project with on-demand billing resolved the problem.

# Affected Users
Users in Prod1 and Prod2 were impacted. Only the Perspectives and Dashboards features were affected; the rest of CCM operated without issues.

# RCA
Google BigQuery uses slot AutoScaling to increase slot availability for better performance. An incident with BigQuery hindered the slot AutoScaling functionality. Since CCM's Perspectives and Dashboards rely on BigQuery, the incident impacted their query response times.
Status: Postmortem
Impact: Critical | Started At: Sept. 28, 2023, 8:23 p.m.
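The mitigation of redirecting BigQuery usage to an on-demand project can be illustrated with the google-cloud-bigquery client, as in the sketch below. The project ID, key file path, dataset, and table are placeholders, not Harness's actual configuration.

```python
# A minimal sketch of billing BigQuery queries to a fallback project that uses on-demand
# pricing, via an alternate service account. All identifiers here are placeholders.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/secrets/on-demand-project-sa.json"
)
# The Client's project determines which GCP project is billed for the queries.
client = bigquery.Client(project="ccm-on-demand-project", credentials=credentials)

query = "SELECT COUNT(*) AS row_count FROM `ccm-on-demand-project.billing.cost_data`"
for row in client.query(query).result():
    print(row["row_count"])
```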