Last checked: 8 minutes ago
Outage and incident data over the last 30 days for Harness.
OutLogger tracks the status of these components for Harness:
Component | Status |
---|---|
Service Reliability Management - Error Tracking FirstGen (fka OverOps) | Active |
Software Engineering Insights FirstGen (fka Propelo) | Active |
Prod 1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 2 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 3 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 4 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Prod Eu1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
View the latest incidents for Harness and check for official updates:
Description: **Overview:** Security Testing Orchestration (STO) and IaCM modules impacted.

**What was the issue?** The STO and IaCM modules could not complete execution, causing pipeline executions to time out. The Redis keys had been rotated, but the two microservices backing these modules were still using the older keys.

**Timeline:**

| Time | Event |
| --- | --- |
| 25th Apr 2024 7:03 AM PDT | Issue was noticed and investigation started. |
| 25th Apr 2024 7:35 AM PDT | Issue identified. |
| 25th Apr 2024 7:43 AM PDT | Issue resolved for STO; monitoring continued. |
| 25th Apr 2024 7:49 AM PDT | Issue resolved for IaCM; monitoring continued. |
| 25th Apr 2024 8:00 AM PDT | All modules declared operational. |

**Resolution:** The STO and IaCM modules were updated to use the new keys.

**RCA & Action Items:** The two microservices were missed in the update because their configuration formats differed between QA and Production, and our change management process did not account for this discrepancy. As part of the improvement process, we will standardize configurations across environments and add key-rotation checks to the change management process. (An illustrative sketch of rotation-aware credential handling follows this entry.)
Status: Postmortem
Impact: Major | Started At: April 25, 2024, 2:03 p.m.
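The postmortem above attributes the outage to services holding on to pre-rotation Redis credentials. Below is a minimal, hypothetical sketch (not Harness's actual implementation) of a client that re-reads its credential from a mounted secret and reconnects when authentication fails after a rotation. The secret path, class, and helper names are assumptions for illustration.

```python
# Sketch: reload a rotated Redis credential on authentication failure.
# Assumes the `redis` package and a hypothetical secret mount path.
from pathlib import Path

import redis
from redis.exceptions import AuthenticationError

SECRET_PATH = Path("/etc/secrets/redis-password")  # hypothetical mount point


def load_current_password() -> str:
    """Read the latest rotated credential from the secret mount."""
    return SECRET_PATH.read_text().strip()


class RotatingRedisClient:
    def __init__(self, host: str = "localhost", port: int = 6379):
        self._host, self._port = host, port
        self._client = self._connect()

    def _connect(self) -> redis.Redis:
        return redis.Redis(host=self._host, port=self._port,
                           password=load_current_password())

    def get(self, key: str):
        try:
            return self._client.get(key)
        except AuthenticationError:
            # Credential was rotated underneath us: reload and retry once.
            self._client = self._connect()
            return self._client.get(key)
```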
Description: **Issue Description:** Customers experienced a disruption in the Azure copy of cloud cost data due to authentication failures caused by incorrect token formatting.

**Resolution Time:** 3 hours 48 minutes

**Root Cause Analysis:**

* The token format had changed, leading to authentication failures.
* The format changed because of a caching issue in the component that populates the configmap, which produced a malformed configmap consumed by the Azure data sync.
* This was a partial outage affecting a small number of customers.

**Prevention Measures and Follow-Up Actions:**

1. Improve metric accuracy and alerting for data-sync jobs.
2. Enhance our automation suites.

**Conclusion:** The issue stemmed from token formatting inconsistencies; the tokens have been corrected and preventive measures put in place to avoid future disruptions. (A sketch of the kind of token validation involved follows this entry.)
Status: Postmortem
Impact: Minor | Started At: April 4, 2024, 2:38 p.m.
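The root cause above is a malformed token being published into the config used for Azure data sync. As a hedged illustration only, the sketch below validates a token's shape before publishing the config, so a malformed value fails fast rather than breaking authentication downstream. The expected format (three dot-separated base64url segments, i.e. JWT-like) and all names are assumptions, not the actual Harness component.

```python
# Sketch: refuse to publish a sync config carrying a malformed token.
# The three-segment base64url format is an illustrative assumption.
import base64
import binascii


def looks_like_valid_token(token: str) -> bool:
    """Return True if the token has three non-empty base64url-decodable segments."""
    parts = token.split(".")
    if len(parts) != 3 or not all(parts):
        return False
    for part in parts:
        padded = part + "=" * (-len(part) % 4)  # restore stripped padding
        try:
            base64.urlsafe_b64decode(padded)
        except (binascii.Error, ValueError):
            return False
    return True


def publish_sync_config(token: str, write_config) -> None:
    """Validate the token before it reaches the Azure data-sync config."""
    if not looks_like_valid_token(token):
        raise ValueError("refusing to publish config: sync token is malformed")
    write_config({"azure_sync_token": token})
```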
Description: ## Summary

Pipeline executions across CI/CD/STO were not advancing as expected, and some were stuck, for various customers in the prod2 cluster.

## Timeline

| Time (PST) | Event |
| --- | --- |
| 08:50 am | System instability alert was received and investigation was initiated. |
| 09:20 am | Identified the configuration that led to increased load on our systems. |
| 10:00 am | Increased resource allocation to help with the increased load. |
| 10:15 am | Corrected the invalid configuration in the system. |
| 11:00 am | Systems back to normal. |

## RCA

The Harness pipeline engine functions within a microservice ecosystem, working alongside various framework components to manage expressions. These expressions often involve variables and configuration files, which can be stored in Git repositories. One of these configuration files contained a self-referential expression. This recursive reference repeatedly requested resolution of the same configuration file, triggering a loop that exhausted the service's resources.

## Resolution

We refactored the configuration to remove the recursive reference and restarted the service. Additionally, we deployed hotfixes to prevent such configurations from being reintroduced and implemented mechanisms to auto-detect and halt recursion within the service. (A sketch of such a recursion guard follows this entry.)

## Additional Action Items

To expedite RCA and mitigate incidents promptly, we are implementing additional logging and alerting to detect these specific instabilities, improving our ability to identify and address issues swiftly.
Status: Postmortem
Impact: Major | Started At: March 26, 2024, 4 p.m.
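The RCA above describes an expression that referenced itself and drove the resolver into an unbounded loop. The following is a minimal sketch of the kind of recursion guard mentioned in the resolution, not the Harness pipeline engine itself; the expression syntax, limits, and names are illustrative assumptions.

```python
# Sketch: resolve ${...} expressions while detecting self-referential (cyclic)
# definitions and bounding nesting depth, so recursion fails fast instead of
# exhausting resources.
import re

EXPR = re.compile(r"\$\{([^}]+)\}")
MAX_DEPTH = 25  # hard ceiling even for legitimately deep nesting


class RecursiveExpressionError(ValueError):
    pass


def resolve(key: str, config: dict, _stack: tuple = ()) -> str:
    if key in _stack:
        raise RecursiveExpressionError(
            f"cycle detected: {' -> '.join(_stack + (key,))}")
    if len(_stack) >= MAX_DEPTH:
        raise RecursiveExpressionError(f"expression nesting exceeds {MAX_DEPTH}")

    value = str(config[key])
    # Replace every ${name} in the value by recursively resolving it.
    return EXPR.sub(lambda m: resolve(m.group(1), config, _stack + (key,)), value)


# Example: a config entry that references itself triggers the guard.
cfg = {"repo_url": "https://git.example/${repo_url}", "env": "prod"}
try:
    resolve("repo_url", cfg)
except RecursiveExpressionError as err:
    print(err)  # cycle detected: repo_url -> repo_url
```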
Description: ## Summary

The Harness Overview page failed to load intermittently in the Prod-2 cluster; critical functionality remained unaffected.

## Timeline

| Time (UTC) | Event |
| --- | --- |
| 01:33 PM | Received internal alerts for synthetic monitoring failure. |
| 01:47 PM | An internal incident was raised. |
| 01:48 PM | The root cause was identified. |
| 02:16 PM | Incident was resolved. |

## Resolution

The CPU-intensive queries were terminated in the backend database to resume normal operations. (An illustrative sketch of this mitigation follows this entry.)

## RCA

The Overview dashboard failed to retrieve data from the backend database because CPU utilisation reached critical levels, causing significant delays in processing regular queries. We isolated database configurations which, in combination with the application's retry mechanism, led to undue load on the database server and left it in an unhealthy state.

## Action Items

1. We have modified the database configuration, which shows significant promise per initial observations.
2. We are moving the queries to horizontally scaled database nodes.
3. Further database optimizations are being planned for the application's calls.
Status: Postmortem
Impact: Minor | Started At: March 20, 2024, 2:01 p.m.
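The resolution above was to terminate the CPU-heavy queries in the backend database. The postmortem does not name the database, so the sketch below is purely an assumption-laden illustration of that mitigation for a PostgreSQL backend using psycopg2; the threshold and DSN handling are hypothetical.

```python
# Sketch: terminate active queries that have been running longer than a cutoff,
# assuming a PostgreSQL backend (not stated in the postmortem).
import psycopg2

THRESHOLD = "5 minutes"  # illustrative cutoff for "long-running"


def terminate_long_running_queries(dsn: str) -> list[int]:
    """Terminate active queries older than THRESHOLD; return the pids terminated."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT pid, pg_terminate_backend(pid)
            FROM pg_stat_activity
            WHERE state = 'active'
              AND pid <> pg_backend_pid()
              AND now() - query_start > %s::interval
            """,
            (THRESHOLD,),
        )
        return [row[0] for row in cur.fetchall()]
```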
Description: #### Overview

The Customer Overview page was loading slowly in the Prod-2 cluster. All other critical functions remained unaffected.

#### Timeline

#### What was the issue?

The incident occurred when the dashboard failed to retrieve data from the backend database, which was traced back to database CPU utilization exceeding 90%; this critical level of utilization triggered alerts. The surge in CPU usage was primarily due to increased load from the application's operations. The simultaneous demands on database resources significantly constrained its ability to process requests efficiently.

#### Resolution

To mitigate the issue and restore normal operations, the long-running queries contributing to the high CPU utilization were terminated immediately, and the number of data-consuming services was temporarily reduced. These measures decreased the load on the database, allowing operations to resume at a normal pace and restoring dashboard data retrieval.

#### Action Items

In response to this incident, the following action items have been identified and are being implemented to prevent recurrence and improve system resilience (a sketch of item 1 follows this entry):

1. **Distribute database load:** To better manage and distribute the incoming query load, especially during peak times, we will distribute database queries across two database instances.
2. **Annotate logs for better analysis:** We are enhancing our logging strategy by annotating logs with details that help identify patterns in query behavior, enabling more granular analysis of how queries interact with database resources.
Status: Postmortem
Impact: Minor | Started At: March 14, 2024, 4:07 p.m.
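Action item 1 above is to spread query load across two database instances. As a hedged sketch only (not Harness's implementation), the snippet below round-robins read-only queries across two hypothetical replica DSNs, again assuming PostgreSQL via psycopg2.

```python
# Sketch: route read-only dashboard queries across two database instances
# in round-robin fashion. DSNs and naming are hypothetical.
import itertools

import psycopg2

READ_DSNS = [
    "postgresql://overview_ro@db-replica-1/dashboards",  # hypothetical replica 1
    "postgresql://overview_ro@db-replica-2/dashboards",  # hypothetical replica 2
]
_next_dsn = itertools.cycle(READ_DSNS)


def run_read_query(sql: str, params: tuple = ()):
    """Send each read-only query to the next replica in turn."""
    with psycopg2.connect(next(_next_dsn)) as conn, conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchall()
```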