Get notified about any outages, downtime or incidents for Harness and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Harness.
OutLogger tracks the status of these components for Harness:
Component | Status |
---|---|
Service Reliability Management - Error Tracking FirstGen (fka OverOps) | Active |
Software Engineering Insights FirstGen (fka Propelo) | Active |
Prod 1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 2 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 3 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 4 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Prod Eu1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
View the latest incidents for Harness and check for official updates:
Description:

# Impact

Prod-1 customers using the Test Intelligence or Test Reports management feature with CI pipelines from 7:48 AM to 8:08 AM PST noticed their pipelines failing.

# Mitigation

* Restarted the prod1 ti-service pods.
* Doubled the ti-service prod1 memory to 24Gi.
* Added profiler to figure out the root cause.

# Timeline

| **Time** | **Event** |
| --- | --- |
| 7:03 AM Sept 28 2023 | Engineering got alerted about Test Intelligence Service restarts. |
| 7:45 AM Sept 28 2023 | Internal FireHydrant Incident created to check degraded performance of Test Intelligence service. Test report uploads taking longer than usual. |
| 7:48 AM Sept 28 2023 | Test reports upload failing due to Test Intelligence service being down. More restarts. |
| 8:08 AM Sept 28 2023 | Crash loop is stopped and engineering is monitoring the pod resource consumption. Test Intelligence service function becomes normal. Proactive measures to increase pod memory and replica counts started. |
| 11:45 AM Sept 28 2023 | Memory profiler support added for further debugging. |
| 3:00 PM Sept 28 2023 | Root cause determined and additional action items noted. |

# Root Cause

### **Why did some CI steps fail to upload test reports?**

A Harness internal service called `Test Intelligence service`, which is responsible for intelligent test selection and also test report management, went into a crash loop.

### Why was the Test Intelligence service crashing?

A significant increase in memory usage was identified, leading to the Test Intelligence service entering a crash loop. As a consequence, requests related to Test Selection and Test Reports management experienced failures, disrupting the CI pipeline steps reliant on these functionalities.

### What caused the significant increase in memory usage?

Harness recently added Test Intelligence support for pipelines triggered by git Push, but it unintentionally skipped deleting old database entries, causing an Out-of-memory issue and service restarts. This affected CI pipelines using Test Intelligence and those reliant on test reports.

# Follow-up action Items:

* Took proactive steps to mitigate the issue by increasing resources assigned to the service and cleaning up stale database entries to reduce any further impact.
* Fixed the issue causing the entries to keep accumulating and deployed it.
* We are working on optimizing database queries to only fetch required fields from each entry to reduce memory consumption.
* Add alerts for proactively monitoring database sizes.
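The follow-up items describe three general techniques: purging stale entries, fetching only the fields a query needs, and alerting on database size. The sketch below illustrates them with Python's built-in `sqlite3` module. The postmortem does not name the database, schema, or service code, so the table `test_entries`, its columns, and every function name here are hypothetical and are not Harness's implementation.

```python
import sqlite3
import time


def make_db():
    """Create an in-memory database with a hypothetical schema (an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE test_entries ("
        " id INTEGER PRIMARY KEY,"
        " pipeline_id TEXT NOT NULL,"
        " created_at REAL NOT NULL,"
        " payload BLOB NOT NULL)"
    )
    return conn


def purge_stale(conn, max_age_seconds, now=None):
    """Delete entries older than max_age_seconds and return how many were removed."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "DELETE FROM test_entries WHERE created_at < ?",
        (now - max_age_seconds,),
    )
    conn.commit()
    return cur.rowcount


def fetch_ids_for_pipeline(conn, pipeline_id):
    """Select only the small columns the caller needs, leaving the payload on disk."""
    rows = conn.execute(
        "SELECT id, created_at FROM test_entries"
        " WHERE pipeline_id = ? ORDER BY id",
        (pipeline_id,),
    )
    return rows.fetchall()


def exceeds_size_threshold(conn, max_rows):
    """Return True when the table holds more than max_rows rows (an alert condition)."""
    (count,) = conn.execute("SELECT COUNT(*) FROM test_entries").fetchone()
    return count > max_rows
```

In a setup like this, `purge_stale` would run on a schedule, and `exceeds_size_threshold` would feed whatever alerting system is in place. Selecting named columns instead of `SELECT *` keeps large payloads out of the service's memory when only identifiers are needed.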
Status: Postmortem
Impact: Major | Started At: Sept. 28, 2023, 2:48 p.m.
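The postmortem gives the impact window as 7:48 AM to 8:08 AM PST, while the record above lists a 2:48 p.m. start. The two agree under two assumptions: that the record's time is UTC, and that the postmortem's times are Pacific Daylight Time (UTC-7), which the US Pacific zone observes in late September. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the postmortem's "PST" times are Pacific Daylight Time (UTC-7).
PACIFIC_DAYLIGHT = timezone(timedelta(hours=-7), "PDT")


def to_utc(year, month, day, hour, minute):
    """Interpret a wall-clock time as PDT and convert it to UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=PACIFIC_DAYLIGHT)
    return local.astimezone(timezone.utc)


impact_start = to_utc(2023, 9, 28, 7, 48)
impact_end = to_utc(2023, 9, 28, 8, 8)

# Length of the customer-facing impact window in whole minutes.
impact_minutes = int((impact_end - impact_start).total_seconds() // 60)
```

Under these assumptions the impact began at 14:48 UTC and lasted 20 minutes.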