Get notified about any outages, downtime or incidents for Harness and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Harness.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Harness:
Component | Status |
---|---|
Service Reliability Management - Error Tracking FirstGen (fka OverOps) | Active |
Software Engineering Insights FirstGen (fka Propelo) | Active |
Prod 1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 2 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 3 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 4 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Prod Eu1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
View the latest incidents for Harness and check for official updates:
Description: After a deployment on the prod-3 cluster, the NextGen UI became stuck on the initial loading screen. The issue was observed immediately during post-deployment sanity checks. We identified the problem as required static resources failing to load. This release included a change to how we build and load the UI for different environments: the source for static files was made configurable per environment. An incompatible configuration for the prod-3 cluster prevented the correct URL from being formed, resulting in 404s for our JS resources. We mitigated the incident by updating the service configuration for this environment and re-deploying the NextGen UI service. With the new configuration, the UI service generated the correct URLs and the issue was resolved.

### Timeline

| Time (UTC) | Event |
| --- | --- |
| 12:44 AM | Incident was first detected after the new deployment. An internal incident was raised, and the team started looking into the issue. |
| 12:46 AM | Root cause identified and the fix was deployed. |
| 12:47 AM | Incident resolved. |

### Action Items

* Audit the service configurations for all environments, with the aim of minimizing differences between them.
* Improve the NextGen UI build process to handle incompatible configurations (an illustrative configuration sketch follows this incident's details).
Status: Postmortem
Impact: Major | Started At: Jan. 30, 2024, 12:44 a.m.
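The root cause above was a per-environment static-asset base URL that, when misconfigured, produced broken resource URLs and 404s. Below is a minimal, hypothetical sketch (not Harness's actual build or service code) of resolving and validating such a setting at startup so a bad value fails fast or falls back to a safe default; the `STATIC_BASE_URL` variable and the default path are assumptions.

```python
import os
from urllib.parse import urljoin, urlparse

# Hypothetical per-environment setting; names are illustrative, not Harness's actual config.
DEFAULT_STATIC_BASE = "/static/"

def resolve_static_base(env: str) -> str:
    """Return the base URL used to load JS/CSS bundles for this environment.

    Falls back to a relative default when the value is missing, and rejects
    malformed values at startup instead of emitting URLs that 404 at page load.
    """
    configured = os.environ.get("STATIC_BASE_URL", "").strip()
    if not configured:
        return DEFAULT_STATIC_BASE
    parsed = urlparse(configured)
    # Accept absolute http(s) URLs or absolute paths; anything else is incompatible.
    if (parsed.scheme in ("http", "https") and parsed.netloc) or configured.startswith("/"):
        return configured if configured.endswith("/") else configured + "/"
    raise ValueError(f"Incompatible STATIC_BASE_URL for env {env!r}: {configured!r}")

def asset_url(env: str, filename: str) -> str:
    """Build the full URL for a static asset such as 'main.js'."""
    return urljoin(resolve_static_base(env), filename)

if __name__ == "__main__":
    print(asset_url("prod-3", "main.js"))
```

The point of the sketch is the design choice: validate the per-environment value once at startup rather than discovering the problem as missing resources in the browser.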
Description: **Incident Summary:** On January 29, 2024, a disruption occurred in the Prod 2 environment, affecting the execution of AutoStopping rules. Users reported issues, resulting in a total downtime of 56 minutes. The incident was promptly addressed, with a resolution time of 1 hour and 17 minutes.

**Timeline of Events:**

| Timestamp (UTC) | Event |
| --- | --- |
| January 29, 2024, 06:13 AM | FireHydrant incident was opened. |
| January 29, 2024, 06:13 AM | Incident acknowledged, and internal investigation initiated on the incident Slack channel. |
| January 29, 2024, 06:24 AM | Root cause identified: a component critical for rule execution encountered errors. |
| January 29, 2024, 06:57 AM | Immediate resolution applied to address the identified component issue. |
| January 29, 2024, 07:20 AM | System stability restored; rule executions were near optimal. |
| January 29, 2024, 07:34 AM | FireHydrant incident closed, and the incident marked as resolved. |

**Root Cause Analysis:** The incident originated in the AutoStopping feature in the Prod 2 environment, where a component crucial for rule execution failed. This disrupted rule operations and prevented messages from transitioning to the enqueued state. The system relies on a data store that had difficulty persisting data, leading to operational failures. The root cause was a capacity limitation in a specific data storage component, which could not handle the increased volume of messages during the incident.

**Immediate Resolution:** To address the incident promptly, the team increased the capacity of the affected component, allowing rule operations to be processed quickly and the issue to be resolved.

**Preventive Measures:** To prevent similar incidents, the team has implemented enhanced monitoring to receive timely notifications of potential capacity issues, and is taking proactive steps to ensure the system can handle increased load (a hedged queue-depth monitoring sketch follows this incident's details).

**Conclusion:** The incident was resolved through immediate action to increase resource capacity. The team is committed to proactive monitoring to prevent similar occurrences and to keep the system stable and reliable for all users.
Status: Postmortem
Impact: Critical | Started At: Jan. 29, 2024, 6:26 a.m.
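The preventive measure described above is, in essence, alerting on queue utilization before capacity runs out. As a minimal sketch, assuming a simple in-process queue and a hypothetical `send_capacity_alert` hook, this shows the kind of depth check that would page before a backlog becomes an outage; it is not the actual AutoStopping implementation, and the capacity numbers are illustrative.

```python
import queue
import threading
import time

# Hypothetical capacity values; the real limits and alerting backend are assumptions.
MAX_CAPACITY = 10_000
ALERT_THRESHOLD = 0.8  # alert at 80% utilization

rule_queue: "queue.Queue[str]" = queue.Queue(maxsize=MAX_CAPACITY)

def send_capacity_alert(depth: int) -> None:
    """Stand-in for paging/alerting (e.g. an incident or Slack notification)."""
    print(f"ALERT: rule queue at {depth}/{MAX_CAPACITY} messages")

def monitor_queue(interval_seconds: float = 5.0) -> None:
    """Periodically check queue depth and alert before capacity is exhausted."""
    while True:
        depth = rule_queue.qsize()
        if depth >= MAX_CAPACITY * ALERT_THRESHOLD:
            send_capacity_alert(depth)
        time.sleep(interval_seconds)

def enqueue_rule_execution(rule_id: str) -> bool:
    """Enqueue a rule execution; report a full queue instead of blocking silently."""
    try:
        rule_queue.put_nowait(rule_id)
        return True
    except queue.Full:
        send_capacity_alert(rule_queue.qsize())
        return False

if __name__ == "__main__":
    threading.Thread(target=monitor_queue, daemon=True).start()
    enqueue_rule_execution("autostopping-rule-42")
```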
Description: ## Summary

The 'Customer Overview Page' was loading slowly in the Prod-2 cluster. All other critical functions remained unaffected.

## Timeline

| Time (UTC) | Event |
| --- | --- |
| 04:30 PM | We received an alert, and the customer also reported the issue. |
| 04:45 PM | An internal incident was raised, and the team started looking into the issue. |
| 05:11 PM | Root cause identified. |
| 06:04 PM | Incident resolved. |

## Resolution

The CPU-intensive maintenance task and the long-running queries were terminated to resume normal operations.

## RCA

The dashboard failed to retrieve data from the backend database because CPU utilization had exceeded 90%. The alert entered the system as a Warning event and was overlooked. The CPU spike was caused by maintenance tasks, sub-optimal queries running on the primary node, and a large number of active connections from the application side. We proceeded after validating that the queries and the maintenance task could be terminated without any potential data loss.

## Action Items

1. We have moved the maintenance tasks to the secondary node.
2. We are addressing the long-running queries coming from the application side.
3. We are implementing a server-side timeout for long-running queries (a hedged sketch follows this incident's details).
4. We will ensure such alerts immediately trigger an incident to the on-call engineer.
Status: Postmortem
Impact: None | Started At: Jan. 18, 2024, 5:14 p.m.
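Action item 3 above, a server-side timeout for long-running queries, prevents a single runaway dashboard query from pinning database CPU. The postmortem does not name the database engine, so the sketch below assumes a PostgreSQL-style backend with the psycopg2 driver; the DSN, the table name, and the 5-second limit are purely illustrative.

```python
import psycopg2

# Illustrative DSN; replace with the real connection string.
DSN = "dbname=app user=app host=db.internal"

def open_connection_with_timeout(statement_timeout_ms: int = 5000):
    """Open a connection that aborts any statement running longer than the limit.

    PostgreSQL cancels the statement server-side once statement_timeout elapses,
    so a runaway query cannot hold CPU indefinitely.
    """
    conn = psycopg2.connect(DSN)
    with conn.cursor() as cur:
        cur.execute("SET statement_timeout = %s", (statement_timeout_ms,))
    conn.commit()
    return conn

if __name__ == "__main__":
    conn = open_connection_with_timeout()
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM customer_overview")  # hypothetical table
        print(cur.fetchone())
```

The same limit can instead be set globally or per database role; the per-connection form shown here keeps the timeout scoped to the application's own queries.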
Description: **Overview**

Multiple Harness customers in the Prod-2 cluster reported 500 errors while accessing or trying to run pipelines, and licensing information was also inaccessible.

**Timeline**

| Time | Event |
| --- | --- |
| 8 Jan 7:23 AM UTC | Issue was reported internally alongside customer reports. |
| 8 Jan 7:23 AM UTC | Internal incident created. |
| 8 Jan 7:23 AM UTC | Rolled back the system deployment, which immediately resolved the issue. |
| 8 Jan 7:28 AM UTC | Internal incident resolved. |

**Resolution**

We rolled back our latest system deployment, which resolved the issue within 5 minutes of it being reported.

**Root Cause Analysis**

After our manager service release, a change to the licensing resource resulted in cache failures. The License API is called to fetch license information and check service entitlements. The addition of new fields to the license resource caused cache failures, which surfaced as unhandled exceptions.

**Action Items**

* We have implemented exception handling around the API calls so that cache failures no longer break the service (a hedged sketch follows this incident's details).
* Review cache management during software releases to avoid such failures.
Status: Postmortem
Impact: Major | Started At: Jan. 8, 2024, 7:23 a.m.
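The first action item above amounts to treating the cache as best-effort: if a cached license entry cannot be read or decoded (for example because the resource gained new fields), fall back to the License API instead of letting the exception surface as a 500. The sketch below is a generic illustration of that pattern; `fetch_license_from_api`, the in-memory cache, and the field names are hypothetical, not Harness's actual service code.

```python
import json
import logging
from typing import Any, Dict

logger = logging.getLogger("license")

# Hypothetical in-memory cache standing in for a real cache service.
_license_cache: Dict[str, str] = {}

def fetch_license_from_api(account_id: str) -> Dict[str, Any]:
    """Stand-in for the real License API call (source of truth)."""
    return {"accountId": account_id, "edition": "ENTERPRISE", "newField": "value"}

def get_license(account_id: str) -> Dict[str, Any]:
    """Return license info without letting a cache failure escape as a 500.

    Any error while reading or decoding the cached entry is logged and treated
    as a cache miss; the API result then refreshes the cache best-effort.
    """
    try:
        cached = _license_cache.get(account_id)
        if cached is not None:
            return json.loads(cached)
    except Exception:  # corrupted or incompatible cached payload
        logger.warning("license cache read failed for %s; falling back to API", account_id)

    license_info = fetch_license_from_api(account_id)
    try:
        _license_cache[account_id] = json.dumps(license_info)
    except Exception:
        logger.warning("license cache write failed for %s", account_id)
    return license_info

if __name__ == "__main__":
    _license_cache["acct-1"] = "{not valid json"  # simulate an incompatible cached entry
    print(get_license("acct-1"))
```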
Description: **Incident Summary:** There was a recent incident involving delays in the evaluation of Asset Governance rules, stemming from a queue build-up that caused temporary slowness in rule execution.

**Timeline:**

* **2024-01-04 06:18 PM UTC:** Incident reported.
* **2024-01-04 06:20 PM UTC:** Incident acknowledged; investigation initiated.
* **2024-01-04 06:20 PM UTC:** Root cause identified.
* **2024-01-04 06:39 PM UTC:** Immediate resolution applied to expedite job processing.
* **2024-01-04 06:48 PM UTC:** Queue size normalized; incident resolved.

**Root Cause Analysis:** The delay was traced to a build-up in the job queue used by the CCM Asset Governance feature. The feature uses an asynchronous execution model backed by a job queue: rule executions are enqueued, and workers asynchronously dequeue jobs to perform the actual rule evaluations.

**Analysis:** The queue build-up was most pronounced for specific types of evaluations, and customers noticed slowness in Asset Governance execution.

**Immediate Resolution:** To promptly address the issue, the team increased the replica count for the services involved, allowing jobs to be consumed from the queue more quickly.

**Total Downtime:** There was no downtime during the incident.

**Follow-up Actions:**

1. Implement separate queues for ad-hoc queries and enforcements/recommendations (a hedged sketch follows this incident's details).
2. Enhance telemetry and metrics monitoring, including alerts on queue lengths for the various queue types.
3. Continue investigating improvements to asynchronous job execution for faster evaluations.
Status: Postmortem
Impact: Minor | Started At: Jan. 4, 2024, 6:20 p.m.
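Follow-up action 1 above is queue isolation: ad-hoc evaluations get a queue of their own so a burst of enforcement jobs cannot delay them. The sketch below illustrates the idea with in-process queues and one worker per queue; the queue names, job shape, and routing rule are assumptions, not the CCM Asset Governance implementation.

```python
import queue
import threading
from dataclasses import dataclass

@dataclass
class EvaluationJob:
    rule_id: str
    kind: str  # "adhoc" or "enforcement" (illustrative categories)

# Separate queues so ad-hoc evaluations are not stuck behind bulk enforcements.
adhoc_queue: "queue.Queue[EvaluationJob]" = queue.Queue()
enforcement_queue: "queue.Queue[EvaluationJob]" = queue.Queue()

def evaluate(job: EvaluationJob) -> None:
    """Stand-in for the actual rule evaluation."""
    print(f"evaluated {job.kind} rule {job.rule_id}")

def worker(q: "queue.Queue[EvaluationJob]") -> None:
    """Drain one queue; scaling replicas maps to starting more of these workers."""
    while True:
        job = q.get()
        try:
            evaluate(job)
        finally:
            q.task_done()

def submit(job: EvaluationJob) -> None:
    """Route each job to the queue for its workload type."""
    (adhoc_queue if job.kind == "adhoc" else enforcement_queue).put(job)

if __name__ == "__main__":
    for q in (adhoc_queue, enforcement_queue):
        threading.Thread(target=worker, args=(q,), daemon=True).start()
    submit(EvaluationJob("cleanup-unused-volumes", "enforcement"))
    submit(EvaluationJob("preview-run", "adhoc"))
    adhoc_queue.join()
    enforcement_queue.join()
```

Per-queue depth metrics then give exactly the kind of alerting described in follow-up action 2.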
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.