Last checked: 2 minutes ago
Outage and incident data over the last 30 days for Harness.
OutLogger tracks the status of these components for Harness:
Component | Status |
---|---|
Service Reliability Management - Error Tracking FirstGen (fka OverOps) | Active |
Software Engineering Insights FirstGen (fka Propelo) | Active |
Prod 1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 2 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 3 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 4 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Prod Eu1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
View the latest incidents for Harness and check for official updates:
Description:

# Overview

Our hosted builds run in a multicloud environment on virtual machines (VMs). We have a fallback for cases where a VM fails to initialize in a specific cloud or region. We were notified that a customer pipeline was experiencing initialization failures. The failures occurred in specific cases where our primary failed and the fallback to the secondary failed as well.

| **Time (PST)** | **Event** |
| --- | --- |
| 12/07/2023 11:40 AM | Notified of initialization failures |
| 12/07/2023 2:40 PM | Added an additional fallback to another region to fix initialization failures |

# Resolution

We added multiple levels of fallback to mitigate the VM initialization issue.

# Affected Accounts

A total of 15 customers were affected. Only a few pipeline executions failed because of this, so we are recording a partial outage of 3 hours for Prod1 and 2 hours 38 minutes for Prod2.

# RCA

All of our fallbacks for initializing the VMs used by Harness CI build pipelines were failing. We added additional fallbacks to mitigate the VM initialization issue and worked on fixing the issues causing a higher failure rate in the existing fallbacks.

# Action Items

* Review fallbacks for VM initialization and make them fail-safe.
* Improve alerting for VM initialization failures. The previous alerting triggered on a percentage failure rate. It has been updated to alert on even a single VM initialization failure.
Status: Postmortem
Impact: Major | Started At: Dec. 7, 2023, 8:08 p.m.
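The multi-level fallback and per-failure alerting described in the postmortem above can be sketched as follows. This is a minimal illustration, not Harness's implementation: `VMInitError`, `init_with_fallback`, the pool names, and the `on_failure` callback are all hypothetical names introduced here.

```python
from typing import Callable, List, Sequence, Tuple


class VMInitError(Exception):
    """Raised when a VM cannot be initialized in a given pool (hypothetical)."""


def init_with_fallback(
    pools: Sequence[str],
    init_vm: Callable[[str], str],
    on_failure: Callable[[str, Exception], None],
) -> Tuple[str, str]:
    """Try each pool in order and return (pool, vm_id) for the first success.

    Every individual failure is reported through on_failure, so alerting
    does not depend on a percentage failure rate.
    """
    failed: List[str] = []
    for pool in pools:
        try:
            return pool, init_vm(pool)
        except VMInitError as exc:
            on_failure(pool, exc)  # alert on each single failure
            failed.append(pool)
    raise VMInitError(f"VM initialization failed in all pools: {failed}")


# Usage with a fake initializer in which the first two pools are unavailable.
def fake_init(pool: str) -> str:
    if pool in {"cloud-a/region-1", "cloud-b/region-1"}:
        raise VMInitError(f"no capacity in {pool}")
    return f"vm-in-{pool}"


alerts: List[str] = []
pool, vm = init_with_fallback(
    ["cloud-a/region-1", "cloud-b/region-1", "cloud-a/region-2"],
    fake_init,
    lambda p, exc: alerts.append(p),
)
print(pool, vm, alerts)
```

Ordering the pools explicitly makes it easy to add another fallback level, and the final exception lists every pool that was tried, which helps when all levels fail at once.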
Description:

**Overview**

Harness customers in the Prod-2 cluster reported that the Project Overview dashboard was down. The issue affected only the Dashboard API for all Prod-2 customers. All other critical functions remained unaffected.

**Timeline**

| Time | Event |
| --- | --- |
| Nov 27, 9:30 AM UTC | Issue reported on customer Slack channels. |
| Nov 27, 9:49 AM UTC | Incident acknowledged internally and investigation started. |
| Nov 27, 9:57 AM UTC | Rolled back the system deployment, which immediately resolved the issue. |
| Nov 27, 10:08 AM UTC | Incident marked resolved on the Harness Status Page. |

**Resolution**

The latest deployment was rolled back to restore the Project Overview dashboard within 8 minutes of the issue being reported.

**Root Cause Analysis (RCA)**

The issue was observed after the manager service release on 27 Nov on Prod-2. The change included an enhancement to the dashboard service that was incompatible with the service the dashboard service consumes data from. Due to a coordination oversight in release orchestration, an API contract incompatibility between these services was introduced.

**Action Items**

Our Architecture board will review the deployment management of inter-dependent services and of services that use common libraries to avoid similar issues.
Status: Postmortem
Impact: Minor | Started At: Nov. 27, 2023, 10:02 a.m.
Description:

## Overview

On November 13, 2023, a failure in CCM Cloud Connectors impacted customers with currency preferences enabled. The external API responsible for fetching currency rates failed, which caused data ingestion to fail. The incident affected only data ingestion. All other CCM features remained unaffected. The incident was resolved within ~2 hours and 15 minutes, with no reported downtime.

## Timeline

| **Time** | **Event** |
| --- | --- |
| 2023-11-13, 02:15 PM UTC | Issue first reported on a Slack channel for a customer account |
| 2023-11-13, 02:30 PM UTC | Incident acknowledged and internal investigation initiated |
| 2023-11-13, 03:17 PM UTC | Root cause identified |
| 2023-11-13, 03:29 PM UTC | Temporary fix raised, followed by deployment of latest code |
| 2023-11-13, 06:55 PM UTC | Data replayed for all affected customers |

## Root Cause Analysis (RCA)

The incident originated from the failure of an external API that fetches currency rates, which impacted data ingestion for CCM Cloud Connectors with currency preferences. An external API is used for currency rates because conversion rates change dynamically. The incident was exacerbated by the failure of the fallback mechanism: backup currency rates had not been populated for the current month.

## Follow-up Actions

1. Add better fallback mechanisms for currency rates.
2. Add monitoring for the external public API.
Status: Postmortem
Impact: Minor | Started At: Nov. 13, 2023, 2:09 p.m.
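One way the "better fallback mechanisms for currency rates" follow-up above could look is a lookup that falls back to the most recent earlier month when the current month's backup is missing. This is a sketch under stated assumptions, not Harness's design: `resolve_rates`, the `'YYYY-MM'` month keys, the source labels, and the rate values are all made up for illustration.

```python
from typing import Callable, Dict, Tuple

Rates = Dict[str, float]


def resolve_rates(
    month: str,
    fetch_live: Callable[[str], Rates],
    backup: Dict[str, Rates],
) -> Tuple[Rates, str]:
    """Return (rates, source) for a 'YYYY-MM' month.

    Order: live API, backup for the same month, then the most recent
    earlier backup month. The source label lets callers flag data that
    should be replayed once fresh rates are available.
    """
    try:
        return fetch_live(month), "live"
    except Exception:
        pass  # a real system would log and alert here
    if month in backup:
        return backup[month], "backup"
    earlier = [m for m in backup if m < month]  # 'YYYY-MM' sorts chronologically
    if earlier:
        return backup[max(earlier)], "stale-backup"
    raise LookupError(f"no currency rates available for {month}")


# Usage: the external API is down and the current month has no backup entry.
def failing_api(month: str) -> Rates:
    raise ConnectionError("external rates API unavailable")


backup_rates = {"2023-09": {"EUR": 0.94}, "2023-10": {"EUR": 0.95}}  # made-up values
rates, source = resolve_rates("2023-11", failing_api, backup_rates)
print(rates, source)
```

Returning the source alongside the rates keeps ingestion running on slightly stale data while still recording which records need a replay.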
Description:

## Overview

On November 7, 2023, a delay in the Autostopping feature of the Harness CCM platform was observed. It affected certain customers and caused Autostopping rules under fixed schedules not to start at the expected time. The issue was traced to the system not correctly accounting for daylight saving time. The incident was resolved within 6 hours, with no actual downtime experienced by customers.

## Timeline

| **Time** | **Event** |
| --- | --- |
| November 7, 2023, 6:00 PM IST | Customer reported that Autostopping rules were not initiating as expected. |
| November 7, 2023, 6:27 PM IST | Incident response initiated. |
| November 7, 2023, 7:00 PM IST | Confirmed customers could manually start resources. |
| November 7, 2023, 7:00 PM IST | Issue identified: fixed schedules were not accounting for daylight saving time. |
| November 7, 2023, 8:00 PM IST | Ensured there were no infrastructure issues. |
| November 7, 2023, 8:30 PM IST | Generated new cron entries that account for daylight saving time. |
| November 7, 2023, 11:45 PM IST | Completed regenerating cron entries for all schedules. |

## Root Cause Analysis

The delay in Autostopping rules starting under fixed schedules was due to the system not accounting for daylight saving time. The issue was traced through the following steps:

1. Affected Autostopping rules were under fixed schedules with warm-up operations starting one hour before the expected time.
2. The use of a time zone with daylight saving time (America/New_York) caused the schedule to start an hour earlier than expected.
3. The initial generation of cron entries for schedules did not consider daylight saving time.
4. The system erroneously executed the idle time job because the schedule was triggered early, despite being under a fixed schedule.

## Action Items

1. The team is working on a long-term fix so that daylight saving time is automatically considered for fixed schedules.
2. Include daylight saving time computation in our test plan.
Status: Postmortem
Impact: Minor | Started At: Nov. 7, 2023, 1:11 p.m.
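The daylight saving effect described in the root cause analysis above can be reproduced with Python's standard `zoneinfo` module. This is a minimal sketch: the helper `utc_fire_time` and the 9:00 AM start time are illustrative assumptions, not details from the incident. US daylight saving time ended on November 5, 2023, so the same local time maps to a different UTC hour before and after that date.

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo


def utc_fire_time(day: date, local_time: time, tz_name: str) -> datetime:
    """Convert a local wall-clock schedule time on a given day to UTC."""
    local = datetime.combine(day, local_time, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)


start = time(9, 0)  # illustrative schedule start, not taken from the incident
before = utc_fire_time(date(2023, 11, 3), start, "America/New_York")
after = utc_fire_time(date(2023, 11, 7), start, "America/New_York")
print(before.hour, after.hour)  # 13 14
```

A cron entry generated once as 13:00 UTC would therefore fire at 8:00 AM local time after the change, one hour early. Recomputing the UTC time per occurrence from the named time zone avoids this.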
Description:

## Overview

We received a report from one of our customers about issues with their pipeline executions using step groups in our Prod-1 cluster. Details about the incident are below.

## Timeline (PST)

| **Time** | **Description** |
| --- | --- |
| 11:52 am PST | One customer reported an issue with their pipeline executions involving step groups. |
| 12:11 pm PST | The issue was traced to the previous deployment. |
| 12:15 pm PST | Rollback was initiated for the impacted module. |
| 12:20 pm PST | Rollback completed. |
| 12:30 pm PST | The customer confirmed that the issue was resolved. |

## Resolution

Harness engineers rolled back the latest deployment of one of the internal modules, which immediately eliminated the Expression Engine failure and resolved the issue with pipeline executions.

## RCA

We identified a problem in our Expression Engine flow: a crucial condition was missing that accounts for the possibility of metadata being null when a step, step group, or stage does not have a strategy defined. The issue occurs when the setting ‘_Enable Json Support for expressions_’ is enabled. We resolved the incident by rolling back the latest deployment of the affected module, which had been released earlier in the day.

## Action Items

1. Our automation suite will be enhanced to capture scenarios for the Expression Engine flow and to include the expressions used by our customers.
2. Update our documentation on the best practices for utilising expressions.
Status: Postmortem
Impact: Minor | Started At: Nov. 3, 2023, 6:52 p.m.