Last checked: 2 minutes ago
Get notified about any outages, downtime or incidents for Harness and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Harness.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for Harness:
Component | Status |
---|---|
Service Reliability Management - Error Tracking FirstGen (fka OverOps) | Active |
Software Engineering Insights FirstGen (fka Propelo) | Active |
Prod 1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 2 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 3 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 4 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Prod Eu1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
View the latest incidents for Harness and check for official updates:
Description:
#### Overview
We experienced a service disruption in our production environment, specifically impacting Redis memory usage in our freemium offering.

#### What was the issue?
The core of the issue was the Redis memory in prod2 (freemium) reaching near-full capacity. This led to operational failures in dependent services, primarily due to Redis running out of memory (OOM). Root cause analysis identified a significant increase in memory consumption by one of the Redis streams (`freemium:streams:DEBEZIUM_idpMongo.idp-harness.backstageCatalog`), which started consuming an unusually high amount of memory (~6 GB) following the latest release of the idp-service (version 1.6.0). Pipeline-service-related caches were also found to be consuming more memory than anticipated.

#### Timeline
| Time (IST) | Event |
| --- | --- |
| 1st March 11:45 PM | STO uptime monitoring failed with Redis OOM |
| 1st March 11:53 PM | FH triggered |
| 1st March 11:54 PM | Pipeline failures reported due to Redis OOM |
| 2nd March 2:02 AM | Redis events framework database memory was increased by 25% |
| 2nd March 2:03 AM | Issue resolved after the memory increase |
| 3rd March 12:36 AM | Debezium service was bounced with an updated config that disabled IDP Mongo collections streaming |
| 3rd March 1:11 AM | The stream `freemium:streams:DEBEZIUM_idpMongo.idp-harness.backstageCatalog` was trimmed in prod2 to reclaim memory |

#### Resolution
The immediate resolution involved increasing the memory allocated to the Redis events framework database by 25% and disabling the stream flow that was consuming excessive memory. This resolved the incident within two hours.

#### Action Items
Following this incident, we are taking several steps to prevent recurrence (a sketch of the trim and per-stream alert check follows this entry):
* Rigorously validate each release's impact on Redis memory usage in both QA and PROD environments.
* Investigate and rectify the increased message size in the `backstageCatalog` stream when published to Redis.
* Establish alerts for individual streams to promptly notify the relevant teams.
* The Pipeline team will conduct a thorough review of streams related to their services, including `webhook_events_stream` and `git_push_event_stream`.
Status: Postmortem
Impact: Minor | Started At: March 1, 2024, 6:40 p.m.
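The trim in the timeline above corresponds to Redis's `XTRIM` command, and the per-stream alerting action item can be approximated with `MEMORY USAGE`. Below is a minimal sketch using the `redis-py` client; the stream name is taken from the postmortem, but the connection details, the 1 GB alert threshold, and the `maxlen` target are illustrative assumptions, not Harness's actual settings.

```python
# Hypothetical sketch: alert on an oversized Redis stream and trim it.
# Threshold and maxlen are illustrative, not Harness's actual values.
import redis

STREAM = "freemium:streams:DEBEZIUM_idpMongo.idp-harness.backstageCatalog"
MEMORY_LIMIT_BYTES = 1 * 1024**3  # assumed 1 GB alert threshold

r = redis.Redis(host="localhost", port=6379)  # assumed connection details

# MEMORY USAGE reports the approximate bytes held by the key (None if absent).
used = r.memory_usage(STREAM) or 0
print(f"{STREAM}: {used / 1024**2:.1f} MiB, {r.xlen(STREAM)} entries")

if used > MEMORY_LIMIT_BYTES:
    # XTRIM with an approximate maxlen lets Redis trim lazily and cheaply.
    removed = r.xtrim(STREAM, maxlen=10_000, approximate=True)
    print(f"Trimmed {removed} entries to reclaim memory")
```

In production such a check would run on a schedule and page the owning team rather than trim unconditionally, matching the "alerts for individual streams" action item.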
Description:
**Overview:** CI pipelines failed after the rollout of new version 1.15.3.

**What was the issue?** CI Manager build 1.15.3 included a change related to a code rewrite. Due to this change, there was a backward incompatibility between the two builds: pipelines whose plan was created on the earlier version but executed on the newer one failed.

**Timeline:**

| Time | Event |
| --- | --- |
| Feb 27 2024 3:09:33 PM IST | The new CI Manager version was deployed to prod |
| Feb 27 2024 3:12 PM IST | The internal CI sanity check failed; an internal incident was created |
| Feb 27 2024 3:22 PM IST | CI Manager was reverted to the older version |

**Resolution:** We rolled back the release immediately when the internal sanity check failed.

**RCA & Action Items:** Add automated/manual checks covering pipelines that span a deployment transition, so that such incompatibilities are caught ahead of time (a hypothetical version-skew guard is sketched after this entry).
Status: Postmortem
Impact: Minor | Started At: Feb. 27, 2024, 10:17 a.m.
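The action item above amounts to catching version skew between the build that creates a pipeline plan and the build that executes it. Harness's internal plan format is not public, so the following is only a hypothetical sketch of such a guard: plans are stamped with the version that created them, and the executor fails fast on an unsupported version instead of failing mid-execution. All names and version strings here are invented for illustration.

```python
# Hypothetical sketch: fail fast on plan/executor version skew.
# Field names and version strings are invented for illustration.
from dataclasses import dataclass

SUPPORTED_PLAN_VERSIONS = {"1.15.2", "1.15.3"}  # assumed compatibility set

@dataclass
class Plan:
    pipeline_id: str
    created_by_version: str  # stamped by the build that created the plan

def execute(plan: Plan, executor_version: str) -> None:
    # Reject incompatible plans up front instead of failing mid-execution.
    if plan.created_by_version not in SUPPORTED_PLAN_VERSIONS:
        raise RuntimeError(
            f"Plan {plan.pipeline_id} was created by version "
            f"{plan.created_by_version}, which executor {executor_version} "
            "does not support; re-create the plan or roll back."
        )
    print(f"Executing {plan.pipeline_id} on executor {executor_version}")

execute(Plan("build-42", "1.15.3"), executor_version="1.15.3")
```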
Description:
**Overview:** Hosted CI builds on MacOS were failing to initialise in all environments.

**What was the issue?**
* To fix the registry issue that recurred on 10 Feb 2024 ([postmortem](https://manage.statuspage.io/pages/zmnt6tkys0q0/incidents/sd82v3lmcqqd#postmortem)), we attempted a release of the Anka controller.
* The controller is required to initialise the VMs and manage their state, so with it down the Mac pipelines were not functional.
* After debugging, we identified network configuration issues that had to be re-configured to make the controller accessible.

**Timeline:**

| Time | Event |
| --- | --- |
| 22nd Feb 2024 8:13 AM IST | We started the controller deployment to fix the issue |
| 22nd Feb 2024 8:24 AM IST | We noticed the controller was not coming up, so we started reverting the release |
| 22nd Feb 2024 8:30 AM IST | The revert did not work, so an internal incident was created and an investigation started |
| 22nd Feb 2024 12:11 PM IST | Dlite deployment bounced with the new network changes on all clusters |

**Resolution:** We resolved the network configuration issues for the controller, after which it was accessible.

**RCA & Action Items:** As part of the improvements, we will move this to a high-availability setup. We will also update the alerting and monitoring around this workflow to capture such issues immediately.
Status: Postmortem
Impact: Major | Started At: Feb. 22, 2024, 3:03 a.m.
Description:
## Overview
Hosted CI builds on MacOS were failing to initialize in all environments.

## Timeline

## Resolution
A server used for MacOS build-farm orchestration caused the image repository to be unavailable. The server was made operational and the system restored.

## RCA & Action Items
As part of the improvements, we will move this to a high-availability setup. We will also update the alerting and monitoring around this workflow to capture such issues immediately.
Status: Postmortem
Impact: Minor | Started At: Feb. 12, 2024, 3:20 a.m.
Description:
**Incident Summary:** Due to an increase in traffic, the Feature Flag metrics service experienced a period of high latency. The service was unable to scale up quickly enough to handle the additional load automatically, which caused it to become slow and return errors. Once the team identified the problem, the cloud engineer manually scaled up the service and it was restored.

**Timeline:**

| Time (UTC) | Event |
| --- | --- |
| 18:11 | A large number of requests was seen coming through the network |
| 18:14 | The service entered a degraded state, returning increased errors and latency |
| 18:14 | The on-call engineer was alerted and began investigating |
| 18:24 | The service was manually scaled up to handle the load |
| 18:24 | The development team began the RCA |
| 18:41 | All requests returned to normal operational behaviour |
| 18:41 | Incident resolved |

**Root Cause Analysis:** The incident originated from an increased rate of requests on the Prod 1 environment, which put the Feature Flag metrics service into a degraded state. While the service has auto-scaling in place, the suddenness and size of the increase made the automated scaling insufficient, and manual intervention was required (a hypothetical manual scale-up is sketched after this entry).

**Immediate Resolution:** To address the incident promptly, the team increased the resource capacity of the affected service until it was able to resume normal operations.

**Preventive Measures:** While the team works on longer-term improvements, resources in the affected cluster have been adjusted to better handle sudden traffic spikes.

**Action Items:** We have identified a number of bottlenecks that contributed to the incident, and the development team is actively working on improvements.
Status: Postmortem
Impact: Minor | Started At: Feb. 5, 2024, 6:35 p.m.
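The immediate fix here, manually scaling the service ahead of the autoscaler, corresponds to bumping a deployment's replica count. A minimal sketch with the official Kubernetes Python client is below; the deployment name, namespace, and replica count are illustrative assumptions, since the postmortem does not say how the metrics service is deployed.

```python
# Hypothetical sketch: manually scale up a deployment during a traffic spike.
# The deployment name, namespace, and replica count are assumptions; the
# postmortem does not disclose how the Feature Flag metrics service runs.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
apps = client.AppsV1Api()

# Patch only the scale subresource so the rest of the spec is untouched.
apps.patch_namespaced_deployment_scale(
    name="ff-metrics-service",        # assumed deployment name
    namespace="prod1",                # assumed namespace
    body={"spec": {"replicas": 12}},  # assumed target replica count
)
print("Scale-up requested; remember to raise HPA limits to match.")
```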
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.