Last checked: 5 minutes ago
Outage and incident data over the last 30 days for Harness.
OutLogger tracks the status of these components for Harness:
Component | Status |
---|---|
Service Reliability Management - Error Tracking FirstGen (fka OverOps) | Active |
Software Engineering Insights FirstGen (fka Propelo) | Active |
Prod 1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 2 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 3 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 4 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Prod Eu1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
View the latest incidents for Harness and check for official updates:
Description: This incident has been resolved. We will provide an RCA after findings are complete.
Status: Resolved
Impact: Minor | Started At: Oct. 16, 2024, 5:09 p.m.
Description:

## **Summary:**

Pipeline executions were failing with a time-out error on Prod2. This affected ~3% of pipeline executions.

## **What was the issue?**

Tasks are execution units that run on a delegate as part of a pipeline execution. As a pipeline runs, its tasks are broadcast to delegates, and one eligible delegate picks up each task for execution. If no delegate acquires a task within the stipulated time, the task is rebroadcast. During this incident, the rebroadcast functionality was affected, causing pipeline executions to time out.

## **Resolution:**

We rolled back the service to resolve the issue.

## **RCA**

An incompatible change was rolled out in one of our microservices, causing deserialization failures for a subset of task types. The rebroadcast threads went into an error state due to this deserialization error, resulting in the failure of pipelines that required task rebroadcasts. The system recovered upon the service's rollback.

**Action Item**

1. Added a critical alert for rebroadcast events.
2. Rebroadcast logic is made resilient to task deserialization errors.
3. Unit test added to catch incompatible contract changes for task data.
Status: Postmortem
Impact: Minor | Started At: Oct. 14, 2024, 3:08 p.m.
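
The second action item in the postmortem above (rebroadcast logic that tolerates task deserialization errors) can be illustrated with a minimal Python sketch. This is not Harness's implementation: the JSON payload format, the `rebroadcast_pending` function, and the `broadcast` callback are all assumptions made for illustration. The point is only that a failure to deserialize one task is caught and recorded per task, so the loop keeps rebroadcasting the rest.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RebroadcastResult:
    rebroadcast: list = field(default_factory=list)  # task ids sent again
    skipped: list = field(default_factory=list)      # task ids that failed to deserialize


def rebroadcast_pending(raw_tasks, broadcast):
    """Rebroadcast every unacquired task, isolating per-task failures.

    raw_tasks: hypothetical mapping of task id -> serialized (JSON) payload.
    broadcast: hypothetical callback that sends a decoded task to delegates.
    """
    result = RebroadcastResult()
    for task_id, payload in raw_tasks.items():
        try:
            task = json.loads(payload)
        except (json.JSONDecodeError, TypeError):
            # One undecodable task must not stop the whole rebroadcast pass.
            result.skipped.append(task_id)
            continue
        broadcast(task)
        result.rebroadcast.append(task_id)
    return result
```

In a setup like this, the `skipped` list is also a natural input for the alert mentioned in the first action item.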
Description:

### **Summary**

CI-hosted MacOS pipelines were failing during the initialisation step, impacting specific customers using our MacOS-hosted service.

### What was the issue?

We tightened a firewall rule for our Mac VM registry that was previously too permissive. As a result, another component couldn’t access the registry, leading to pipeline failures.

### **Resolution**

| **Time** | **Event** |
| --- | --- |
| Sept 1st, 17:00 UTC | Restricted the firewall rule. |
| Sept 04, 06:03 UTC | Issue reported by the customer. |
| Sept 04, 08:39 UTC | We re-created the firewall rule and validated that the issue was fixed. |

### RCA

Our MacOS production setup includes several components. When we restricted the permissive firewall rule, the new rule did not account for the NAT IP address of one of these components. After the change, we ran a full sanity pipeline on the Mac machines, which passed successfully. The issue didn’t surface immediately as the affected component maintains a persistent socket connection, unaffected by the firewall until the connection is re-established or restarted. This explains why the failure didn’t occur immediately after we removed the permissive rule on September 1st. We restored the rule, and the issue was resolved.

### Action Items

1. Restrict the firewall rule again, ensuring that necessary NAT IPs are included.
2. Restart all relevant services when applying firewall rule restrictions.
3. Ensure that all connections are properly drained and re-established when the change is implemented.
Status: Postmortem
Impact: Major | Started At: Sept. 4, 2024, 6:33 a.m.
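
The first action item in the postmortem above (re-restricting the rule while making sure the necessary NAT IPs are included) lends itself to a pre-change check. The following is a minimal sketch using Python's standard `ipaddress` module. The function name `uncovered_sources` and the addresses (taken from the reserved documentation ranges) are illustrative assumptions, not values from the incident.

```python
import ipaddress


def uncovered_sources(allowed_cidrs, required_ips):
    """Return the required source IPs that no allowed CIDR range covers."""
    networks = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    return [
        ip
        for ip in required_ips
        if not any(ipaddress.ip_address(ip) in net for net in networks)
    ]


# Hypothetical proposed rule and the NAT IPs of components that need access.
proposed_rule = ["192.0.2.0/24", "198.51.100.7/32"]
component_nat_ips = ["192.0.2.15", "203.0.113.10"]

missing = uncovered_sources(proposed_rule, component_nat_ips)
if missing:
    print("Rule would block:", ", ".join(missing))
```

A check like this only validates the rule itself. As the RCA notes, a component holding a persistent connection would still need a restart before the change is actually exercised.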
Description:

## **Summary:**

Logged-in users were redirected to the enrollment screen with an “Email verified successfully” message and were forced to enter their user details again. Pipeline executions and backend tasks were not impacted. Accounts in the Prod 4 cluster were impacted.

## **What was the issue?**

We released an incompatible version of the NextGen UI service, resulting in existing users being sent through the new sign-up flow. This was a human error.

## **Timeline:**

| **Time** | **Event** |
| --- | --- |
| September 03 7:45 PM UTC | Customer reported login redirection to the sign-up page |
| September 03 8:15 PM UTC | New deployment happened around the same time. Decided to roll back |
| September 03 8:20 PM UTC | Started the partial rollback of FF Proxy changes |
| September 03 8:30 PM UTC | Partial rollback didn’t fix the issue. Initiated full rollback |
| September 03 9:00 PM UTC | Full rollback completed and issue resolved |

## **Resolution:**

Rollback resolved the issue.

## **RCA**

There was a human error in picking the version of the NextGen UI service. Post-deployment sanity checks did not catch this issue. Rolling back took longer than expected because multiple services were deployed together.

**Action Item**

1. Remove the manual process for picking service versions. Automate the promotion process from lower environments.
2. Improve the sanity test to catch the above UI flow.
3. Make the rollback process atomic, based on the previous known good state.
Status: Postmortem
Impact: Major | Started At: Sept. 3, 2024, 7:45 p.m.
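
The third action item in the postmortem above (an atomic rollback based on the previous known good state) can be sketched as a snapshot of all service versions that is restored as a unit, rather than service by service. The `ReleaseHistory` class, the service names, and the version strings below are hypothetical and only illustrate the idea.

```python
class ReleaseHistory:
    """Track deployed service versions alongside the last known good set."""

    def __init__(self, initial):
        self._known_good = dict(initial)  # snapshot to fall back to
        self.current = dict(initial)      # versions currently deployed

    def deploy(self, changes):
        """Apply new versions for one or more services."""
        self.current = {**self.current, **changes}

    def mark_good(self):
        """Record the current set once post-deployment checks have passed."""
        self._known_good = dict(self.current)

    def rollback(self):
        """Restore every service to the known good snapshot in one step."""
        self.current = dict(self._known_good)
        return self.current


# Hypothetical service names and versions.
history = ReleaseHistory({"ui": "1.4.0", "ff-proxy": "2.1.0"})
history.deploy({"ui": "1.5.0", "ff-proxy": "2.2.0"})
print(history.rollback())
```

Restoring the whole snapshot avoids the partial-then-full rollback sequence seen in the timeline, at the cost of also reverting services that were not at fault.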
Description:

## **Summary:**

Customer experienced login failures with 5xx errors on the Prod4 cluster.

## **What was the issue?**

The Harness platform internally uses a managed memStore, which experienced a “Host error”; this triggered a master switchover within seconds. Backend microservices that connect to memStore were not able to reconnect quickly. The issue affected Java-based services; Go services reconnected properly.

## **Timeline:**

| **Time** | **Event** |
| --- | --- |
| 21 August 4:06:41 PM UTC | Primary memStore went down |
| 21 August 4:07:00 PM UTC | Secondary memStore promoted to Primary |
| 21 August 4:06:41 PM UTC | Harness services experience RedisResponseTimeoutException |
| 21 August 4:14:30 PM UTC | Harness services restore connectivity to new Primary |
| 21 August 4:14:53 PM UTC | New instance of memStore added and promoted as Secondary |

## **Resolution:**

After 8 minutes, services reconnected to the new primary memStore on their own and recovered.

## **RCA**

Java services use the Redisson library to connect to memStore. The established connection pool doesn’t detect the endpoint going away, and these connections eventually time out. This issue does not occur during a graceful failover; it occurs only during a catastrophic failure.

**Action Item**

* Detect this catastrophic failure and have services reconnect more quickly
Status: Postmortem
Impact: Critical | Started At: Aug. 21, 2024, 4:06 p.m.
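
The action item in the postmortem above (detecting the failure and reconnecting sooner) can be illustrated generically. The sketch below is in Python and does not use or represent Redisson's API. `ReconnectingClient`, the `connect` factory, and the `max_failures` threshold are assumptions chosen for illustration. The idea is to treat a few consecutive timeouts as a sign that the endpoint is gone and replace the connection, instead of waiting for stale connections to expire on their own.

```python
class ReconnectingClient:
    """Replace the connection after repeated timeouts instead of waiting it out."""

    def __init__(self, connect, max_failures=2):
        self._connect = connect            # hypothetical factory returning a fresh connection
        self._max_failures = max_failures  # consecutive timeouts tolerated before reconnecting
        self._conn = connect()
        self._failures = 0

    def call(self, operation):
        """Run operation(connection); reconnect once timeouts pile up."""
        try:
            value = operation(self._conn)
        except TimeoutError:
            self._failures += 1
            if self._failures >= self._max_failures:
                # Assume the old endpoint is gone and dial the current primary.
                self._conn = self._connect()
                self._failures = 0
            raise
        self._failures = 0
        return value
```

The caller still sees the timeout and decides whether to retry. The threshold trades off faster recovery against reconnect churn during brief slowdowns.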