Outage and incident data over the last 30 days for Harness.
OutLogger tracks the status of these components for Harness:
Component | Status |
---|---|
Service Reliability Management - Error Tracking FirstGen (fka OverOps) | Active |
Software Engineering Insights FirstGen (fka Propelo) | Active |
Prod 1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 2 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Engineering Insights (SEI) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 3 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery (CD) - FirstGen - EOS | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Software Supply Chain Assurance (SSCA) | Active |
Prod 4 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
Prod Eu1 | Active |
Chaos Engineering | Active |
Cloud Cost Management (CCM) | Active |
Continuous Delivery - Next Generation (CDNG) | Active |
Continuous Error Tracking (CET) | Active |
Continuous Integration Enterprise(CIE) - Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Linux Cloud Builds | Active |
Continuous Integration Enterprise(CIE) - Self Hosted Runners | Active |
Continuous Integration Enterprise(CIE) - Windows Cloud Builds | Active |
Custom Dashboards | Active |
Feature Flags (FF) | Active |
Infrastructure as Code Management (IaCM) | Active |
Internal Developer Portal (IDP) | Active |
Security Testing Orchestration (STO) | Active |
Service Reliability Management (SRM) | Active |
View the latest incidents for Harness and check for official updates:
Description:
# Summary
Customers with no target groups configured were returned `null` instead of `[]` for the /target-segments request when their SDKs started up. This could lead to null pointer exceptions and a failure to initialise for some SDKs.

## SDK Customer Impact
The issue affected a number of SDK versions, so we tested those and the latest versions to ascertain impact.

Java
* 1.3.1:
  * Behaviour: the `waitForInitialization` call never unblocks once the exception is thrown and caught in the polling thread. This would have caused user code to "freeze" while the SDK was initialising.
  * Impact: critical; the SDK blocks user code from executing.
* 1.6.0 - latest version:
  * Behaviour: an error is logged that the group was null and that the size of the group could not be calculated. Flags are loaded correctly, the `waitForInitialization` call unblocks, and evaluation calls return the correct variation.
  * Impact: no functional impact outside of error logs; the correct evaluation is returned.

Node.js
* 1.3.1:
  * Behaviour: an `UnhandledPromiseRejection` causes the SDK and application to crash. If the SDK client and `waitForInitialization` were used in a try-catch block, an error would be logged and the SDK would serve the correct evaluations.
  * Impact:
    * If no exception handling was used on the client: critical; the user's application would crash.
    * If exception handling was used: no functional impact outside of error logs; the correct evaluation is returned.
* 1.8.1 - latest version: same behaviour and impact as 1.3.1.

Other SDK impact
The remaining server SDKs were tested on their latest versions to ascertain impact. While there were no direct customer reports of issues, this helps to understand the scope of the issue.
* Erlang 3.0.0: critical impact; the exception thrown could prevent an application from starting, depending on how the SDK has been integrated.
* Python 1.6.2: high impact; the SDK fails to initialise and serves default variations.
* .NET 1.7.0: no impact, but an error is logged.
* Go v0.1.23: no impact.
* Ruby 1.3.0: no impact.

## RCA
### Why did some customers experience null pointer exceptions?
When a customer had zero target groups, the /client/target-segments endpoint returned the value `null` instead of an empty array `[]`.

### Why was null being returned instead of an empty array?
A change was made in the DB layer of the backend to return an empty array rather than a not-found error when no target groups exist for a customer. This caused a different codepath to be exercised: it copies all groups into a new array and changes some data before marshalling and returning the JSON response. Because no groups existed, this copy mistakenly produced a nil object instead of an empty array, which was then marshalled into the `null` JSON response. A minimal illustration of this class of bug is sketched after this incident's details.

### Why was that change made to begin with?
Because of our high request rates we use many layers of caching. A side effect of returning errors from the DB layer when no target groups exist was that the response was not cached. With some high-volume customers having no target groups, this led to tens of millions of unnecessary requests hitting the database per week when flags were evaluated, which we were attempting to avoid.

### Why was this scenario not caught by tests?
Unit tests, end-to-end tests, and SDK-specific tests exist for this endpoint; however, the case where target groups are empty wasn't fully covered.
This change was primarily meant to improve performance for the /client/evaluations endpoint, which uses this code path and which was manually tested and confirmed to work correctly. The /client/target-segments code path experiencing side effects from this change wasn't anticipated or caught by automated testing.

## Follow up actions
Follow-up actions cover the issues faced, with Jira IDs linked for tracking completion; these follow-up items must also be linked in the RCA ticket.
* Test enhancements: add unit tests for target groups, covering none, one, and multiple groups.
* Add new validation and logging to ensure valid JSON is returned by the endpoints.
Status: Postmortem
Impact: Minor | Started At: June 25, 2024, 4:15 p.m.
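The nil-versus-empty behaviour described in this RCA matches a well-known property of JSON marshalling in Go (an assumption here; the postmortem does not name the backend language): a nil slice serialises to `null`, while an initialised empty slice serialises to `[]`. The sketch below is a hypothetical illustration, not Harness's actual code; the type and function names are invented.

```go
// Hypothetical illustration of how a nil slice can serialise as `null`
// instead of `[]` in a JSON response. Names are invented for the sketch.
package main

import (
	"encoding/json"
	"fmt"
)

type TargetSegment struct {
	Identifier string `json:"identifier"`
}

// transformSegments mimics the "copy and modify" codepath described in the RCA:
// the output slice is only created while iterating, so zero input groups leave
// it as its nil zero value.
func transformSegments(in []TargetSegment) []TargetSegment {
	var out []TargetSegment // stays nil until the first append
	for _, s := range in {
		out = append(out, s)
	}
	return out
}

func main() {
	empty, _ := json.Marshal(transformSegments([]TargetSegment{}))
	fmt.Println(string(empty)) // prints "null" — what affected SDKs received

	// Initialising the slice explicitly yields the expected empty array.
	fixed, _ := json.Marshal(make([]TargetSegment, 0))
	fmt.Println(string(fixed)) // prints "[]"
}
```

A typical remedy is to return an explicitly initialised empty slice from the copy path, or to validate the marshalled payload before responding, which mirrors the validation-and-logging follow-up action above.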
Description:
## **Summary:**
Pipeline executions across CI/CD were not advancing as expected, with some getting stuck in the prod2 cluster.

## **What was the issue?**
One of the microservices was running low on resources because a pipeline execution consumed excessive resources while attempting to resolve expressions. This caused all pipelines to run roughly 80% slower for the duration of the incident, with a few executions getting into an unresponsive state.

## **Timeline:**

## **Resolution:**
The pipeline execution that was consuming excessive resources was aborted, and the service pods were restarted to recover the system.

## **RCA**
A pipeline contained circular references in its files. The runtime resolution of these references caused an excessive number of threads to enter the Waiting state. Although the issue was automatically detected (and the system auto-protected itself by breaking the circuit), the excess threads were still consumed because the loop-detection threshold was set too high. We have since reduced the loop threshold and automated the blocking of such runaway pipeline executions. A sketch of this style of loop detection follows this incident's details.
Status: Postmortem
Impact: Minor | Started At: June 3, 2024, 11:30 p.m.
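To illustrate the loop-detection idea in the RCA above: combining a visited set with a hard depth cap lets a resolver fail fast on circular references instead of tying up threads. Everything below (the `ref:` syntax, the threshold value, the function names) is an invented sketch, not Harness's expression engine.

```go
// Hypothetical sketch of runtime loop detection when resolving references,
// with a hard threshold that aborts resolution rather than letting threads
// pile up in the Waiting state.
package main

import (
	"errors"
	"fmt"
	"strings"
)

const maxResolutionDepth = 64 // illustrative cap; the RCA describes lowering this threshold

// resolve follows "ref:<name>" values through vars, detecting cycles via the
// seen set and failing fast once the depth threshold is exceeded.
func resolve(name string, vars map[string]string, seen map[string]bool, depth int) (string, error) {
	if depth > maxResolutionDepth {
		return "", errors.New("resolution depth exceeded: aborting runaway expression")
	}
	if seen[name] {
		return "", fmt.Errorf("circular reference detected at %q", name)
	}
	seen[name] = true
	val := vars[name]
	if ref, ok := strings.CutPrefix(val, "ref:"); ok {
		return resolve(ref, vars, seen, depth+1)
	}
	return val, nil
}

func main() {
	vars := map[string]string{"a": "ref:b", "b": "ref:a"} // circular reference
	if _, err := resolve("a", vars, map[string]bool{}, 0); err != nil {
		fmt.Println("blocked:", err) // blocked: circular reference detected at "a"
	}
}
```

Turning this error into an automatic abort of the offending execution corresponds to the "automated blocking of runaway pipeline executions" described above.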
Description:
**What was the issue?**
Pipeline executions were stuck and failing due to a Redis instance in the primary region (us-west1) becoming unresponsive. Impact was limited to pipeline executions in NextGen.

**Timeline**

| **Time** | **Event** |
| --- | --- |
| 28 May - 8:56 PM PDT | We received an alert on high memory utilization on our Redis instance. The team identified a particularly large cache key and attempted to manually delete it, which resulted in the Redis instance becoming unstable. Pipeline executions began failing or hanging. We are still working with our Redis service provider to understand why the key deletion made the instance unresponsive; such deletions have been done previously without issue, consistent with vendor advice. |
| 28 May - 9:15 PM PDT | We engaged the Redis support team and executed a mitigation plan. |
| 28 May - 9:40 PM PDT | We failed over our application to the secondary Redis instance and the system started to recover. Unfortunately, the secondary region also went into a bad state as we migrated load. |
| 28 May - 10:02 PM PDT | The primary Redis instance became healthy. |
| 28 May - 10:25 PM PDT | Application traffic was migrated back to the primary Redis region, which restored functionality. |

**RCA & Action Items:**
Pipelines started failing because the Redis instance in the primary region became unresponsive after the manual deletion of a large key. We are working with the vendor to determine why this action caused instability and will share more details when they become available. As an immediate action item, we are implementing an upper bound on cache key size (see the sketch after this incident) so that large cache keys can no longer drive the instance into high memory usage. In the longer term, we are revisiting our architecture to eliminate large keys in Redis altogether.
Status: Postmortem
Impact: Critical | Started At: May 29, 2024, 4:06 a.m.
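As a sketch of the "upper bound on cache key size" action item: a write path can refuse oversized payloads before they ever reach Redis. This assumes a Go service using the go-redis client; the size limit, key, and function names are illustrative, not Harness's implementation.

```go
// Hypothetical sketch: enforce a size cap on cached values so a single huge
// entry cannot push the Redis instance into high memory usage.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

const maxCacheValueBytes = 1 << 20 // 1 MiB cap per cached entry (illustrative)

var ErrValueTooLarge = fmt.Errorf("cache value exceeds %d bytes", maxCacheValueBytes)

// setBounded writes to Redis only when the payload is under the cap.
func setBounded(ctx context.Context, rdb *redis.Client, key string, value []byte, ttl time.Duration) error {
	if len(value) > maxCacheValueBytes {
		return ErrValueTooLarge // caller can log and fall back to an uncached path
	}
	return rdb.Set(ctx, key, value, ttl).Err()
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	err := setBounded(context.Background(), rdb, "pipeline:cache:example", make([]byte, 2<<20), 10*time.Minute)
	fmt.Println(err) // cache value exceeds 1048576 bytes
}
```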
Description:
## What was the issue?
Pipelines in Prod1 were experiencing intermittent failures caused by gRPC connection issues between Harness services. The majority of the failed gRPC requests occurred between the CI Manager and Harness Manager (CG), resulting in a primary impact on CI pipelines.

## Timeline

| **Time** | **Event** |
| --- | --- |
| May 28, 9:50 AM PDT | A customer reported intermittent pipeline failures. The team initiated an investigation but did not identify any issues with the infrastructure; the failures appeared isolated to one specific customer. Teams were promptly alerted to monitor this customer's pipelines. |
| May 28, 3:05 PM PDT | The engineering team observed more occurrences of the issue across other pipelines. |
| May 28, 3:30 PM PDT | The team decided to roll back a recent deployment in order to investigate any potential correlation. |
| May 28, 5:15 PM PDT | The status of the Prod1 environment was updated to "degraded performance" due to the intermittent issues. |
| May 28, 6:20 PM PDT | The issue was suspected to be related to kube-dns resolution, causing some gRPC requests to fail randomly. GCP support was engaged for further investigation, and service thread dumps were captured for internal debugging; the thread dumps revealed nothing suspicious. |
| May 28, 7:00 PM PDT | Status changed to "monitoring". |

## RCA and Action Items:
Pipelines experienced failures because internal service communication failed despite multiple attempts. The engineering team initially suspected kube-dns problems; however, after consulting with GCP support, this was ruled out. It was noted that certain service pod replicas were receiving an uneven distribution of requests. To address this, the following corrective actions are underway (a sketch of the load-balancing approach follows this incident):
1. Enhancing load balancing for gRPC calls among service pods.
2. Incorporating traceId in delegate task submissions.
Furthermore, we have set up gRPC-related alerts to catch similar situations in the future.
Status: Postmortem
Impact: Minor | Started At: May 29, 2024, 12:15 a.m.
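The postmortem does not detail how gRPC load balancing was enhanced; a common remedy for uneven request distribution across pods (caused by long-lived HTTP/2 connections pinning to one backend) is client-side round-robin balancing over DNS-resolved addresses. The grpc-go sketch below shows that pattern; the target address and port are placeholders, not Harness's real endpoints.

```go
// Hypothetical sketch: spread gRPC calls across service pods with DNS
// resolution plus client-side round_robin balancing, instead of pinning
// every request to a single long-lived connection.
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	conn, err := grpc.Dial(
		"dns:///harness-manager.example.svc.cluster.local:9879", // placeholder headless-service target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Pick a new resolved backend per RPC rather than the first address only.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
	// conn would now be used to construct service clients; each RPC is
	// balanced across the pod IPs returned by DNS.
}
```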
Description:
## Summary
A few trial Current Gen (CG) CD customers had difficulty logging in and accessing Harness Secret Manager in our Prod-2 environment. There was no impact on Next Gen (NG) customers.

## Timeline

| **Time (UTC)** | **Event** |
| --- | --- |
| 04:22 am | We received internal alerts for an increased error rate related to Secret Manager. |
| 06:09 am | An incident was raised when a trial customer reached out. |
| 06:28 am | The root cause was identified. |
| 07:34 am | The incident was resolved. |

## Resolution
Upon reviewing the code base, we identified a key configuration missing from our database records. The required data was restored from periodic snapshots.

## RCA
As part of the official EOL for CG CD, a cleanup activity was performed in the backend database for all internal accounts. In that cleanup, a legacy configuration used by the **Harness Secret Manager** was deleted.

## Action Items
We will perform stringent checks on data before cleanup, followed by sanity checks to ensure no functionality is impacted.
Status: Postmortem
Impact: Major | Started At: May 3, 2024, 6:19 a.m.