Outage and incident data over the last 30 days for UiPath.
OutLogger tracks the status of these components for UiPath:
Component | Status |
---|---|
Action Center | Active |
AI Center | Active |
Apps | Active |
Automation Cloud | Active |
Automation Hub | Active |
Automation Ops | Active |
Autopilot for Everyone | Active |
Cloud Robots - VM | Active |
Communications Mining | Active |
Computer Vision | Active |
Context Grounding | Active |
Customer Portal | Active |
Data Service | Active |
Documentation Portal | Active |
Document Understanding | Active |
Insights | Active |
Integration Service | Active |
Marketplace | Active |
Orchestrator | Active |
Process Mining | Active |
Serverless Robots | Active |
Solutions Management | Active |
Studio Web | Active |
Task Mining | Active |
Test Manager | Active |
View the latest incidents for UiPath and check for official updates:
Description:
## Customer impact
From 2024-04-03 23:45 UTC to 2024-04-04 02:35 UTC, our customers experienced errors when accessing some of the services located in the US region of Automation Cloud. Impacted products include Automation Cloud, Orchestrator, Automation Hub, Automation Ops, Document Understanding, Serverless Robots, Cloud Robots - VM, Solutions Management, and Insights.

## Root cause
UiPath makes extensive use of Azure SQL. At the beginning of the outage, Microsoft performed routine SQL maintenance in the East US region. Typically this is done without any visible impact to our customers, but this time the maintenance caused the SQL databases in the region to become unavailable. We are still waiting for a root cause from Microsoft and will update this document once we receive it.

## Detection
Automated alerts immediately detected the issue and notified UiPath on-call engineers. They confirmed the scope of the outage and updated [status.uipath.com](http://status.uipath.com/).

## Response
After a brief investigation, we determined that the problem was with Azure SQL and reached out to Microsoft Support for assistance. For the US region of all UiPath products, we place the primary database in Azure's East US region and a failover database in Azure's West US region. By default, Azure will fail over from the primary to the secondary after the primary has been unavailable for 60 minutes (see the configuration sketch after this entry). During this incident, most databases automatically failed over to the secondary region; unfortunately, the Orchestrator, Automation Hub and Insights databases did not. UiPath engineers investigated the databases and began to trigger a manual failover, but by that time Microsoft had resolved the underlying issue in the East US region.

## Follow up
* Work with Microsoft to get a root cause for the underlying Azure SQL outage.
* Determine why Orchestrator, Automation Hub and Insights did not fail over to the secondary region. Perform a failover drill to confirm the problem has been fixed.
* Investigate whether the automatic failover period can be reduced from 60 minutes.
Status: Postmortem
Impact: Major | Started At: April 4, 2024, 12:02 a.m.
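The 60-minute window discussed above corresponds to the grace period on an Azure SQL failover group. As a minimal sketch (not UiPath's actual configuration) of how that grace period could be set and how a manual failover could be triggered, the snippet below uses the `azure-mgmt-sql` Python SDK; the subscription, resource group, server and failover group names are placeholders.

```python
# Sketch: configure an Azure SQL failover group's automatic-failover grace
# period and trigger a manual failover. All resource names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import (
    FailoverGroup,
    FailoverGroupReadWriteEndpoint,
    PartnerInfo,
)

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-example"
PRIMARY_SERVER = "sql-primary-eastus"   # hypothetical primary in East US
SECONDARY_SERVER_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-example"
    "/providers/Microsoft.Sql/servers/sql-secondary-westus"
)

client = SqlManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Create or update the failover group with automatic failover after 60 minutes
# of primary unavailability (the default window mentioned in the postmortem).
client.failover_groups.begin_create_or_update(
    resource_group_name=RESOURCE_GROUP,
    server_name=PRIMARY_SERVER,
    failover_group_name="fg-example",
    parameters=FailoverGroup(
        partner_servers=[PartnerInfo(id=SECONDARY_SERVER_ID)],
        read_write_endpoint=FailoverGroupReadWriteEndpoint(
            failover_policy="Automatic",
            failover_with_data_loss_grace_period_minutes=60,
        ),
        databases=[],  # resource IDs of the databases to protect
    ),
).result()

# Manual (planned) failover, issued against the secondary server, as the
# engineers began to do during the incident.
client.failover_groups.begin_failover(
    resource_group_name=RESOURCE_GROUP,
    server_name="sql-secondary-westus",
    failover_group_name="fg-example",
).result()
```

Lowering `failover_with_data_loss_grace_period_minutes` is the knob the last follow-up item refers to; whether Azure accepts a value below 60 minutes should be verified against current Azure documentation.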
Description: This incident has been resolved.
Status: Resolved
Impact: Critical | Started At: March 28, 2024, 5:16 a.m.
Description: The fix has been applied on all geos and all communications were restored now.
Status: Resolved
Impact: Minor | Started At: March 25, 2024, 11:47 a.m.
Description:
# Background
UiPath Communications Mining is deployed globally across multiple regions. Each region is independent of all others, with independent deployments of databases and stateless services. Multiple distributed database solutions are deployed in each region for different purposes. Historically we used a strongly consistent, horizontally scalable document store for most ground-truth data storage, but for a variety of reasons, including operational concerns relevant to this outage, over the last year we have been migrating away from this store to a distributed SQL database. Today, however, much of our data (~1B rows, ~5 TiB) is still stored in this legacy document store.

# Customer Impact
* Performance degradation and elevated error rates (HTTP 500) for tenants in the EU region, starting on Saturday, Mar 16 at 10:02 UTC and continuing through Mar 18.
* From Monday, Mar 18, 11:37 UTC, analytics and the UI were fully back up, but training, ingestion and streams continued to experience issues.
* All functionality was fully restored on Wednesday, Mar 20 at 10:20 UTC.
* 35 tenants in the EU were affected; no tenants in other regions were impacted.

# Root Cause
The outage was caused by an interaction of multiple issues. At its core, the incident was triggered by a manual scaling operation, started on Saturday, Mar 16, that exposed fundamental problems in our legacy document store:
1. Explicit table re-sharding causes a temporary reduction in fault tolerance.
2. Unexpected exhaustion of the memory mapped page count caused multiple DB nodes to crash simultaneously.
3. Kubernetes security controls (read-only filesystems with unprivileged containers) prevented in-place updates to sysctls, requiring further DB restarts to increase the memory mapped page limit (`vm.max_map_count`).
4. The crashes exposed flaws in our document store's failover mechanism, causing nodes to enter a "viral" state in which failover nodes also entered a backfilling state.
5. The eventual solution was to manually re-create a subset of the database tables and repopulate them with data from the old, now read-only tables.
6. The new tables suffered from very slow secondary index reconstruction in our document store.

# Detection
Due to increased usage in the EU region, we started scaling up our document store cluster on Jan 30. We added two new nodes and, over the next month and a half, re-sharded and moved tables to the new nodes during weekends to avoid customer impact. Until the weekend of Mar 16, these operations all completed without a hitch. As soon as we started re-sharding one of the only two remaining tables at 10:02 UTC on Mar 16, two database nodes crashed simultaneously due to exhaustion of memory mapped pages (`vm.max_map_count`). An on-call engineer was actively monitoring the process at the time, and the issue was also picked up within minutes by our automated alerts (a monitoring sketch for this limit follows this description).

# Response
Since all our workloads run in read-only, unprivileged containers, increasing this limit is impossible without restarting all the nodes. The focus was therefore on bringing the cluster into a fully replicated state so we could run a controlled restart to increase `vm.max_map_count` on all the nodes. Because of the hard crash during a re-sharding operation, the database exhibited unexpected behaviour and entered a degraded state: it would sporadically become read-only and would not accept writes before a full integrity check. Furthermore, the recovery process never seemed to fully complete.

By Sunday evening a sufficient number of replicas had become available. Our automatic nightly backup process started at 23:00 UTC on Sunday, Mar 17, adding enough load that the database experienced another four node crashes between 01:00 and 06:00 UTC on Monday, Mar 18, again due to `vm.max_map_count` exhaustion. The DB reverted to the same degraded state as above, with very lengthy automated "backfilling" processes that never completed and during which the DB entered read-only mode. Due to the risk of further crashes before recovery completed, at 08:42 UTC on Monday, Mar 18 we decided to go ahead with the controlled restart to increase `vm.max_map_count`, even though the database was not fully recovered. This resulted in many additional hours of downtime, but in exchange gave us confidence that the recovery would complete without further unexpected crashes.

By 11:37 UTC on Monday, Mar 18, all but two tables were fully available, allowing us to restore most functionality. The remaining two (very large) tables repeatedly failed to recover through the automated process. We rapidly built and, after significant testing and iteration, deployed an emergency batch job at 04:20 UTC on Tuesday, Mar 19. It created fresh tables and copied all rows into them while maintaining availability of the rest of the product (see the copy-and-reindex sketch after this entry). The copy completed at 07:30 UTC on Tuesday, after which we could start rebuilding the secondary indexes in the new tables. Reindexing ~200M rows took over 24 hours, finally completing at 10:20 UTC on Wednesday, Mar 20, restoring all functionality.

# Follow-ups
This is the most significant outage UiPath Communications Mining has ever experienced, and it was caused by one of our core data stores. We had been aware of issues with this document store and have been migrating away from it slowly over the last year. The next steps are:
1. Halt further scaling of the document store. The number of replicas today can handle current and forecasted load for at least another year, and we know the database is resilient in its steady state.
2. Reduce the amount of data stored in the database by more aggressively garbage collecting old data and by moving larger objects into blob storage, referenced from the database instead.
3. Reprioritise the migration away from this legacy store as critical, aiming to complete it in the next six months, starting with the database tables that caused the most problems during this incident.
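For context on the `vm.max_map_count` exhaustion described above: both the kernel limit and a process's current usage are exposed through standard Linux `/proc` interfaces, so headroom can be watched without privileged access. The sketch below is a minimal monitoring idea, not part of the postmortem; the 80% warning threshold is an assumption.

```python
# Sketch: compare a process's memory-mapped region count against the
# kernel-wide vm.max_map_count limit (Linux only).
import sys
from pathlib import Path


def max_map_count() -> int:
    # Kernel limit on memory-mapped areas per process (vm.max_map_count).
    return int(Path("/proc/sys/vm/max_map_count").read_text())


def map_count(pid: int) -> int:
    # Each line in /proc/<pid>/maps describes one mapped region.
    with open(f"/proc/{pid}/maps") as maps:
        return sum(1 for _ in maps)


def check(pid: int, warn_ratio: float = 0.8) -> None:
    limit = max_map_count()
    used = map_count(pid)
    ratio = used / limit
    status = "WARN" if ratio >= warn_ratio else "ok"
    print(f"pid={pid} maps={used}/{limit} ({ratio:.0%}) {status}")


if __name__ == "__main__":
    check(int(sys.argv[1]))  # usage: python map_count_check.py <db-pid>
```

Note that `vm.max_map_count` is a node-level, non-namespaced sysctl, which is consistent with the postmortem's point that raising it under read-only, unprivileged containers required node restarts rather than an in-place change.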
Status: Postmortem
Impact: Critical | Started At: March 18, 2024, 9:18 a.m.
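The emergency recovery in the incident above amounted to a copy-into-fresh-tables-then-reindex job. The postmortem does not name the document store, so the sketch below illustrates the same pattern with Python's built-in `sqlite3` purely for concreteness; the table names, the `tenant_id` column and the batch size are hypothetical.

```python
# Sketch: copy a table into a fresh table in bounded batches, then rebuild
# secondary indexes afterwards (illustrative only; not the store used above).
import sqlite3

BATCH = 10_000  # rows copied per transaction; tune to keep load bounded


def copy_table(conn: sqlite3.Connection, src: str, dst: str) -> None:
    """Copy all rows from src into a freshly created dst table in batches,
    deferring secondary index creation until the copy has finished."""
    # Create an empty table with the same columns as src.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {dst} AS SELECT * FROM {src} WHERE 0")
    last_id = 0
    while True:
        rows = conn.execute(
            f"SELECT rowid, * FROM {src} WHERE rowid > ? ORDER BY rowid LIMIT ?",
            (last_id, BATCH),
        ).fetchall()
        if not rows:
            break
        placeholders = ",".join("?" * (len(rows[0]) - 1))
        conn.executemany(
            f"INSERT INTO {dst} VALUES ({placeholders})",
            [row[1:] for row in rows],  # drop the leading rowid
        )
        conn.commit()
        last_id = rows[-1][0]


def rebuild_indexes(conn: sqlite3.Connection, dst: str) -> None:
    # Secondary indexes are rebuilt only after the bulk copy, mirroring the
    # "copy first, reindex after" sequence described in the postmortem.
    conn.execute(f"CREATE INDEX IF NOT EXISTS idx_{dst}_tenant ON {dst}(tenant_id)")
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect("example.db")
    copy_table(conn, src="messages_old", dst="messages_new")
    rebuild_indexes(conn, "messages_new")
```

Keeping index creation out of the copy loop mirrors the sequence in the postmortem: bulk-load first, then rebuild secondary indexes, which is why the reindexing step dominated the recovery time.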
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: March 13, 2024, 3:31 p.m.