Last checked: 9 minutes ago
Get notified about any outages, downtime or incidents for InfluxDB and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for InfluxDB.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for InfluxDB:
Component | Status |
---|---|
**AWS: Sydney (Discontinued)** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Cloud Dedicated** | Active |
API Reads | Active |
API Writes | Active |
Management API | Active |
**Cloud Serverless: AWS, EU-Central** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Cloud Serverless: AWS, US-East-1** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Cloud Serverless: AWS, US-West-2-1** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Cloud Serverless: AWS, US-West-2-2** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Cloud Serverless: Azure, East US** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Cloud Serverless: Azure, W. Europe** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Cloud Serverless: GCP** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Google Cloud: Belgium (Discontinued)** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Other Services** | Active |
Auth0 User Authentication | Active |
Marketplace integrations | Active |
Web UI Authentication (Auth0) | Active |
View the latest incidents for InfluxDB and check for official updates:
Description: This incident has been resolved.
Status: Resolved
Impact: None | Started At: May 12, 2023, 9 a.m.
Description:

# **Incident RCA**

RCA - query errors in us-central-1 on May 11, 2023

## **Summary**

One of the customers in this cluster submitted a large number of deletes. By itself, this would not have caused an outage. However, at the same time, one of the storage pods ran out of disk space. We added more disk space, but due to the large number of tombstone files (created to keep track of deleted measurements) the pod was very slow to recover, and there was a high rate of query failures until the storage pod recovered.

## **Cause of the Incident**

The immediate root cause was that a disk filled up. Under normal circumstances, this is not service-impacting: we are alerted when a disk is close to filling up, and we have a runbook in place to add capacity to the storage layer. The pod must be restarted after adjusting the disk size, and when the pod restarted it was unavailable for a long time while it processed all the tombstone files.

## **Recovery**

As soon as we identified that deletes were contributing to the slow recovery, we reached out to the customer that had generated the large volume of deletes and asked them to stop sending requests to delete measurements while the cluster was in recovery mode. While we waited to hear back from them, we blocked deletes for all customers on this cluster as a temporary measure. We also manually removed the tombstone files from one replica of the most heavily impacted storage partition so that it could recover more quickly. This enabled the cluster to return to normal operation and process queries. Meanwhile, the other replica of this partition continued to process the backlog of tombstone files, so that once it was complete we could restart both replicas and the data would be complete and correct.

## **Timeline**

- May 11, 2023 18:20 - Alerted that a storage pod disk was close to capacity.
- May 11, 2023 18:43 - Added more disk capacity and restarted the pods.
- May 11, 2023 19:05 - The pod was very slow to start, so the storage pod became unavailable and queries started failing. Investigation showed that the problem partition was pegged processing the enormous number of deletes, but that it was making progress. We decided to let the process run its course so that the data would be correct. Continuous monitoring of the progress gave a predicted outage time of around two hours. During this time writes were being accepted but queries were failing, which was seen as the least bad option. The partition completed its work but then, because of the number of delete requests that had continued to arrive, effectively had to start over.
- May 11, 2023 21:16 - Blocked all deletes to the cluster and continued to monitor the situation. Progress was still being made. We examined the code paths and decided that this was not a software fault as such, so the best course of action was to allow the process to continue to run. We started investigating ideas to restore service faster.
- May 11, 2023 22:00 - Manually deleted tombstone files on the secondary replica of the impacted storage pods to speed up the recovery process. This meant that queries could be serviced, but data which should have been deleted was visible; this was judged the least bad option at the time. We intended to leave the primary replica to process the remaining deletes, expecting this to take many more hours; once that process was complete we would restart the secondary, and the replicas would be back in sync with the correct data.
- May 11, 2023 23:08 - Primary replica of the impacted storage pod recovered; query rate returned to normal.

## **Future mitigations**

1. We are reducing the rate limit for deletes, to reduce the load on the cluster caused by deletions.
2. We continue to work with customers to find alternative approaches that reduce their need to delete measurements.

A sketch of the kind of measurement-delete API calls described above follows this incident entry.
Status: Postmortem
Impact: None | Started At: May 11, 2023, 7:42 p.m.
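The RCA above turns on two mechanics: each measurement-delete request creates tombstone files on the storage layer, and the planned mitigation is a tighter rate limit on deletes. The sketch below is illustrative only and is not part of the incident record. It uses InfluxDB v2's public `/api/v2/delete` HTTP endpoint; the host, org, bucket, token, and measurement names are placeholders, and the fixed sleep is a crude client-side stand-in for the server-side delete throttling mentioned under Future mitigations.

```python
# Illustrative sketch: what a measurement-delete request to InfluxDB v2 looks
# like, plus naive client-side pacing so a batch of deletes does not arrive
# as a single burst. All connection details below are placeholders.
import time
import requests

INFLUX_URL = "https://example-region.influxdata.example"  # placeholder host
ORG = "example-org"            # placeholder org name
BUCKET = "example-bucket"      # placeholder bucket name
TOKEN = "REDACTED_API_TOKEN"   # placeholder API token


def delete_measurement(measurement: str, start: str, stop: str) -> None:
    """Ask InfluxDB to delete all points for one measurement in a time range.

    Each request like this produces tombstone entries on the storage layer,
    which is why a large burst of deletes made recovery slow in the RCA.
    """
    resp = requests.post(
        f"{INFLUX_URL}/api/v2/delete",
        params={"org": ORG, "bucket": BUCKET},
        headers={"Authorization": f"Token {TOKEN}"},
        json={
            "start": start,  # RFC3339 timestamps bounding the delete
            "stop": stop,
            "predicate": f'_measurement="{measurement}"',
        },
        timeout=30,
    )
    resp.raise_for_status()  # the delete endpoint returns 204 on success


# Pace a batch of deletes instead of sending them all at once: a client-side
# analogue of the reduced delete rate limit described in the mitigations.
measurements_to_drop = ["old_metrics_a", "old_metrics_b"]  # hypothetical names
for name in measurements_to_drop:
    delete_measurement(name, "2020-01-01T00:00:00Z", "2023-01-01T00:00:00Z")
    time.sleep(1.0)  # crude throttle: at most about one delete per second
```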
Description: This incident has been resolved.
Status: Resolved
Impact: None | Started At: April 13, 2023, 4:37 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: April 5, 2023, 2:19 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.