Last checked: 2 minutes ago
Outage and incident data over the last 30 days for InfluxDB.
Outlogger tracks the status of these components for InfluxDB:
Component | Status |
---|---|
**AWS: Sydney (Discontinued)** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Cloud Dedicated** | Active |
API Reads | Active |
API Writes | Active |
Management API | Active |
**Cloud Serverless: AWS, EU-Central** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Cloud Serverless: AWS, US-East-1** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Cloud Serverless: AWS, US-West-2-1** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Cloud Serverless: AWS, US-West-2-2** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Cloud Serverless: Azure, East US** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Cloud Serverless: Azure, W. Europe** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Cloud Serverless: GCP** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Google Cloud: Belgium (Discontinued)** | Active |
API Queries | Active |
API Writes | Active |
Compute | Active |
Other | Active |
Persistent Storage | Active |
Tasks | Active |
Web UI | Active |
**Other Services** | Active |
Auth0 User Authentication | Active |
Marketplace integrations | Active |
Web UI Authentication (Auth0) | Active |
View the latest incidents for InfluxDB and check for official updates:
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: June 19, 2024, 6:35 p.m.
Description:

# RCA: Sustained query error rate in AWS eu-central-1 on June 7, 2024

### Background

Data stored in InfluxDB Cloud is distributed across 64 partitions. Distribution is performed using a persistent hash of the series key, with the intent that write and query load will, on average, be spread evenly across the partitions.

When an InfluxDB Cloud user writes data, the writes first go into a durable queue; storage pods consume ingest data from the other end of this queue rather than handling user writes directly. Among other things, this allows writes to be accepted even if storage is encountering issues. One of the metrics used to reflect the status of this pipeline is [Time To Become Readable (TTBR)](https://docs.influxdata.com/influxdb/cloud/reference/internals/ttbr/), the time between a write being accepted and its data becoming available to queries after passing through the queue.

To respond to a query, the compute tier needs to request relevant data from each of the 64 partitions. For a query to succeed, the compute tier **must** receive a response from every partition; this ensures that incomplete results are never returned. Each partition has multiple pods responsible for it, and query activity is distributed across them.

### Start of Incident

On June 7, 2024, partition 44 started to report large increases in TTBR. This meant that, while customers' writes were being safely accepted into the durable queue, they were delayed in becoming available to queries. At around the same time, alerts were received indicating an elevated query failure rate, accompanied by an increase in query queue depth.

### Investigation

Investigation showed that the pods responsible for partition 44 were periodically trying to consume more RAM than permitted, causing them to exit and report an out-of-memory (OOM) event. InfluxData allocated additional RAM to the pods to try to mitigate the customer-facing impact quickly. However, the pods continued to OOM, so the investigation moved on to identifying the source of the excessive resource usage.

In a multi-tenant system, a single user's resource usage impacting other users is known as a noisy-neighbor issue. The best way to address the problem is to identify the tenant whose query is responsible and temporarily block their queries while we engage with them to correct the problematic query. In this case, the customer had automated the execution of a query that attempted to run an in-memory `sort()` against data taken from a particularly dense series. With the problematic query being submitted regularly, more and more RAM was consumed until the storage pods ultimately OOMed.

As a result of these repeated OOMs, Kubernetes moved the pods into a CrashLoopBackOff state, which lengthened the recovery time between each OOM. The extended recovery periodically left all pods responsible for partition 44 offline, preventing the query tier from authoritatively answering queries.

### Actions

We are working on several changes to better identify the source of incidents and reduce the likelihood of them recurring. These changes include:

* Better visualization to make it easier to identify small noisy neighbors
* Disabling CrashLoopBackOff for storage pods, once support for doing so has been added to Kubernetes
Status: Postmortem
Impact: Major | Started At: June 7, 2024, 10:37 p.m.
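The background section of the RCA above describes two behaviours worth making concrete: series keys are mapped to one of 64 partitions with a persistent hash, and a query only succeeds if every partition responds. The Python sketch below is illustrative only; the hash function, series-key format, and merge logic are assumptions rather than InfluxData's implementation, but it shows why a single unhealthy partition (such as partition 44 during this incident) fails whole queries instead of returning partial results.

```python
# Illustrative sketch only -- not InfluxData's code. It mimics two behaviours
# from the RCA: stable hashing of a series key onto one of 64 partitions, and
# a query fan-out that refuses to return incomplete results.
import hashlib

NUM_PARTITIONS = 64  # per the RCA, data is spread across 64 partitions


def partition_for(series_key: str) -> int:
    """Map a series key to a partition with a persistent (stable) hash."""
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def merge_query_results(partition_responses: dict) -> list:
    """Merge per-partition results, but only if every partition answered.

    The compute tier must hear back from all 64 partitions; if any are
    missing (e.g. partition 44 while its pods were in CrashLoopBackOff),
    the whole query fails rather than returning incomplete data.
    """
    missing = [p for p in range(NUM_PARTITIONS) if p not in partition_responses]
    if missing:
        raise RuntimeError(f"query failed: no response from partitions {missing}")
    merged = []
    for p in range(NUM_PARTITIONS):
        merged.extend(partition_responses[p])
    return merged


# The same series key always lands on the same partition.
print(partition_for("cpu,host=server01,region=eu-central-1"))
```

Because the write path goes through the durable queue, a partition in this state still accepts writes; its TTBR grows instead, which is exactly the symptom that opened the incident.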
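The investigation section points to an automated query running an in-memory `sort()` over a particularly dense series. A hypothetical Flux query of that shape, submitted through the official `influxdb-client` Python package, might look like the sketch below; the bucket, measurement, endpoint, and credentials are placeholders, and this is a reconstruction of the query pattern described, not the customer's actual query.

```python
# Hypothetical reconstruction of the query shape described in the RCA -- not
# the customer's actual query. A wide range() over a dense series followed by
# sort() forces everything matched to be held in memory before any row can be
# emitted, which is the kind of memory growth the postmortem describes.
from influxdb_client import InfluxDBClient

FLUX = """
from(bucket: "example-bucket")                      // placeholder bucket
  |> range(start: -30d)                             // wide window over a dense series
  |> filter(fn: (r) => r._measurement == "sensor")  // placeholder measurement
  |> sort(columns: ["_value"])                      // in-memory sort of the full result
"""

# Endpoint, token, and org are placeholders for an InfluxDB Cloud account.
with InfluxDBClient(url="https://eu-central-1-1.aws.cloud2.influxdata.com",
                    token="<api-token>", org="<org>") as client:
    for table in client.query_api().query(FLUX):
        for record in table.records:
            print(record.get_value())
```

Run on an automated schedule, a query like this repeatedly claims a large slab of RAM on the storage pods serving that series' partition, matching the OOM and CrashLoopBackOff cycle the RCA describes.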
Description: This incident has been resolved.
Status: Resolved
Impact: Critical | Started At: May 27, 2024, 8:25 a.m.