-
Time: Feb. 8, 2023, 3:27 a.m.
Status: Postmortem
Update: # Incident RCA
2023-01-23: Delayed writes and reads in eu-central-1
# **Summary**
Beginning in early January, a new customer workload began periodically causing high TTBR (time for writes to become readable) in eu-central-1. This workload was characterized by a few (fewer than 10 per day) large spikes of writes, sometimes upwards of 50-60 MiB/s. Under normal circumstances this would have posed little issue for a cluster of this size; however, these spikes consisted primarily of "upserts", i.e. modifications to existing points. Merging the existing point values with those newly written during a spike of write traffic was highly CPU intensive, and all replicas of the most heavily impacted storage partitions fell behind on the stream of new writes, sometimes by over an hour. After each spike of write traffic, the most severely impacted partitions took several hours to recover. Spikes during UTC business hours were frequently separated by as little as one hour, further extending the time needed to recover.
The engineering team identified the customer in question and attempted to mitigate the impact of these spikes of write traffic by significantly increasing the provisioned resources for each storage partition in this cluster. When these efforts proved unsuccessful, we worked with the customer to pause this workload. Going forward, we will work with this customer to tune their workload.
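To illustrate the mechanism, here is a conceptual sketch only, not the storage engine's actual code; the names `existing_points`, `apply_append`, and `apply_upsert` are hypothetical. An append-only write simply records a new point, whereas an upsert must locate the existing point at the same series and timestamp, merge the field values, and rewrite it. That read-merge-rewrite cycle is the CPU-intensive step that caused replicas to fall behind during upsert-heavy spikes.

```python
# Conceptual sketch of append vs. upsert cost in a point store.
# Hypothetical names; not the production storage engine.
from collections import defaultdict

# series_key -> {timestamp: {field_name: value}}
existing_points = defaultdict(dict)

def apply_append(series_key: str, timestamp: int, fields: dict) -> None:
    """Append-only write: record the new point, no merge required."""
    existing_points[series_key][timestamp] = fields

def apply_upsert(series_key: str, timestamp: int, fields: dict) -> None:
    """Upsert: incoming values must be merged field-by-field with any point
    already stored at the same series/timestamp, so a burst of upserts forces
    a read-merge-rewrite for every point touched."""
    current = existing_points[series_key].get(timestamp, {})
    existing_points[series_key][timestamp] = {**current, **fields}
```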
-
Time: Feb. 1, 2023, 6:38 p.m.
Status: Resolved
Update: This incident has been resolved.
-
Time: Jan. 27, 2023, 7:45 p.m.
Status: Monitoring
Update: We are continuing to monitor for any further issues.
-
Time: Jan. 26, 2023, 10:33 p.m.
Status: Monitoring
Update: The amount of time taken for writes to become readable has returned to the normal operating level. We will continue to monitor and update.
-
Time: Jan. 26, 2023, 7:53 p.m.
Status: Monitoring
Update: We are continuing to monitor for any further issues.
-
Time: Jan. 26, 2023, 6:38 p.m.
Status: Monitoring
Update: The region is currently experiencing elevated times for written information to become queryable along with elevated query run times. All written information is still being safely queued. We are working to minimize disruptions and we will continue to update as the situation evolves.
-
Time: Jan. 26, 2023, 3:06 a.m.
Status: Monitoring
Update: The storage engine has completed processing the bursted traffic. Writes are now queryable within the normal time ranges. Our team will continue to monitor.
-
Time: Jan. 26, 2023, 2:29 a.m.
Status: Identified
Update: The region received another burst of write traffic, causing a backlog in the queueing system. Heavily bursting traffic has now been limited on a per-organization basis, and all organizations impacted by this limitation have been notified. The storage system is currently processing the bursted traffic. We will post an update here once the storage engine completes this processing.
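As an illustration of per-organization limiting, the following is a minimal sketch under assumed names (`OrgLimiter`, `allow_write`) and does not reflect the platform's internals: a token bucket keyed by organization ID lets sustained write traffic through while rejecting heavy bursts from a single organization, leaving other organizations unaffected.

```python
# Minimal per-organization token-bucket sketch (illustrative only; the class
# and method names are hypothetical, not the platform's actual API).
import time
from collections import defaultdict

class OrgLimiter:
    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s          # sustained allowance per org
        self.burst = burst_bytes              # maximum burst size per org
        self.tokens = defaultdict(lambda: burst_bytes)
        self.last_seen = defaultdict(time.monotonic)

    def allow_write(self, org_id: str, size_bytes: int) -> bool:
        """Return True if this organization's write fits within its budget."""
        now = time.monotonic()
        elapsed = now - self.last_seen[org_id]
        self.last_seen[org_id] = now
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens[org_id] = min(self.burst,
                                  self.tokens[org_id] + elapsed * self.rate)
        if self.tokens[org_id] >= size_bytes:
            self.tokens[org_id] -= size_bytes
            return True
        return False  # only this org's burst is rejected; others proceed
```

In this sketch, a rejected write would be retried by the client later, consistent with the behavior described above: only the heavily bursting organization is throttled.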
-
Time: Jan. 26, 2023, 1:21 a.m.
Status: Investigating
Update: We are seeing a return of the delayed reads and writes issue from a few hours ago. Our team is actively investigating.
-
Time: Jan. 25, 2023, 7:53 p.m.
Status: Monitoring
Update: We have identified the primary contributing factor in the performance degradation (delayed writes and subsequent temporary read discrepancies). The impacted region has been receiving a large burst of writes twice daily, which has saturated the storage layer. All writes to this region have been successfully recorded within our queuing system but are taking much longer than expected to become queryable. In response, we have increased the resources available to the storage system. In addition, we have paused all tertiary background processes in the region to expedite recovery. We will continue to provide updates as the storage system recovers and will provide a complete root cause analysis once this incident has been resolved.
-
Time: Jan. 25, 2023, 6:12 p.m.
Status: Identified
Update: We are continuing to work on a fix for this issue.
-
Time: Jan. 25, 2023, 4:06 p.m.
Status: Identified
Update: We are continuing to work on a fix for this issue.
-
Time: Jan. 25, 2023, 2:45 p.m.
Status: Identified
Update: We are continuing to work on a fix for this issue.
-
Time: Jan. 25, 2023, 1:50 p.m.
Status: Identified
Update: We are continuing to work on a fix for this issue.
-
Time: Jan. 25, 2023, 9:29 a.m.
Status: Identified
Update: We have deployed some minor updates to the cluster but are still investigating the issue.
-
Time: Jan. 25, 2023, 7:32 a.m.
Status: Identified
Update: We are continuing to work on a fix for this issue.
-
Time: Jan. 25, 2023, 5:53 a.m.
Status: Identified
Update: We are continuing to work on a fix for this issue.
-
Time: Jan. 25, 2023, 3:44 a.m.
Status: Identified
Update: We are continuing to work on a fix for this issue.
-
Time: Jan. 25, 2023, 1:25 a.m.
Status: Identified
Update: We are continuing to work on a fix for this issue.
-
Time: Jan. 24, 2023, 11:43 p.m.
Status: Identified
Update: We are continuing to work on a fix for this issue.
-
Time: Jan. 24, 2023, 9 p.m.
Status: Identified
Update: We are continuing to work on this issue and a fix is being implemented.
-
Time: Jan. 24, 2023, 7:50 p.m.
Status: Identified
Update: The issue has been identified and a fix is still being implemented.
-
Time: Jan. 24, 2023, 5:41 p.m.
Status: Identified
Update: The issue has been identified and a fix is being implemented.
-
Time: Jan. 24, 2023, 3:57 p.m.
Status: Investigating
Update: The AWS regions specified are experiencing delayed write/query operations and intermittent query failures.
We are continuing to investigate the issue.
-
Time: Jan. 24, 2023, 1:50 p.m.
Status: Investigating
Update: We are continuing to investigate this issue.
-
Time: Jan. 24, 2023, 1:11 p.m.
Status: Investigating
Update: We are currently investigating this issue.