-
Time: Sept. 1, 2020, 7:17 p.m.
Status: Postmortem
Update: Our back-end system runs a large set of diverse queues to keep all the front-end goodies super-fast, snappy, and easy to use. The cluster of servers on which these queues reside ran out of disk space very rapidly, over the span of a few hours this morning.
The cause was a configuration within a single customer instance that triggered a runaway condition. That instance created millions of unexpected post-execution tasks. While handling millions of tasks is not unheard of, the size of the records on which these tasks were to be performed quickly consumed the data disk dedicated solely to the queues, growing to over 350 GB. This database had never previously exceeded 40 GB at peak.
Our alerting systems notified us less than an hour before the incident that we should evaluate disk usage. This type of alert is medium priority in most cases and would normally be handled during business hours. The runaway nature of the problem consumed the remaining quarter of the disk very rapidly, saturating all remaining space for the queues.
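To illustrate the gap, consider an alert that weighs the rate of consumption as well as the amount of free space. The sketch below is purely hypothetical (Python), not our production monitoring: the mount path, thresholds, and severity names are assumptions made only for illustration.

```python
# Hypothetical illustration only: not our production monitoring code.
# Path, thresholds, and severity names are assumptions for this sketch.
import shutil
import time

QUEUE_DATA_PATH = "/data/queues"   # assumed mount point for the queue database
LOW_SPACE_FRACTION = 0.25          # alert once less than 1/4 of the disk remains
CRITICAL_HOURS_TO_FULL = 4         # escalate if the disk would fill within ~4 hours

def check_queue_disk(prev_free_bytes, prev_time, path=QUEUE_DATA_PATH):
    """Return an alert severity based on free space and the rate of consumption."""
    usage = shutil.disk_usage(path)
    now = time.time()

    free_fraction = usage.free / usage.total
    consumed = prev_free_bytes - usage.free              # bytes used since last check
    elapsed_hours = max((now - prev_time) / 3600, 1e-6)
    burn_rate = consumed / elapsed_hours                 # bytes per hour

    hours_to_full = (usage.free / burn_rate) if burn_rate > 0 else float("inf")

    if free_fraction < LOW_SPACE_FRACTION and hours_to_full < CRITICAL_HOURS_TO_FULL:
        severity = "critical"   # runaway growth: page immediately
    elif free_fraction < LOW_SPACE_FRACTION:
        severity = "medium"     # low space but stable: handle in business hours
    else:
        severity = "ok"

    return severity, usage.free, now
```

With a check along these lines, a slowly filling disk stays a medium-priority, business-hours ticket, while runaway growth that would fill the disk within a few hours pages on-call immediately, which is exactly the condition this incident exposed.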
While we could have brought the end-user site up much sooner than we did, we chose to leave it down for an additional 30-45 minutes while we cleared the offending data from the queue. Bringing the site back up without first completing this deletion would have made the problem worse.
All servers in the queue cluster are now fully operational, have plenty of reserve space (the original database is under 20 GB now), and have processed all backed-up items in their queues.
Temporary measures are in place to prevent this from happening again while we work on longer-term, permanent solutions.
Thanks for your patience and understanding. Should anyone seek additional details, please feel free to reach out directly to me at [email protected], or to our support team.
-
Time: Sept. 1, 2020, 7:04 p.m.
Status: Resolved
Update: All formula processing is complete, and the issue is now fully resolved.
-
Time: Sept. 1, 2020, 6:04 p.m.
Status: Identified
Update: We've resolved all remaining issues. All queues are caught up except those with time-dependent formulas. Formulas themselves should calculate immediately, including any that run in the back end for referenced data updates.
We'll clear the remaining Formula tag when all historical formulas have been calculated.
-
Time: Sept. 1, 2020, 3:33 p.m.
Status: Identified
Update: We have identified the root cause of the outage and have resolved the primary issue causing downtime. We have brought all instances back online. Live formulas will also operate as normal.
Back-end services, such as messaging and after-save actions, are temporarily paused while we validate their operation.
The root cause is related to data within one specific after-save task in the back-end services. Once we ensure this issue will not repeat, the services will return to normal. We expect to have this fully resolved shortly.
-
Time: Sept. 1, 2020, 3:24 p.m.
Status: Identified
Update: Onspring instances are back online. However, data processing (such as record saves, formula processing, messaging, etc.) is not yet fully operational. We are actively working to resolve all issues and will follow up with additional information as it becomes available.
-
Time: Sept. 1, 2020, 2:45 p.m.
Status: Identified
Update: We are continuing to work on a fix for this issue and will provide an update as quickly as possible.
-
Time: Sept. 1, 2020, 2:04 p.m.
Status: Identified
Update: Onspring is currently experiencing downtime. We are actively investigating and working to resolve the issue as quickly as possible. We will follow up with more information as it is available.
-
Time: Sept. 1, 2020, 1:53 p.m.
Status: Identified
Update: We are continuing to work on a fix for this issue.
-
Time: Sept. 1, 2020, 1:47 p.m.
Status: Identified
Update: We are continuing to investigate and are working to resolve the issue as quickly as possible. Some Onspring users may continue to experience data processing issues until the issue is fully resolved.
-
Time: Sept. 1, 2020, 1:29 p.m.
Status: Monitoring
Update: We are continuing to monitor. Onspring users may continue to experience data processing issues until the issue is fully resolved.
-
Time: Sept. 1, 2020, 1:14 p.m.
Status: Monitoring
Update: We are continuing to monitor. Onspring users may continue to experience data processing issues until the issue is fully resolved.
-
Time: Sept. 1, 2020, 1:03 p.m.
Status: Monitoring
Update: A fix has been implemented and we are monitoring the results.
-
Time: Sept. 1, 2020, 12:52 p.m.
Status: Investigating
Update: Onspring users may be experiencing temporary data processing issues. We are actively investigating and will provide a resolution as quickly as possible.