Last checked: 7 minutes ago
Get notified about any outages, downtime or incidents for Cronofy and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Cronofy.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Cronofy:
Component | Status |
---|---|
API | Active |
Background Processing | Active |
Developer Dashboard | Active |
Scheduler | Active |
Conferencing Services | Active |
GoTo | Active |
Zoom | Active |
Major Calendar Providers | Active |
Apple | Active |
Google | Active |
Microsoft 365 | Active |
Outlook.com | Active |
View the latest incidents for Cronofy and check for official updates:
Description: On Tuesday, February 22nd 2022 our US data center experienced 95 minutes of degraded performance between 15:45 and 17:20 UTC. This was caused by the primary PostgreSQL database hitting bandwidth limits and its performance being throttled as a result. The situation was caused or exacerbated by PostgreSQL's internal housekeeping working on two of our largest tables at the same time. To our customers this would have surfaced as interactions with the US Cronofy platform, i.e. using the website or API, being much slower than normal. For example, the 99th percentile of API response times is usually around 0.5 seconds; during this incident it peaked at around 14 seconds. We have upgraded the underlying instances of this database, broadly doubling capacity and putting us far from the limit we were hitting.

## Timeline

_All times UTC on Tuesday, February 22nd 2022 and approximate for clarity._

**15:45** Our primary database in our US data center started showing signs of performance degradation.

**16:05** First alert received by the on-call engineer for a potential performance issue. Attempts were made to reduce load on the database through interventions such as temporarily disabling some of its background housekeeping processes.

**16:45** Incident opened on our status page informing customers of degraded performance in the US data center.

**17:00** Began provisioning more capacity for the primary database as a fallback plan if efforts continued to be unsuccessful.

**17:10** New capacity available.

**17:15** Failed over to fully take advantage of the new capacity by promoting the larger node to be the writer.

**17:20** Performance had returned to normal levels in the US data center.

**17:45** Decided we could close the incident.

**18:00** Decided to lock in the capacity change and provisioned an additional reader node at the new size.

**18:15** Removed the smaller nodes from the database cluster.

## Actions

Whilst there was not an outage, this felt like a close call for us. This led to three key questions:

* Why had we not foreseen this capacity issue?
* Could the capacity issue have been prevented?
* Why had we not resolved the issue sooner?

### Foreseeing the capacity issue

We had recently performed a major version upgrade on this database, and in the following weeks monitored performance closely. If there was a time we should have spotted a potential issue in the near future, this was it. We believe we may have focussed too heavily on CPU and memory metrics in our monitoring, when it was networking capacity that led to this degradation in performance. We will be reviewing our monitoring to set alerts that would have pointed us in the right direction sooner, as well as lower-priority alerts that would flag an upcoming capacity issue days or weeks in advance.

### Preventing the capacity issue

As PostgreSQL's internal housekeeping processes appeared to contribute significantly to the problem, we will be revisiting the configuration of these processes and seeing if they can be altered to reduce the likelihood of such an impact in future.

### Resolving the issue sooner

As this was a performance degradation rather than an outage, the scale of the problem was not clear. This led to the on-call engineer investigating the issue whilst performance degraded further without additional alerts being raised. We will be adding additional alerts relating to performance degradation in several subsystems to make the impact of a problem clearer to the on-call engineer.

We are also updating our guidance on incident handling for the team to encourage switching to a more visible channel for communication sooner, and encouraging the escalation of alerts to involve other on-call engineers, particularly when the cause is not immediately clear.

## Further questions?

If you have any further questions, please contact us at [[email protected]](mailto:[email protected])
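The postmortem above points to PostgreSQL's internal housekeeping (autovacuum) hitting two of the largest tables at once, and to revisiting its configuration. As a minimal sketch only, using hypothetical table names, values, and connection details rather than Cronofy's actual settings, per-table autovacuum storage parameters can be adjusted so that very large tables are vacuumed in smaller, earlier passes:

```python
# Minimal sketch, not Cronofy's actual change: per-table autovacuum overrides so
# PostgreSQL's housekeeping runs in smaller, earlier passes on very large tables.
# Table names, parameter values, and the DSN below are hypothetical.
import psycopg2

# Hypothetical large tables singled out for per-table overrides.
LARGE_TABLES = ["events", "calendar_sync_jobs"]

# Vacuum once ~1% of a table's rows are dead (the server default is 20%), and let
# each vacuum pass do more work between its throttling pauses.
AUTOVACUUM_SETTINGS = (
    "autovacuum_vacuum_scale_factor = 0.01, "
    "autovacuum_vacuum_cost_limit = 1000"
)


def apply_autovacuum_overrides(dsn: str) -> None:
    """Store autovacuum overrides as storage parameters on each large table."""
    with psycopg2.connect(dsn) as conn:  # commits on clean exit
        with conn.cursor() as cur:
            for table in LARGE_TABLES:
                # Identifiers cannot be bound as query parameters; the list is fixed.
                cur.execute(f"ALTER TABLE {table} SET ({AUTOVACUUM_SETTINGS})")


if __name__ == "__main__":
    apply_autovacuum_overrides("dbname=app host=localhost user=admin")
```

Starting vacuum before dead rows pile up on the largest tables may reduce the chance of two long housekeeping passes coinciding, as they did during this incident, though the right values depend heavily on the workload.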
Status: Postmortem
Impact: Minor | Started At: Feb. 22, 2022, 4:51 p.m.
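The postmortem also cites a 99th-percentile API response time of roughly 0.5 seconds in normal operation versus around 14 seconds at the peak, and commits to adding more degradation alerts. As a rough illustration of that kind of check, assuming a made-up threshold factor, window, and data source rather than Cronofy's monitoring stack:

```python
# Minimal sketch of a p99 latency degradation check, in the spirit of the extra
# alerts the postmortem describes. The baseline matches the ~0.5 s figure quoted
# there; the threshold factor, window, and wiring are made up for illustration.
import statistics

BASELINE_P99_SECONDS = 0.5  # usual p99 API response time quoted in the postmortem
DEGRADATION_FACTOR = 4      # alert once p99 exceeds 4x the baseline (illustrative)


def p99(samples: list[float]) -> float:
    """99th percentile of a window of response times (needs at least 2 samples)."""
    return statistics.quantiles(samples, n=100)[98]


def check_latency(samples: list[float]) -> str | None:
    """Return an alert message if the window's p99 looks degraded, else None."""
    observed = p99(samples)
    if observed > BASELINE_P99_SECONDS * DEGRADATION_FACTOR:
        return (f"API p99 is {observed:.1f}s against a {BASELINE_P99_SECONDS:.1f}s "
                f"baseline - page the on-call engineer")
    return None


if __name__ == "__main__":
    # Mostly normal traffic with a slow tail, roughly like the incident's profile.
    window = [0.3] * 95 + [9.0, 11.0, 12.5, 13.5, 14.0]
    print(check_latency(window) or "p99 within normal range")
```

Alerting on user-facing latency rather than on host metrics alone reflects the postmortem's observation that CPU and memory monitoring missed a bottleneck that was actually in network bandwidth.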
Description: At approximately 17:00 UTC we observed a much higher number of errors for Google Calendar API calls than we would expect (mostly 503 Service Unavailable responses) across all of our data centers. There does not appear to have been a pattern to the accounts affected. We decided to open an incident at 17:10 UTC to inform customers of potential service degradation, as it seemed like it could be a more persistent issue. Whilst we were opening this incident, errors when communicating with the Google Calendar API returned to normal levels at around 17:12 UTC. Errors have remained at normal levels since that time, so we are resolving this incident.
Status: Resolved
Impact: Minor | Started At: Jan. 27, 2022, 5:14 p.m.
Description: Our Engineering team has resolved the Scheduler issue, and users can now log in again. Please get in touch with [email protected] if you have any further questions.
Status: Resolved
Impact: Major | Started At: Jan. 10, 2022, 3:52 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.