Last checked: 4 minutes ago
Get notified about any outages, downtime or incidents for Cronofy and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for Cronofy.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for Cronofy:
Component | Status |
---|---|
API | Active |
Background Processing | Active |
Developer Dashboard | Active |
Scheduler | Active |
Conferencing Services | Active |
GoTo | Active |
Zoom | Active |
Major Calendar Providers | Active |
Apple | Active |
Google | Active |
Microsoft 365 | Active |
Outlook.com | Active |
View the latest incidents for Cronofy and check for official updates:
Description: On Wednesday, 13th July 2022 we experienced up to 50 minutes of degraded performance in all of our data centers between 16:10 and 17:00 UTC. This was caused by an upgrade to our Kubernetes clusters (how the Cronofy platform is hosted) from version 1.20 to 1.21. This involves upgrading several components, one of which, CoreDNS, was the source of this incident. CoreDNS was being upgraded from version 1.8.3 to 1.8.4, as this is the AWS-recommended version to use with Kubernetes 1.21 hosted on Amazon's Elastic Kubernetes Service. Upgrading these components is usually a zero-downtime operation and so was being performed during working hours. Reverting the update to components, including CoreDNS, resolved the issue.

This would have presented as interactions with the Cronofy platform and calendar synchronization operations taking longer than usual. For example, the 99th percentile of Cronofy API response times is usually around 0.5 seconds, while during the incident it increased to around 5 seconds. Calendar synchronization operations were delayed by up to 30 minutes during the incident.

Our investigations following the incident identified that CoreDNS version 1.8.4 included a regression in behavior from 1.8.3 which caused the high level of errors within our clusters, leading to the performance degradation. We are improving our processes around such infrastructure changes to avoid similar incidents in future.

# Timeline

_All times UTC on Wednesday, 13th July 2022 and approximate for clarity_

**16:10** Upgrade of components including CoreDNS started across all data centers.
**16:15** Upgrade completed.
**16:16** First alert received relating to the US data center. Manual checks show that the application was responding.
**16:18** Second alert received for degraded background worker performance in the CA and DE data centers. Investigations show that CPU utilization is high on all servers, in all Kubernetes clusters. Additional servers were provisioned automatically and then more added manually.
**16:19** Multiple alerts being received from all data centers.
**16:31** This incident was opened on our status page informing customers of the issue. We decided to roll back the component upgrade.
**16:45** As the components including CoreDNS were rolled back in each data center, errors dropped to normal levels and performance improved.
**16:47** Rollback completed. The backlog of background work was being processed.
**17:00** The backlog of background work was cleared.
**17:05** Incident status changed to monitoring.
**17:49** Incident closed.

# Actions

Although there wasn't an outage, we certainly want to prevent this from happening again in the future. This led us to ask three questions:

1. Why was this not picked up in our test environment?
2. What could we have done to identify the root cause sooner?
3. How could the impact of the change be reduced?

## Why was this not picked up in our test environment?

Although this was tested in our test environment, the time between finishing the testing and deploying the change to the production environments was too short. This meant that we missed the performance degradation it introduced. We are going to review the test plan for such infrastructure changes in our test environment. This will include a soaking period: we will wait a set amount of time between implementing new changes in our test environment and rolling them out to the production environments.
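To make the soaking-period idea concrete, here is a minimal sketch. The names (`InfraChange`, `SOAK_PERIOD`, `alerts_since`) and the two-day soak are assumptions for illustration only and do not reflect Cronofy's actual tooling.

```python
# A minimal sketch of a "soaking period" gate for infrastructure changes.
# All names and the two-day soak are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

SOAK_PERIOD = timedelta(days=2)  # assumed; the post does not state a duration

@dataclass
class InfraChange:
    name: str
    applied_to_test_at: datetime

def alerts_since(environment: str, since: datetime) -> int:
    """Placeholder for a query against the monitoring/alerting system."""
    return 0

def ready_for_production(change: InfraChange) -> bool:
    """Allow promotion only after the change has soaked in the test
    environment for SOAK_PERIOD with no alerts raised in that window."""
    now = datetime.now(timezone.utc)
    soaked = now - change.applied_to_test_at >= SOAK_PERIOD
    quiet = alerts_since("test", change.applied_to_test_at) == 0
    return soaked and quiet

change = InfraChange(
    name="CoreDNS 1.8.3 -> 1.8.4",
    applied_to_test_at=datetime(2022, 7, 11, 9, 0, tzinfo=timezone.utc),
)
print(ready_for_production(change))
```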
## What could we have done to identify the root cause sooner?

Previous Kubernetes upgrades had been straightforward, which led to over-confidence. Multiple infrastructure components were changed at once, so we were unable to easily identify which component was responsible. In future, we will split infrastructure component upgrades into multiple phases to help identify the cause of any problems that occur.

## How could the impact of the change be reduced?

As mentioned above, previous Kubernetes upgrades had been straightforward, which led to over-confidence. We rolled out the component updates, including CoreDNS, to all environments in a short amount of time, and it wasn't until they had all been completed that we started to receive alerts. To prevent this from happening again, we are going to phase the rollout of such changes to our production environments. This means an issue like this will only affect some environments rather than all of them, reducing the impact and aiding a faster resolution. A rough sketch of this approach follows below.

# Further questions?

If you have any further questions, please contact us at [[email protected]](mailto:[email protected])
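For illustration only, here is a minimal sketch of the phased rollout described in the actions above. The data-center names, bake time, and helper functions (`apply_upgrade`, `healthy`, `roll_back`) are assumptions for the example, not Cronofy's actual deployment tooling.

```python
# A minimal sketch of a phased production rollout: upgrade one data center at
# a time, wait for alerts to surface, and revert everything applied so far if
# health checks fail. All names and values here are illustrative assumptions.
import time

DATA_CENTERS = ["US", "CA", "DE"]  # ordering is illustrative
BAKE_TIME_SECONDS = 5  # in practice this would be far longer, e.g. 30+ minutes

def apply_upgrade(dc: str) -> None:
    print(f"applying component upgrade in {dc}")

def roll_back(dc: str) -> None:
    print(f"rolling back component upgrade in {dc}")

def healthy(dc: str) -> bool:
    """Placeholder for checks such as p99 API latency and error rates."""
    return True

def phased_rollout() -> None:
    upgraded: list[str] = []
    for dc in DATA_CENTERS:
        apply_upgrade(dc)
        upgraded.append(dc)
        time.sleep(BAKE_TIME_SECONDS)  # let alerts surface before continuing
        if not healthy(dc):
            # Revert only the environments touched so far, limiting the
            # blast radius to a subset of data centers.
            for done in reversed(upgraded):
                roll_back(done)
            return
    print("rollout complete in all data centers")

phased_rollout()
```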
Status: Postmortem
Impact: Minor | Started At: July 13, 2022, 4:31 p.m.
Description: Cronofy's calls to Zoom's API experienced a heightened number of errors for roughly 40 minutes starting at around 14:00 UTC. Normal operation has been restored for around an hour, and our spot checks indicate that conferencing details were eventually provisioned as expected.
Status: Resolved
Impact: Minor | Started At: June 21, 2022, 2:40 p.m.
Description: An internal process initiated from our centralized billing system appears to be responsible for rendering our UK data center largely unreachable between 11:04 UTC and 11:06 UTC. Our internal billing-related API was invoked at such a rate that our web servers were starved of resources for handling further requests. We will be reviewing this process and others like it to avoid such things happening in future.
Status: Resolved
Impact: None | Started At: May 3, 2022, 11:09 a.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.