Last checked: 9 minutes ago
Get notified about any outages, downtime or incidents for Cronofy and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Cronofy.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
Outlogger tracks the status of these components for Cronofy:
Component | Status |
---|---|
API | Active |
Background Processing | Active |
Developer Dashboard | Active |
Scheduler | Active |
Conferencing Services | Active |
GoTo | Active |
Zoom | Active |
Major Calendar Providers | Active |
Apple | Active |
Google | Active |
Microsoft 365 | Active |
Outlook.com | Active |
View the latest incidents for Cronofy and check for official updates:
Description: US data center performance has remained normal and the incident is resolved. Around 00:56 UTC inbound traffic to api.cronofy.com and app.cronofy.com began to show signs of performance degradation. This was observed to be an issue routing traffic from our load balancers to their respective target groups and on to our servers. This resulted in an increase in processing time which, in turn, resulted in some requests timing out. By 01:04 UTC the issue with the load balancers routing traffic had been resolved and traffic flow returned to usual levels. A small backlog of requests was worked through by 01:10 UTC and normal operations resumed. A postmortem of the incident will take place and be attached to this incident in the next 48 hours. If you have any queries in the interim, please contact us at [email protected].
Status: Resolved
Impact: Minor | Started At: Oct. 2, 2024, 1:42 a.m.
Description: Between 15:16-15:18 and 15:44-15:46 UTC we experienced degraded performance in our US data center. During these times, a little under 3% of requests to api.cronofy.com and app.cronofy.com resulted in a server error that potentially affected API integrators and Scheduler users. These errors coincided with an AWS issue in the North Virginia region - https://status.aws.amazon.com/#multipleservices-us-east-1_1727378355 - where load balancer target groups experienced slower than normal registration times. We are recording this incident retrospectively because, whilst we were aware of the issue with target groups, a gap in our alerting led us to believe there was no related impact to customers. That gap has now been filled. If you have any questions, please email [email protected].
Status: Resolved
Impact: None | Started At: Sept. 26, 2024, 3 p.m.
Description: On Monday April 22nd between 11:00 and 13:30 UTC our background processing services had a major performance degradation, meaning background work was delayed for around 2 hours in some cases. This impacted operations such as synchronizing schedules to push events into calendars and to update people's availability.

A change in our software's dependencies led to our background processors pulling work from queues but not processing that work as expected. This led to work messages being stuck in a state where the queues believed they were being worked on, so did not allow other background processors to perform the work instead. For a subset of the background processing during this period we had to wait for a configured timeout of 2 hours to expire, at which point the background work messages became available again and the backlog was cleared. Full service was resumed to all data centers, including processing any delayed messages, by 13:30 UTC. Further details, lessons learned, and further actions we will be taking can be found below.

## Timeline

_All times rounded for clarity and UTC_

On Monday April 22nd at 10:55 a change was merged which incorporated some minor version changes in dependencies that we use to interact with AWS services. This was to facilitate work against an AWS service we were not previously using. This change in dependencies interacted with a dependency that had not changed such that our calls to fetch work messages from AWS Simple Queue Service (SQS) reported as containing no messages when in fact they did. This meant that messages were being processed as far as AWS SQS was concerned (in-flight), but our application code did not see them in order to process them.

This change went live from 10:58, with the first alert as a result of the unexpected behavior being triggered at 11:12. The bad change was reverted at 11:15 and fully removed by 11:20. This meant that background work queued between 10:58 and 11:20 was stuck in limbo where AWS SQS thought it was being processed.

For our data centers in Australia, Canada, UK, and Singapore regular service was resumed at this point. New messages could be received and processed, and we could only wait for the messages in limbo to be released by AWS SQS to process those.

In our German and US data centers we had hit a hard limit of SQS, with 120,000 messages being considered "in flight" for our high priority queue. This meant that we were unable to read from those queues, but were still allowed to write to them. Once we realised and understood this issue, we released a change to route all new messages to other queues and avoid this problem. This was in place at 12:00.

Whilst we were able to make changes to remove the initial problem and avoid the effects of the secondary problem caused by hitting the hard limit, the status of the individual work messages was outside of our control. AWS SQS does not have a way to force messages back onto the queue, which is the operation we needed to resolve the issue. We looked for other alternatives, but the work messages aren't accessible in any way via AWS APIs when in this state. Instead we had to wait for the configured timeout to expire, which would release the messages again. We took more direct control over capacity throughout this incident, including preparing additional capacity for the backlog of work messages being released.
Once the work messages became visible after reaching their two hour timeout, we were able to process them successfully, with full service resumed to all data centers, including processing any delayed messages, by 13:30 UTC. We then reverted changes applied during the incident to help handle it, returning things back to their regular configuration.

## Retrospective

The questions we ask ourselves in an incident retrospective are:

* Could it have been identified sooner?
* Could it have been resolved sooner?
* Could it have been prevented?

Also, we don't want to focus too heavily on the specifics of an individual incident, but instead look for holistic improvements alongside targeted ones.

### Could it have been identified sooner?

For something with this significant an impact, taking 12 minutes to alert us was too slow. Halving the time to alert would have significantly reduced the impact of this incident, potentially avoiding the second-order issue experienced in our German and US data centers. The false-negative nature of the behavior meant that other safeguards were not triggered. Cronofy's code was not mishandling or ignoring an error; the silent failure meant our application code was unaware of a problem.

### Could it have been resolved sooner?

The key constraint on the resolution of the incident was the "in flight" timeout we had configured for the respective queues. We don't want to rush such a change to a critical part of our infrastructure, but our initial analysis suggests a timeout of 15-30 minutes is likely reasonable and would have made a significant difference to the time to full service recovery.

### Could it have been prevented?

As the cause was a change deployed by ourselves rather than an external factor, undoubtedly. In hindsight, anything touching AWS-related dependencies must always be tested in our staging environment, and this change was not. This would likely have led to the issue being noticed before being deployed at all.

## Actions

We will be creating additional alerts around metrics that went well outside of normal bounds and that would have drawn our attention much sooner.

We will be reducing the timeouts configured on our AWS SQS queues to reduce the time messages are considered "in-flight" without any other interaction, aligning more closely with observed background processing execution times.

We are changing how we reference AWS-related dependencies to make them more explicit, alongside carrying a warning to ensure full testing is performed in our staging environment first. We will also be adding the AWS dependencies to our quarterly patching cycle to keep them contemporary, reducing the possibility of such cross-version incompatibilities.

## Further questions?

If you have any further questions, please contact us at [[email protected]](mailto:[email protected])
Status: Postmortem
Impact: Major | Started At: April 22, 2024, 11:23 a.m.
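The postmortem above hinges on SQS's visibility-timeout behaviour: once a consumer receives a message it is "in flight" and hidden from other consumers until it is explicitly deleted or the timeout expires. The sketch below is a minimal illustration of that behaviour, plus the stated follow-up action of shortening the queue's visibility timeout, using boto3; the queue URL, timeout values, and `handle_work` function are hypothetical placeholders, not Cronofy's actual configuration.

```python
# Minimal sketch of the SQS "in flight" semantics described in the postmortem,
# using boto3. QUEUE_URL, the timeout values, and handle_work are hypothetical
# placeholders, not Cronofy's actual configuration.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"


def handle_work(body: str) -> None:
    """Placeholder for the application-level processing of one work message."""
    print(f"processing: {body}")


def process_batch() -> None:
    # Receiving a message moves it to the "in flight" state: it stays hidden
    # from other consumers until it is deleted or its visibility timeout expires.
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    for message in response.get("Messages", []):
        handle_work(message["Body"])
        # Only an explicit delete removes the message from the queue. If a
        # consumer receives a message but never processes or deletes it (as
        # happened during the incident), SQS releases it again only once the
        # visibility timeout elapses.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])


# One of the stated follow-up actions: shorten the queue's visibility timeout so
# stuck messages become visible again sooner (value is in seconds, as a string).
sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={"VisibilityTimeout": "1800"},  # e.g. 30 minutes instead of 2 hours
)
```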
Description: At 07:47 UTC, we saw a sharp increase in the number of Service Unavailable errors being returned from Apple's calendar servers across all of our data centers, causing sync operations to fail. This was escalated to our engineering team, who investigated and found that no other calendar providers were affected, so the issue was likely not within our infrastructure. However, very few operations between Cronofy and Apple were succeeding. At 08:20 UTC, we opened an incident to mark Apple sync as degraded, as customers may have seen an increased delay in calendar sync. This coincided with a sharp drop in the level of failed network calls, which returned to normal levels at 08:18 UTC. The service stabilized and Cronofy automatically retried failed sync operations to reconcile calendars. Over the next hour, we saw communications with Apple return to a mostly healthy state, though there were still occasional spikes in the number of errors. Cronofy continued to automatically retry failed operations, so the impact on users was minimal. At 09:15 UTC, these low numbers of errors decreased back to baseline levels and stayed there. As we have now seen more than 30 minutes of completely healthy service, we are resolving the incident.
Status: Resolved
Impact: Minor | Started At: Nov. 30, 2023, 8:20 a.m.
Description: Error rates from Apple's API have returned to normal levels, and calendar syncs for Apple-backed calendars are healthy again. Apple Calendar sync performance was degraded from 13:36 UTC until 15:43 UTC. During this time no other calendar provider's sync operations were affected.
Status: Resolved
Impact: Minor | Started At: Oct. 28, 2023, 2:07 p.m.
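Both Apple incidents above note that failed sync operations were retried automatically once the provider recovered. As a generic illustration only, not Cronofy's actual implementation, retrying a flaky upstream call with exponential backoff and jitter looks roughly like the sketch below; `sync_calendar` and `UpstreamUnavailable` are hypothetical placeholders for a provider sync call and a 503-style "Service Unavailable" error.

```python
# Generic illustration of retrying a flaky upstream call with exponential backoff
# and jitter. This is NOT Cronofy's actual implementation; sync_calendar and
# UpstreamUnavailable are hypothetical placeholders.
import random
import time


class UpstreamUnavailable(Exception):
    """Raised when the upstream calendar provider returns a transient error."""


def sync_calendar(calendar_id: str) -> None:
    """Placeholder for a sync operation against a provider such as Apple."""
    raise UpstreamUnavailable(f"503 from provider while syncing {calendar_id}")


def retry_with_backoff(operation, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry `operation` on transient errors, doubling the delay each attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except UpstreamUnavailable:
            if attempt == max_attempts:
                raise  # give up and let the caller reschedule the sync
            # Jitter spreads retries out so a brief provider outage is absorbed
            # without every worker hitting the provider at the same moment.
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))


# Example usage: retry a single calendar sync a few times before giving up.
# retry_with_backoff(lambda: sync_calendar("cal_123"))
```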
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or down time. Join for free - no credit card required.