Get notified about any outages, downtime or incidents for Cronofy and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Cronofy.
OutLogger tracks the status of these components for Cronofy:
Component | Status |
---|---|
API | Active |
Background Processing | Active |
Developer Dashboard | Active |
Scheduler | Active |
Conferencing Services | Active |
GoTo | Active |
Zoom | Active |
Major Calendar Providers | Active |
Apple | Active |
Active | |
Microsoft 365 | Active |
Outlook.com | Active |
View the latest incidents for Cronofy and check for official updates:
Description: Since launching support for provisioning video conferencing when creating a calendar event in 2020, we have used 8x8.vc links as an explicit option and as an anonymous, browser-based conferencing fallback when calendar-native conferencing providers such as Google Meet and Microsoft Teams are not available. 8x8 accounts for around 1% of the video conferencing we have provisioned in recent weeks, the other 99% being made up of calendar-integrated conferencing solutions like Google Meet and standalone providers like Zoom.

8x8 removed support for 8x8.vc links being used anonymously earlier this week, without any notification and with no available workaround. Ideally, we would have received prior notification from 8x8 of this change so that we could have managed a graceful transition. As that did not happen, we will regrettably be dropping support for 8x8 as a conferencing option with immediate effect.

When using our API, "8x8" can be selected explicitly, or chosen as a fallback when using the "default" conferencing option. We have released a change which stops the generation of the anonymous, browser-based 8x8.vc conferencing links, so calendar events will be created, but without any conferencing details. Put another way, "8x8" will have a similar effect to providing "none", and there will no longer be a catch-all conferencing option provisioned when using "default".

https://docs.cronofy.com/developers/api/conferencing-services/create/#param-conferencing.profile_id

We will continue to accept both values so as not to break any existing integrations. "8x8" has been deprecated and the documentation relating to "default" has been updated to reflect this change in behavior.

If you subscribe to notifications based on conferencing being provisioned, you will be notified of the failure to provision any conferencing in cases where 8x8 would previously have been used.

https://docs.cronofy.com/developers/api/conferencing-services/subscriptions/

We truly regret this situation and can only apologize for the disruption this has caused. As there is no further action we are able to take on this, we are resolving this incident.

If you have any further questions, please contact us at [email protected]
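For integrators, the change described above can be illustrated with a minimal sketch. It builds an event body carrying `conferencing.profile_id` (the parameter named in the documentation link above) and classifies what to expect for the three values the incident mentions. The helper names, the other event fields, and the "never"/"maybe"/"expected" labels are assumptions made for illustration, not part of Cronofy's API, and no request is actually sent.

```python
import warnings

# Value the incident report describes as deprecated.
DEPRECATED_PROFILE_IDS = {"8x8"}


def build_event_payload(event_id, summary, start, end, profile_id="default"):
    """Build an event body with a conferencing.profile_id.

    The surrounding field names (event_id, summary, start, end) are
    assumptions for illustration; check the API reference before use.
    """
    if profile_id in DEPRECATED_PROFILE_IDS:
        warnings.warn(
            f"profile_id {profile_id!r} is deprecated and now behaves like 'none'",
            DeprecationWarning,
        )
    return {
        "event_id": event_id,
        "summary": summary,
        "start": start,
        "end": end,
        "conferencing": {"profile_id": profile_id},
    }


def conferencing_outlook(profile_id):
    """Hypothetical helper: should the caller expect conferencing details?"""
    if profile_id in ("none", "8x8"):
        # "8x8" is still accepted but no longer generates a link.
        return "never"
    if profile_id == "default":
        # No catch-all fallback any more, so details may be absent.
        return "maybe"
    # Any other explicit profile is assumed to provision as before.
    return "expected"


payload = build_event_payload(
    "interview-1", "Interview", "2023-10-25T09:00:00Z", "2023-10-25T10:00:00Z"
)
print(payload["conferencing"], conferencing_outlook("default"))
```

Callers that previously relied on "default" always yielding a link would, under these assumptions, treat the "maybe" case as one where the event can arrive without conferencing details.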
Status: Resolved
Impact: Critical | Started At: Oct. 20, 2023, 9:35 a.m.
Description: Calendar sync for Google-backed calendars has remained healthy since the previous message, so we are considering this as resolved. Google have updated their incident record of the underlying issue, where they likewise consider it resolved: https://www.google.com/appsstatus/dashboard/incidents/7uJZ5F1Uy4n1n74iMacQ
Status: Resolved
Impact: Minor | Started At: Sept. 21, 2023, 1:40 p.m.
Description: On Monday 21st August between 09:09 and 09:20 UTC all API calls creating or deleting events failed. Users of the Scheduler were unaffected, as operations were retried automatically after 09:20 UTC.

This outage was caused by a bug in a change to our API request journalling, which records each API request received by Cronofy.

## Timeline

At 09:04 a deployment was triggered including an update to our API request journal.

At 09:09 the deployment began rolling out, and the change came into force. Seconds later, an alert was triggered and engineers began investigating.

At 09:11 an additional alarm triggered for our Site Reliability team informing them of an increase in the number of failing API requests.

At 09:15, with many more alerts triggering, we triggered a further deployment reversing the change.

At 09:19 all deployments reverting the change completed, and the last error was observed.

## Retrospective

We ask three primary questions in our retrospective:

* Could we have resolved it sooner?
* Could we have identified it sooner?
* Could we have prevented it?

After identification, the issue was resolved in approximately 4 minutes. We believe our automated deployment pipeline strikes a good balance between speed and robustness so no significant improvement can be found here.

The change had been highlighted as one in a risky area and had passed code review. Due to the anticipated risk, an engineer was actively checking for errors after the deployment. It took around 6 minutes from the first error being seen to making the call to revert the change. Given the severity of the issue, this was too long and we have taken action to avoid this in future.

The change was being made in a critical area of our platform. This is an area that has recently been under development. Manual testing was performed against our staging environment but failed to exercise the affected path. Our reviews focussed too heavily on the intended change in behavior. We missed the unintended side effects of the change which led to this issue. Our automated tests for this area were not as comprehensive as we thought and did not detect the bug either.

## Actions

Automated tests in the area have been reviewed and expanded to provide more certainty when making changes in this area. This will prevent such changes passing review.

We’ve strengthened guard clauses in this area to produce more descriptive errors, earlier, if a similar mistake were to be made in future. This will both prevent such changes passing review, and in the worst case aid faster identification of issues.

We’ve altered our playbook for deploying high-risk code changes to recommend at least two engineers are present and monitoring errors and telemetry. This will improve our chances of identifying issues sooner.
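As a rough illustration of the guard-clause idea in the actions above, the sketch below validates a request before it is journalled and fails early with a descriptive message. The post-mortem does not describe the actual bug or the journal's structure, so the field names, the `JournalEntryError` class, and `build_journal_entry` are all hypothetical.

```python
class JournalEntryError(ValueError):
    """Raised when a request cannot be journalled (hypothetical error type)."""


# Hypothetical set of fields a journal entry needs.
REQUIRED_FIELDS = ("request_id", "method", "path")


def build_journal_entry(request):
    """Return a journal entry, or fail early with a descriptive error."""
    # Guard 1: reject the wrong type up front instead of failing deeper in.
    if not isinstance(request, dict):
        raise JournalEntryError(
            f"expected a dict describing the request, got {type(request).__name__}"
        )
    # Guard 2: name every missing or empty field in one message.
    missing = [field for field in REQUIRED_FIELDS if not request.get(field)]
    if missing:
        raise JournalEntryError(
            "cannot journal request: missing " + ", ".join(missing)
        )
    return {field: request[field] for field in REQUIRED_FIELDS}


try:
    build_journal_entry({"request_id": "abc", "method": "POST"})
except JournalEntryError as error:
    print(error)
```

The point of the pattern is that a malformed input surfaces at the boundary with a message naming the problem, which is easier to catch in review, in tests, and in telemetry than a generic failure further down the call stack.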
Status: Postmortem
Impact: Major | Started At: Aug. 21, 2023, 9:31 a.m.
Description: On Wednesday 12th July between 14:00 and 16:15 UTC users of the Cronofy Scheduler extensions for Outlook and Zendesk were unable to access the extension. Other instances of the Scheduler, such as the Chrome Extension, integrations such as with Greenhouse, and the web version of the Scheduler, continued to operate normally.

The underlying cause was that the Cronofy Outlook add-in and Zendesk App were not manually validated during the release of a change to the Scheduler extension.

In line with our principles, we are publishing this public post-mortem to explain why this happened, and what we will do to prevent this occurring again.

## Timeline

_Times are from Wednesday 12th July 2023, in UTC and rounded for clarity._

At 14:04 we deployed an update to our extensions. This had gone through our normal request and review process.

At 15:49 one of our customers reported that they were unable to use the Outlook add-in to create a scheduling request. The customer observed a spinning progress wheel, and the Scheduler form did not load.

At 15:54 our support engineers replicated the issue in their own Outlook add-in, and escalated the issue internally to our first responder.

At 16:08 our engineering team located the problem, and identified the original change that caused the problem.

At 16:10 we reverted the change, and deployed this immediately. We checked this internally to verify that this deployment corrected the problem, and the Zendesk and Outlook extensions were working again.

At 16:20 the customer confirmed that the issue was resolved.

## Retrospective

We ask three primary questions in our retrospective:

* Could we have resolved it sooner?
* Could we have identified it sooner?
* Could we have prevented it?

The root cause for this issue is twofold. Firstly, this area is difficult to create automated tests around, as it requires the extension to be loaded inside of Outlook or Zendesk to trigger. Secondly, and more importantly, given that we know about the lack of automated tests, we failed to manually test this change to the loading process of the extension using the Outlook add-in or Zendesk App.

There is a different build process that affects the Outlook and Zendesk versions of the extension, where the extension is loaded in a different way. This alternate loading method triggered a bug that did not exist in the other extensions.

Once we were made aware of this issue by our customer, we resolved it in under 30 minutes. We don’t feel we can improve our response time, but we see having to be notified by a customer as a failure.

From an identification perspective, we should have identified this ourselves by manually checking the Outlook or Zendesk extensions once we had deployed the change. We favour preventing the issue over earlier identification. In the future, we could have an event that triggers in the extension if the Scheduler form fails to load, and informs a separate errors service.

We feel that with some small improvements to the guidance we give our engineers, we can prevent an issue like this from happening again.

## Actions to be taken

* We will ensure that engineers are familiar with the differences between the extension build processes, making it clear which areas require manual testing. We will also cover what to be aware of when publishing changes that affect multiple different platforms at the same time.
* We will create internal guidance listing all the extensions, and how to properly check each extension.
* We will add an additional hint to our pull request template when extension files are being changed which specifically calls out to the engineer creating the PR and the engineers reviewing it that they should examine the impact to all extensions.

We have considered adding more automated testing to this area of the solution, and we plan on discussing this in more detail within the department. Tests in this area have historically given a poor return on investment.

## Further questions?

If you have any further questions, please contact us at [email protected]
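The pull request hint listed in the actions above could be approximated by a small check over the files a PR changes. This is only a sketch under stated assumptions: the `extensions/` directory prefix and the function name are hypothetical, and the platform list is limited to the extensions named in the post-mortem.

```python
# Hypothetical location of extension source files in the repository.
EXTENSION_PATH_PREFIXES = ("extensions/",)

# Extension platforms named in the post-mortem; the real list may be longer.
PLATFORMS = ("Outlook add-in", "Zendesk App", "Chrome Extension")


def extension_review_hint(changed_files):
    """Return a reminder for author and reviewers, or None if not needed."""
    # str.startswith accepts a tuple of prefixes.
    touched = sorted(
        path for path in changed_files if path.startswith(EXTENSION_PATH_PREFIXES)
    )
    if not touched:
        return None
    lines = ["This PR changes extension files:"]
    lines += [f"  - {path}" for path in touched]
    lines.append("Please manually check the impact on every extension:")
    lines += [f"  [ ] {platform}" for platform in PLATFORMS]
    return "\n".join(lines)


print(extension_review_hint(["extensions/loader.js", "README.md"]))
```

A check like this only prompts the manual testing the post-mortem calls for; it does not replace loading the extension inside Outlook or Zendesk.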
Status: Postmortem
Impact: Minor | Started At: July 12, 2023, 3 p.m.
Description: AWS's us-east-1 region, where our US data center is hosted, experienced an issue affecting some services Cronofy's platform relies upon. The impact to service was low, at its peak resulting in a small degradation in performance within our US data center and a handful of server errors being returned by the service. This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: June 13, 2023, 7:44 p.m.