Outage and incident data over the last 30 days for Cronofy.
OutLogger tracks the status of these components for Cronofy:
Component | Status |
---|---|
API | Active |
Background Processing | Active |
Developer Dashboard | Active |
Scheduler | Active |
Conferencing Services | Active |
GoTo | Active |
Zoom | Active |
Major Calendar Providers | Active |
Apple | Active |
Google | Active |
Microsoft 365 | Active |
Outlook.com | Active |
View the latest incidents for Cronofy and check for official updates:
Description: From 04:44 to 04:47 UTC our US data center may have been unreachable. This was caused by a short-lived database issue which was resolved without intervention.
Status: Resolved
Impact: Major | Started At: Dec. 14, 2022, 5:04 a.m.
Description: Our German data center was unreachable for two periods: 17:12-17:15 UTC, and for around a minute at 17:37 UTC. In terms of symptoms, this was very similar to what we saw in our US data center in August (https://status.cronofy.com/incidents/32fc8mjcr1zw). We have applied the changes developed to alleviate that issue to our German data center and it has been stable since.
Status: Resolved
Impact: Major | Started At: Nov. 29, 2022, 5:46 p.m.
Description: From 22:15 to 22:18 UTC Cronofy's German data center may have been inaccessible for API and web traffic. This was due to an underlying database issue which has since cleared.
Status: Resolved
Impact: Major | Started At: Nov. 23, 2022, 10 p.m.
Description: From 09:50 UTC on Wednesday October 19th 2022 through to 20:20 UTC on Wednesday October 26th 2022 (7.5 days) Cronofy had a bug which meant that users going through the OAuth flow for Outlook.com calendars were being incorrectly associated with accounts within Cronofy.

This led to data being shared incorrectly due to the misidentification of Outlook.com accounts and the resulting API authorizations pointing towards the misidentified account rather than separate accounts.

_Microsoft 365 and on-premise Exchange calendars were unaffected. Only calendars from Outlook.com, Microsoft’s more consumer-orientated offering formerly known as Hotmail and Live.com over the years, were affected._

As a data processor, we contacted API integrators with users impacted by this issue on Thursday October 27th 2022.

In the interests of transparency, in line with [Cronofy's principles](https://www.cronofy.com/about#our-principles), we are publishing a public postmortem.

## Timeline and background

_Times are from October 2022, in UTC, and rounded for clarity_

At 09:50 on Wednesday 19th a change was deployed in support of work to move from using Microsoft's Outlook.com-specific API to using Microsoft's Graph API for Outlook.com accounts.

This change inadvertently altered the shape of the response we receive from Microsoft at the end of an OAuth authorization process, which meant Cronofy was not extracting Microsoft's unique identifier for the account correctly, instead getting a null value from the process. This broke assumptions made about identity by the rest of the system, which led to the described behavior.

When receiving a result from an OAuth authorization flow, we receive several values, the key one being a unique identifier for the account, alongside an email address and the OAuth tokens. This may relate to a calendar account already within Cronofy, so we look up first by the unique identifier for the provider, then secondarily attempt a match by email.
The incorrect extraction of a null value as the unique identifier for Outlook.com accounts broke an implicit contract that other parts of Cronofy’s system relied upon.

This meant that the first person experiencing this bug either resolved via email address or created a new entry within Cronofy, either of which resulted in a record tied to the provider with a unique identifier of a null value. As any user passing through the flow would have a null value for this field due to the bug, every subsequent passage through the OAuth flow would resolve to this one record relating to a single Outlook.com account.

Subsequently, processes downstream behaved as if the user's identity had been correctly verified, leading to authorizations pointing to accounts unexpectedly.

Access exposure was limited to a single Outlook.com account in each data center, but with multiple integrators having access to it. Updates to and from this calendar account were not successful after the second user resolved to the account due to safeguards in place relating to the underlying calendar IDs changing completely. This minimized the inadvertent exposure of data.

With hindsight, we have identified a support ticket received at 17:50 on Wednesday 19th which likely related to this bug. At the time it looked like a common issue encountered by developers when integrating and so did not trigger further action. Aside from this, we have only been able to identify the support ticket which triggered our response.

That ticket was received a week later at 14:15 on Wednesday 26th. After requesting and receiving some example accounts to investigate the described problem, we noticed something looked very odd and the alarm was raised internally at 19:50.

By 20:20 we had prevented a null value unique identifier from ever being used for matching, preventing the growth of the issue.
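
The lookup order and the null guard described above can be sketched as follows. This is a minimal illustration, not Cronofy's code: `AccountStore`, `resolve`, `MissingProviderUid`, and the provider key `outlook_com` are hypothetical names, and in-memory dictionaries stand in for whatever storage the platform actually uses.

```python
from dataclasses import dataclass, field
from typing import Optional


class MissingProviderUid(ValueError):
    """Raised when the provider did not supply a usable account identifier."""


@dataclass
class Account:
    provider: str
    provider_uid: str
    email: str


@dataclass
class AccountStore:
    # Hypothetical in-memory indexes keyed by (provider, value).
    by_uid: dict = field(default_factory=dict)
    by_email: dict = field(default_factory=dict)

    def resolve(self, provider: str, provider_uid: Optional[str], email: str) -> Account:
        # Guard: fail fast instead of matching on a missing identifier.
        # Without this check, every flow that yields None would resolve
        # to the same (provider, None) record.
        if not provider_uid:
            raise MissingProviderUid(f"no unique identifier for {provider} account")

        email = email.lower()
        # 1. Primary lookup by the provider's unique identifier.
        account = self.by_uid.get((provider, provider_uid))
        # 2. Secondary lookup by email address.
        if account is None:
            account = self.by_email.get((provider, email))
        # 3. Otherwise create a new record.
        if account is None:
            account = Account(provider, provider_uid, email)
            self.by_email[(provider, email)] = account
        # Tie the record to the identifier for future lookups.
        self.by_uid[(provider, provider_uid)] = account
        return account


store = AccountStore()
alice = store.resolve("outlook_com", "uid-a", "alice@example.com")
bob = store.resolve("outlook_com", "uid-b", "bob@example.com")
```

With the guard in place, a response that yields no identifier raises an error at the point of extraction rather than being stored and later matched by other users.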
By 20:55 we had also reverted the change introduced the previous Wednesday to be entirely certain the scope of the issue would not grow.

With the problem contained, we decided the best course of action would be to revoke all authorizations that had resulted from this behavior. Our view was that removing some legitimately received access was better than risking leaving any illegitimate access active, especially as users would be able to reinstate access as necessary.

Work continued along this path, at first generating reports for manual verification of the intended actions, followed by taking the actions required. All identified API integrator authorizations were revoked, any potential user sessions invalidated, and the relevant Outlook.com account was deleted by 05:00 on Thursday 27th.

Work continued on Thursday 27th to identify API integrators we needed to inform, along with an idea of the number of users affected to help inform their response. Those notices were sent between 17:00 and 21:30 on Thursday 27th.

Throughout the following days, we worked to produce more exhaustive reports for each customer by reconciling a number of data sources. These have already been distributed to API integrators that requested them.

## Opportunities for improvement

On Thursday November 3rd we held an internal retrospective relating to this incident.

Whilst it was disappointing for the bad change to be deployed, it was a subtle problem that was difficult to pick up in both development and review. It is in an area where it is difficult to automate tests as it is dependent on external input, the result of a user going through an OAuth journey.
It also relied on testing including both of:

* Multiple Outlook.com calendar accounts being present; most local test environments only have a single calendar from each provider
* Multiple passes through the OAuth process with different accounts; most manual testing will happen once, and generally against the same account

This combination was not part of standard testing practices, especially for what looked like a simple change. It also relies on manual actions which, with the best will in the world, cannot be trusted to be performed.

Moving from the specifics of the bad change, we looked at more holistic issues.

The acceptance of null as a value for Outlook.com identity was what led to the misclassification happening. This was prevented from being possible at a lower level during the handling of the incident.

A null value is never expected in this situation, and we are making modifications to the Cronofy platform to assert this fact at different layers, so that a mistake in a single location is not all that is needed to bypass this assumption. This will mean that a similar regression in future will "fail fast" rather than silently continuing as happened in this case.

It is always disappointing when we find out about issues from our customers, especially one as severe as this. We looked at other signals we may have been able to alert on based upon the behavior observed during the incident.

Whilst there were no errors being raised, there are metrics, such as the number of times we believe a calendar account has completed an OAuth flow within a given period, that would have stood out here. We will be doing a further investigation into such signals to understand where we may introduce things such as alerts, soft limits, and hard limits to reduce the impact of similar problems in future.

Cronofy's event sourced architecture made it reasonably straightforward to review the history of the system and undo what had been done as a result.
However, due to the nature of the issue, it took several days to build a clear enough picture in order to generate a PII-containing report to share with API integrators without the risk of sharing PII we should not. We're expanding our telemetry and reporting around OAuth flows to make such reconciliation more straightforward in future.

Communicating with API integrators affected by the incident was a difficult, mostly manual process. This introduces the possibility of errors and delay, neither of which is desirable in the process of handling an incident. We are going to bring forward work planned to improve this process for service-related messages so we can send them directly from the Cronofy platform.

To summarize the actions we are taking:

* We are deepening our checks relating to identity across all providers, not just Outlook.com, including, but not limited to, manual testing playbooks and code-level assertions
* We will investigate detecting, and potentially preventing, behavioral anomalies relating to identity and authorization
* We are enhancing our telemetry and reporting around identity and authorization processes
* We will implement a new mechanism for sending service messages to customers

## Further information

If you are an affected API integrator and wish to obtain a copy of your report of impacted users, please get in touch via [[email protected]](mailto:[email protected]) before Thursday December 1st 2022. As these reports contain PII we can only retain them for a short period and so will be deleting them after this date.

As ever, please contact us at [[email protected]](mailto:[email protected]) if you have any further questions.
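
One of the signals mentioned in the retrospective, the number of OAuth completions attributed to a single calendar account within a given period, could be monitored along these lines. This is only a sketch under stated assumptions: the class name `OAuthCompletionMonitor`, the window, and both limits are hypothetical, and it assumes completions are recorded in non-decreasing time order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class OAuthCompletionMonitor:
    """Counts OAuth completions per account inside a sliding time window."""

    def __init__(self, window: timedelta, soft_limit: int, hard_limit: int):
        self.window = window
        self.soft_limit = soft_limit  # above this, raise an alert
        self.hard_limit = hard_limit  # above this, block the flow
        self._events = defaultdict(deque)  # account_id -> completion times

    def record(self, account_id: str, at: datetime) -> str:
        events = self._events[account_id]
        events.append(at)
        # Drop completions that have fallen out of the window.
        while events and at - events[0] > self.window:
            events.popleft()
        count = len(events)
        if count > self.hard_limit:
            return "block"
        if count > self.soft_limit:
            return "alert"
        return "ok"


# Hypothetical thresholds: more than 2 completions per hour alerts,
# more than 4 blocks.
monitor = OAuthCompletionMonitor(timedelta(hours=1), soft_limit=2, hard_limit=4)
start = datetime(2022, 10, 19, 9, 50)
results = [monitor.record("account-1", start + timedelta(minutes=i)) for i in range(5)]
```

A single account completing many OAuth flows in a short period, as happened when every user resolved to one record, would cross the soft limit quickly even though no errors were being raised.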
Status: Postmortem
Impact: Major | Started At: Oct. 26, 2022, 7 p.m.
Description: Late Thursday 29th September we received the first report of Microsoft Defender SmartScreen within Microsoft's Edge browser flagging our US OAuth flow endpoint (https://app.cronofy.com/oauth/authorize) as being an unsafe site.

On Friday 30th September this was flagged to our engineering team, who were able to reproduce the issue, submitted a dispute to Microsoft as the site owner, and opened this incident.

Though we obviously believed this to be an incorrect classification, we investigated why we may have been flagged in the first place whilst we awaited a response from Microsoft. During this investigation we identified an application in development mode which may have been used as part of a phishing scam. Our guess is that they were using Cronofy's domain as a trustworthy starting point but redirecting on to an untrustworthy redirect URI after the user had granted access to their calendar.

For applications in development mode we allow any redirect URI to be used to ease development, but display a warning to users that the application is not verified. It seems that users were ignoring this warning and proceeding to go through our OAuth flow to connect their calendar before being redirected on to a site posing as a financial service.

We disabled the specific application and made our warning that an application is in development mode much more prominent to discourage the use of development mode applications in this way, including ensuring the warning was translated for all the locales the page supports.

We had yet to hear from Microsoft, but we updated our ticket with Microsoft to let them know our findings and the actions taken.

At this point we were waiting on Microsoft to process our case. We did not wish to make changes that could be seen as attempting to bypass this protective mechanism as that is what a nefarious actor would do, potentially leading to the entire domain being flagged.
Instead we waited, going through the proper process to get the classification corrected.

We discussed potential actions to circumvent the block in case we were left with no choice, so as to give our integrators an option that would not require their users to work around a browser warning that should be legitimate the vast majority of the time.

After a week of waiting we submitted a second case to Microsoft in case the first was somehow lost.

Yesterday, Wednesday 12th October, we resorted to reaching out to people on social media and managed to get the attention of someone on the Microsoft Edge team who was able to get our case actioned, and the flag was removed.

Our US OAuth flow endpoint has not been flagged for over 12 hours now so we consider this incident resolved.

We are in contact with Microsoft to better understand why we were flagged in the first place to prevent similar incidents, and how we might get to a faster resolution if it happens again.

Finally, thank you to everyone who helped us by submitting a report that our site had been flagged incorrectly.
Status: Resolved
Impact: None | Started At: Sept. 30, 2022, 10:25 a.m.