Get notified about any outages, downtime or incidents for Cronofy and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Cronofy.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Cronofy:
Component | Status |
---|---|
API | Active |
Background Processing | Active |
Developer Dashboard | Active |
Scheduler | Active |
Conferencing Services | Active |
GoTo | Active |
Zoom | Active |
Major Calendar Providers | Active |
Apple | Active |
Active | |
Microsoft 365 | Active |
Outlook.com | Active |
View the latest incidents for Cronofy and check for official updates:
Description: From 16:24 UTC we saw our attempts to communicate with Apple calendars fail almost entirely. This was part of a larger issue with all Apple's services. Apple services started showing signs of recovery from 17:15 UTC. We gradually increased the level of service for Apple calendars from this time onwards, returning to usual levels around 18:00 UTC. The side effects of this incident were more significant than we would like in our US data center where 95% of our Apple calendar connections reside. We will be reviewing this and refining behavior to reduce such side effects for similar incidents in the future.
Status: Resolved
Impact: Critical | Started At: March 21, 2022, 4:39 p.m.
Description: AWS have closed their incident for the underlying issue with SQS and other services. AWS's SQS service appeared to be unavailable from around 20:47 UTC through to 20:57 UTC in our US data center, hosted in us-east-1. As SQS is our primary messaging queue between parts of the Cronofy platform, many operations will have been severely degraded during this period. We are confident that service has returned to normal as our own metrics have returned to normal levels.
Status: Resolved
Impact: Major | Started At: March 9, 2022, 8:55 p.m.
Description: All Outlook.com calendars experienced a major loss of functionality for at least 40 hours. During this period we were operating purely from our cache of their schedule.

_As Microsoft’s product naming can be confusing, this only affected Outlook.com, Microsoft’s more consumer-orientated offering formerly known as Hotmail and Live.com over the years. Microsoft 365 and on-premise Exchange were unaffected._

## Timeline

On Tuesday 1st March at around 23:00 UTC it appears that Microsoft made a change to their infrastructure which meant all our requests to interact with Outlook.com calendars began to fail.

By Thursday 3rd March at around 15:00 UTC, approximately 40 hours later, we managed to restore service for roughly 90% of Outlook.com calendars.

Without any success from efforts to communicate with Microsoft, on Friday 4th March around 09:30 UTC we decided to take more drastic action to give the remaining 10% of Outlook.com calendar users a route to restoring their service by implementing a new mechanism for authorizing Cronofy’s access to their calendar. This was made available around 16:30 UTC the same day, Friday 4th March.

By Friday 4th March 23:00 UTC, the remaining 10% of Outlook.com calendars had received a notification that they needed to reauthorize Cronofy’s access to their calendar.

## Investigation to resolution

By far the most disappointing part of this incident was how long it took us to notice there was an issue. With hindsight, we had received informational severity alerts shortly after 23:00 UTC on Tuesday when the issue started, but these were missed by the team.

For background, at Cronofy we have three levels of alert:

1. Informational
2. Review soon
3. Look now

Informational alerts are delivered to a Slack channel and can cover things not needing any direct attention. These can come from an area we are interested in keeping a closer eye on, or be early signs of a potential issue.
The next level is "review soon". These go into PagerDuty as a low severity alert that is assigned to an on-call engineer, generally for review the next working day. The highest level is "look now", where an on-call engineer is paged regardless of the time of day to investigate. Often the idea of informational and review soon alerts is to provide more color around the impact of a "look now" alert, which may be triggered by a single metric.

It took until our support team received a couple of support tickets on Thursday morning \(UK time\) relating to Outlook.com calendars and flagged them to our engineering team for us to realize the extent of the problem. This was roughly 36 hours after the start of the issue. Once the extent of the issue had been recognized, this public facing incident was opened.

We quickly identified that we were consistently receiving 503 Service Unavailable responses from Microsoft. This response code is usually indicative of a temporary issue on the service provider's side which we just have to wait out. However, as we had been seeing this for over 36 hours at this point, we worked on the assumption that there was something under our control that could resolve the issue. Therefore we started experimenting with alterations to our integration that might help, whilst attempting to reach someone at Microsoft who might be able to resolve the underlying issue.

Various sets of changes were attempted without success until, when reviewing Microsoft's API documentation, we found mention of an optional header that we could add to our requests, specifically `x-AnchorMailbox`. This seemed promising because 503 statuses are often returned by load balancers or firewalls responsible for routing requests to the correct place, and headers like `x-AnchorMailbox` often help load balancers or firewalls route requests to the correct location more easily.
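
To illustrate the kind of change described above, here is a minimal sketch of attaching an `x-AnchorMailbox` header to outgoing requests. It is not Cronofy's code: the function name, its parameters, and the choice to prefer a mailbox ID over the email address when one is known are assumptions made for illustration, and the exact value format Microsoft expects for a mailbox ID is not stated in this postmortem.

```python
from typing import Dict, Optional


def build_outlook_headers(
    access_token: str,
    email: str,
    mailbox_id: Optional[str] = None,
) -> Dict[str, str]:
    """Build request headers that include an x-AnchorMailbox routing hint.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Assumption: use the mailbox ID when it is known, otherwise fall back
    # to the account's email address as the anchor value.
    anchor = mailbox_id if mailbox_id else email
    return {
        "Authorization": f"Bearer {access_token}",
        "Accept": "application/json",
        "x-AnchorMailbox": anchor,
    }


if __name__ == "__main__":
    # Email-only anchor, as in the first fix described in the postmortem.
    print(build_outlook_headers("example-token", "user@example.com"))
```

Keeping the anchor choice in one place means that a later switch from email address to mailbox ID only touches this helper rather than every call site.
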
Adding this header, using the account's email address, brought the sync of a large number of Outlook.com calendars back to life at around 15:00 UTC on Thursday. We were premature in announcing this had resolved the issue for all Outlook.com calendars; it was closer to 90% of them.

Further efforts were made to resolve the problem for the remaining 10% of Outlook.com calendars but none bore fruit. We were able to identify that a large majority of the calendars still experiencing issues, though not all, were using a custom domain for their account. Our theory was that we needed to provide the ID of the mailbox for the `x-AnchorMailbox` header due to the presence of custom domains, but this ID was not available through any of the endpoints already at our disposal from the authentication tokens we had for these users.

At this point we were into the evening for the team and chose to pause our experimentation and regroup in the morning. We were at a crossroads, facing the need for some drastic intervention, and we did not want to take that decision lightly. Therefore, we chose to continue trying to get a resolution from Microsoft overnight before making the call. Our integration for Outlook.com calendars had been unchanged for a long period prior to Tuesday, so we were optimistic something could be reverted on their side to fix the remaining 10% without the need for drastic action on our part.

Come Friday morning, 09:00 UTC, we had not had a resolution from Microsoft and the remaining 10% of Outlook.com calendars were still unable to synchronize their schedules. Therefore we defined and began to execute a contingency plan to replace our authorization mechanism for Outlook.com calendars. This was ready to go around 15:30 UTC, at which point we made the call to move forward with the switch. The change was deployed around 16:00 UTC and enabled at 16:15 UTC.
Around 15 minutes later we deployed a further change that would start sending the remaining 10% of Outlook.com calendars that were still experiencing issues through our relinking process. This would give us a fresh set of credentials via the new mechanism, which provided us with the ID of the mailbox and not just the email address, and we expected this would resolve the issue for these remaining Outlook.com calendars.

Roughly 15 minutes later we saw someone from that cohort reconnect their Outlook.com calendar and the synchronization with their calendar become healthy again, validating our theory. We continued to monitor and saw further successes, building our confidence that all people with Outlook.com calendars now had a route to a successful synchronization link, albeit after their intervention in some cases.

The following morning, after a review of the current status, we closed the incident.

## Opportunities for improvement

By far the most significant problem within this incident was the missing high severity alerting around Outlook.com calendars. This alerting has now been put in place. It was already in place for all the other calendar services we support; Outlook.com had unfortunately been missed.

A contributing factor to the length of time until we identified there was an incident was the timing of the informational alerts we did receive. Our engineering team is based in the UK and Europe, so by 23:00 UTC no-one is actively working, and the team instead skims the informational alerts posted overnight the following morning. This timing and process led to no-one spotting that the Outlook.com informational alerts did not have a corresponding closure message.

To this end, we are also looking more holistically at our alerting to avoid such things slipping through the cracks in future. Specifically we are looking at:

1. Refining informational severity alerts that have a tendency to briefly flicker, to reduce noise within alerts where no resolution is possible, e.g. a side effect of ephemeral network issues and the following retry succeeding.
2. Providing visibility of informational alerts that have been open for a significant period.

Both of these aim to reduce the possibility of similar alerts being missed by reducing the noise around them and increasing their signal over time. This will mean that unless alerting is entirely absent, which should never be the case, it is much less likely that an issue will go unnoticed for anywhere near as long.

We are comfortable that the time from identification to resolution of this incident was reasonable given the nature of the issue. Roughly 90% of Outlook.com calendars were successfully synchronizing within 4 hours of our investigation starting, with the remaining 10% of Outlook.com calendars being given a path to successful synchronization after we quickly turned around a major change the following day. Our deployment pipeline and tooling enabled us to investigate and experiment safely and rapidly towards the eventual solution to this issue.

Whilst we communicated clearly during the incident, we did not meet our internal guidance on how frequently we provided status updates. For example, we should have provided an update by 10:00 UTC on the Friday to make it clear we were still working on the incident, but we did not post an update until after 13:00 UTC, nearly 20 hours after the previous update. We will be updating our internal guidance around communication, with a focus on multi-day incidents.

If you have any further questions, please contact us at [[email protected]](mailto:[email protected]).
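
The second improvement listed in the postmortem, surfacing informational alerts that have stayed open for a significant period, could be sketched roughly as follows. This is an illustrative assumption rather than Cronofy's tooling: the `Alert` type, the severity strings, the alert names, and the four-hour threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Alert:
    name: str
    severity: str  # hypothetical values: "informational", "review_soon", "look_now"
    opened_at: datetime
    closed_at: Optional[datetime] = None  # None means no closure message was seen


def stale_informational_alerts(
    alerts: List[Alert],
    now: datetime,
    max_open: timedelta = timedelta(hours=4),  # assumed threshold
) -> List[Alert]:
    """Return informational alerts still open after at least `max_open`."""
    return [
        a
        for a in alerts
        if a.severity == "informational"
        and a.closed_at is None
        and now - a.opened_at >= max_open
    ]


if __name__ == "__main__":
    opened = datetime(2022, 3, 1, 23, 5, tzinfo=timezone.utc)
    alerts = [
        # Opened overnight and never closed: should be surfaced.
        Alert("outlook_com_sync_failures", "informational", opened),
        # A brief flicker that closed itself: should be ignored.
        Alert("network_blip", "informational", opened,
              closed_at=opened + timedelta(minutes=2)),
    ]
    morning = datetime(2022, 3, 2, 8, 0, tzinfo=timezone.utc)
    for alert in stale_informational_alerts(alerts, morning):
        print(f"Still open: {alert.name}")
```

Filtering on the missing closure message rather than on alert volume is what separates a persistent failure from a flickering one, which is the distinction the postmortem says was lost in the overnight noise.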
Status: Postmortem
Impact: Major | Started At: March 3, 2022, 12:14 p.m.