Get notified about any outages, downtime or incidents for Cronofy and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Cronofy.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Cronofy:
Component | Status |
---|---|
API | Active |
Background Processing | Active |
Developer Dashboard | Active |
Scheduler | Active |
Conferencing Services | Active |
GoTo | Active |
Zoom | Active |
Major Calendar Providers | Active |
Apple | Active |
Google | Active |
Microsoft 365 | Active |
Outlook.com | Active |
View the latest incidents for Cronofy and check for official updates:
Description: Outlook.com request performance has remained in line with its usual levels for over an hour. At 07:54 UTC we saw a sharp increase in the number of connections to Outlook.com failing, which lasted until 09:30 UTC. This may have caused some changes to take longer to sync. Due to recent issues with Microsoft platforms, we monitored for an extended period before resolving this issue. We will continue monitoring, but we believe the issue has been resolved.
Status: Resolved
Impact: Minor | Started At: June 6, 2023, 8:47 a.m.
Description: From 19:22 to 22:27 UTC, calendar sync operations to Microsoft platforms were degraded. During this time, around 60% of operations between Cronofy and Microsoft resulted in failure. We took steps to mitigate the effects of this on other platforms and began monitoring the issue, which has now been resolved. This was a recurrence of an earlier issue, https://status.cronofy.com/incidents/5ftvqz5tp40d.
Status: Resolved
Impact: Minor | Started At: June 5, 2023, 7:42 p.m.
Description: From 14:13 to 15:47 UTC, calendar sync operations to Microsoft platforms were degraded. During this time, around 69% of operations between Cronofy and Microsoft resulted in failure, leading to a backlog when processing requests. We took steps to mitigate the effects of this on other platforms and began monitoring the issue, which has now been resolved. We believe this is related to Microsoft alert MO571683 (https://portal.office.com/adminportal/home?#/servicehealth/:/alerts/MO571683).
Status: Resolved
Impact: Minor | Started At: June 5, 2023, 2:31 p.m.
Description: Apple calendar synchronization has returned to normal operation and credentials have been reinstated where the invalidation was related to the incident.
Status: Resolved
Impact: Minor | Started At: May 11, 2023, 7:46 a.m.
Description: On Wednesday April 12th 2023, between 03:16 and 05:35 (2 hours 19 minutes), Cronofy experienced an issue causing 17% of the API traffic to our US environments to fail, returning an HTTP 500 error to the API caller. The underlying cause was that our system failed to heal following a database failover. An incorrectly configured alarm and a gap in our playbook resulted in the issue lasting longer than it should have. In line with our [principles](https://www.cronofy.com/about#our-principles), we are publishing a public post-mortem to describe what happened, why it impacted users, and what we will do to prevent it from happening in the future.

## Timeline

_Times are from Wednesday April 12th 2023, in UTC and rounded for clarity._

At 03:16 the primary node that writes to the database cluster failed. The secondary node was promoted and the cluster failed over automatically, recovering as planned. This incident alerted us via our pager service, triggering an investigation, which started at 03:20.

After the database failover, all visible metrics pointed towards a successful automated resolution of the issue. Our core metrics looked healthy, so we marked the incident as recovered without manual intervention at 03:30. It is normal for some of our incidents to be resolved in this way; the on-call engineer makes a decision based on available information as to whether an incident room is necessary to handle an alarm.

When a failover occurs, the original failed node is rebooted and then becomes the read-only partner in the cluster, whilst the existing read-only secondary node is promoted to primary. Some of the connections from the web application nodes were still connected to the read-only secondary database node, but were treating it as if they were connected to the primary writable node. This led to failures in some of the actions that were taking place in the API.

At this time, our monitoring system was not alerting: although the metric measuring HTTP 500 responses was reporting correctly, the alarm was misconfigured. When the alarm received no data, this was treated the same as 0 errors. This resulted in no further alarms to alert us to the degradation of the service, and reinforced the belief that the service was healthy again.

At 05:00 an automated notification was posted into our internal Slack monitoring channel to show that two health metrics had not fully recovered. This wasn't an alarm-level notification, so it did not re-trigger the on-call pager service.

At 05:30 an engineer reviewed the notification in Slack and inspected the health metrics. The increased level of HTTP 500 errors being returned by the Cronofy API was identified. The incident was reopened, and investigation restarted. Our site reliability team was reactivated to triage the issue.

At 05:35 Cronofy took the decision to replace all application instances in the US environment. An automated redeployment was triggered. This reset all of the nodes' connections to the database cluster, flushing out the connections to the read-only node, and returned API error responses to normal levels by 05:37.

## Retrospective

We ask three primary questions in our retrospective:

* Could we have resolved it sooner?
* Could we have identified it sooner?
* Could we have prevented it?

The root cause for this issue is that our systems did not correctly self-heal when a database failover occurred.
Although the database failover was the event that triggered the failure, it is a rare, expected, and unavoidable event. The database cluster correctly recovered and was back in an operating mode within a few minutes.

Another significant factor in the severity of this issue's impact was the robustness of Cronofy's response to it. This was the first occurrence of a database failover happening outside of core working hours. During working hours, multiple engineers would see the error and respond to it, each checking a different set of metrics or systems. This collaborative approach was not correctly translated into the guidance available in the incident response playbook, resulting in an incomplete set of checks taking place once the database failover had completed.

This could definitely have been resolved sooner, by forcing the nodes to reconnect to the database. This should trigger automatically in cases of database failover, and there should also be enough information available to the issue response team to know that a wider issue is still ongoing. The failure to identify that the impact of the issue had not been fully resolved is what prolonged the incident, and this is linked to several factors:

* Confirmation bias from errors trending downwards, and the early-hours wake-up, caused the on-call engineer to miss the elevated error rate.
* A misconfigured alarm added to the confirmation bias by not making it clear that the issue was ongoing.
* A reliance on tacit group knowledge instead of explicit documented steps for database failovers meant that the on-call engineer didn't know the additional validation and checks that would have identified the issue sooner.

## Actions to be taken

We are disappointed with the length of time that it took us to resolve this incident. There were multiple smaller failures that led to this incident having a higher impact. A few different checks, technical changes, or alarms being raised could have mitigated this error and prevented the longer outage.

* We are improving the way our applications self-heal in the event of a database failover. Some error messages occurred indicating that the system was trying to write to the wrong part of the database cluster, but these messages did not cause the system to reset (see the connection-reset sketch after this incident record).
* We will update our guidance for our internal teams on the actions to take when a database failover occurs. This will include a more precise checklist and specific metrics to review.
* We are going through all our alarms to ensure that they are looking at the right data, at the right time (see the alarm-configuration sketch after this incident record).

## Further questions?

If you have any further questions, please contact us at [[email protected]](mailto:[email protected])
Status: Postmortem
Impact: Major | Started At: April 12, 2023, 3 a.m.
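The post-mortem's first action item is to make applications reset their database connections when writes land on a node that has been demoted to read-only after a failover. The report does not describe Cronofy's actual stack or implementation, so the following is only a minimal sketch of that pattern, assuming a PostgreSQL-style cluster and the psycopg2 driver; the DSN, class name, and one-retry policy are illustrative.

```python
# Minimal sketch: self-heal when a write hits a demoted, read-only node.
# Assumptions (not from the incident report): PostgreSQL-style cluster behind
# a single writer hostname, psycopg2 driver. Names below are hypothetical.
import psycopg2
from psycopg2 import errors

WRITER_DSN = "host=db-writer.internal dbname=app user=app"  # hypothetical DSN


class SelfHealingWriter:
    """Holds a connection to the writer and rebuilds it if a write
    lands on a node that has been demoted to read-only."""

    def __init__(self, dsn: str):
        self._dsn = dsn
        self._conn = psycopg2.connect(dsn)

    def _reconnect(self):
        # Discard the stale connection so the next connect resolves the
        # new primary instead of reusing a socket to the demoted node.
        try:
            self._conn.close()
        finally:
            self._conn = psycopg2.connect(self._dsn)

    def execute_write(self, sql: str, params=()):
        for attempt in (1, 2):
            try:
                with self._conn:  # commit on success, roll back on error
                    with self._conn.cursor() as cur:
                        cur.execute(sql, params)
                return
            except errors.ReadOnlySqlTransaction:
                # The "writing to the wrong part of the cluster" signal from
                # the post-mortem: reset the connection instead of failing.
                if attempt == 2:
                    raise
                self._reconnect()
            except psycopg2.OperationalError:
                # Connection dropped mid-failover: rebuild and retry once.
                if attempt == 2:
                    raise
                self._reconnect()
```

The key design point matches the incident narrative: the redeployment at 05:35 fixed things precisely because it flushed stale connections, so doing the same flush automatically on the first read-only write error removes the need for human intervention.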
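The timeline also notes that the alarm received no data and treated it the same as 0 errors. The monitoring system is not named in the report; the sketch below shows one way this misconfiguration can arise in a CloudWatch-style alarm, where the TreatMissingData setting decides whether gaps in a metric look like a healthy zero or a breach. All names and thresholds are hypothetical.

```python
# Minimal sketch: configure a CloudWatch-style alarm so that missing data is
# surfaced rather than silently treated as "no errors".
# Assumption (not from the incident report): an HTTP 500 count metric in AWS
# CloudWatch; Cronofy's actual monitoring stack is not described.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="api-http-500-rate",   # hypothetical alarm name
    Namespace="App/API",             # hypothetical custom namespace
    MetricName="Http500Count",       # hypothetical metric name
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    # The crux: with "notBreaching", gaps in the metric are treated as being
    # within the threshold (effectively 0 errors), so the alarm stays green
    # even while the service degrades. "breaching" (or "missing") surfaces
    # the gap instead of hiding it.
    TreatMissingData="breaching",
    AlarmActions=[],                 # pager/notification target ARNs go here
)
```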
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or down time. Join for free - no credit card required.