Outage and incident data over the last 30 days for Cronofy.
OutLogger tracks the status of these components for Cronofy:
Component | Status |
---|---|
API | Active |
Background Processing | Active |
Developer Dashboard | Active |
Scheduler | Active |
Conferencing Services | Active |
GoTo | Active |
Zoom | Active |
Major Calendar Providers | Active |
Apple | Active |
Google | Active |
Microsoft 365 | Active |
Outlook.com | Active |
View the latest incidents for Cronofy and check for official updates:
Description: From 18:23 to 18:28 UTC we saw reachability problems for our US data center. Symptomatically, this is extremely similar to the outage observed on Saturday 23rd July 2022, details of which can be found here: https://status.cronofy.com/incidents/32fc8mjcr1zw. Steps are already underway to address the suspected root cause.
Status: Resolved
Impact: Major | Started At: July 28, 2022, 6:30 p.m.
Description: On Saturday, 23rd July 2022, we experienced a 12-minute outage in our US data center between 17:29 and 17:41 UTC. During this time, our API at [api.cronofy.com](http://api.cronofy.com) and our web application at [app.cronofy.com](http://app.cronofy.com) were not reachable. Any requests made are likely to have failed to connect or received a 500-range status code rather than being handled successfully. Our web application hosts the developer dashboard, Scheduler, Real-Time Scheduling pages, and end-user authorization flows. Our background processing of jobs, such as calendar synchronization, was not affected.

Cronofy records all API calls into an API request table before processing them. The outage was triggered when the database locked this table. Without being able to write requests to the table, all API requests began to queue up and time out and, once the queue was full, to be rejected outright. This, in turn, caused our infrastructure to mark these servers as unhealthy and take them out of service.

We experienced a [very similar incident](https://status.cronofy.com/incidents/mz84qh5n29cq) in February 2021. Since that incident, we have [performed major version upgrades](https://status.cronofy.com/incidents/wzj1vnhj31zc) to our PostgreSQL clusters, and we had thought those upgrades had fixed this issue, as we had not had a recurrence for a long time. It is now clear that the major version upgrades have, unfortunately, not fixed this particular issue. To help prevent this issue from happening again, we will be making changes to how data is stored within our PostgreSQL cluster.

# Timeline

_All times UTC on Saturday, 23rd July 2022 and approximate for clarity_

**17:29** App and API requests began to fail.
**17:31** The on-call engineer is alerted to the app and API being unresponsive.
**17:35** Attempts to mitigate the issue are made, including launching more servers. These result in temporary improvements but do not fix the issue.
**17:37** The initial alerts clear as connectivity is temporarily restored by these mitigation attempts.
**17:38** New alerts are raised for the app and API being unresponsive.
**17:39** Incident channel created, and other engineers come online to help.
**17:41** This incident is created. While this is being done, telemetry shows that API and app requests are being processed again.
**17:52** Incident status is changed to monitoring and we continue to investigate the root cause.
**18:47** Incident status is resolved.

# Actions

The actions for this incident fall into two categories: what we can do straight away, and what we can do in the medium/long term.

## Short term

To improve the performance of database queries we use several indexes within our PostgreSQL clusters; these help to locate data quickly and efficiently. This locking issue always seems to occur when these indexes are being updated and the database gets into a state where it is waiting for some operations to resolve. Therefore, we are going to review which indexes are actively used and determine whether any can safely be removed or consolidated, as this will reduce the chance of the issue recurring by reducing the number of indexes that need updating.

We are also going to look at whether we can improve our alerts to help us identify the root cause of this type of issue faster and give our on-call engineers a clearer signal that this is the root cause. While we currently don’t have a way of resolving the issue directly (the database eventually resolves the locks), this will help us provide clearer messaging and faster investigations.

## Medium/long term

In the medium to long term, we will review the storage of API and app requests and determine whether PostgreSQL is the correct storage technology. This is likely to lead to re-architecting how we store some types of data to ensure our service is robust in the future.

## Further questions?

If you have any further questions, please contact us at [[email protected]](mailto:[email protected])
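To make the short-term action above concrete, here is a minimal sketch of how unused or rarely used indexes can be surfaced from PostgreSQL's built-in `pg_stat_user_indexes` statistics view. This is an illustration only, not Cronofy's actual tooling: the connection string, the `max_scans` threshold, and the choice of `psycopg2` are all assumptions.

```python
# Sketch: list indexes with few or no scans, using PostgreSQL's
# pg_stat_user_indexes statistics view. Candidates still need manual
# review (e.g. unique and primary-key indexes must be kept).
import psycopg2  # assumes the psycopg2 driver is installed

DSN = "postgresql://user:password@localhost:5432/dbname"  # placeholder

QUERY = """
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       idx_scan     AS scans,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan < %s
ORDER BY pg_relation_size(indexrelid) DESC;
"""

def rarely_used_indexes(max_scans: int = 50):
    """Return indexes scanned fewer than max_scans times since stats were reset."""
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(QUERY, (max_scans,))
        return cur.fetchall()

if __name__ == "__main__":
    for row in rarely_used_indexes():
        print(row)
```

The idea follows the reasoning in the postmortem: every index on a write-heavy table such as the API request table has to be maintained on each insert, so removing indexes that are never scanned reduces the index maintenance per write, which is the mechanism the postmortem identifies as reducing the chance of the locking issue recurring.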
Status: Postmortem
Impact: Major | Started At: July 23, 2022, 5:41 p.m.
Description: At approximately 22:16 UTC, we observed a much higher number of errors for Google calendar API calls than we would expect (mostly no data received for events page) in our German data center. The on-call engineer was alerted to this issue at 22:32 UTC. After investigating, we decided to open an incident about this at 22:49 UTC to inform customers of service degradation in our German data center. While opening the incident, we were alerted about the US data center also being impacted. We saw that around 10% of Google calendar API calls in our US data center were returning an error, and so the incident was updated at 22:56 UTC. Errors communicating with the Google calendar API returned to normal levels in both our German and US data centers at around 22:52 UTC. Errors have remained at normal levels since then, so we are resolving this incident. There does not appear to have been a pattern to the accounts affected by this.
Status: Resolved
Impact: Minor | Started At: July 21, 2022, 10:49 p.m.
Description: On Wednesday, 13th July 2022 we experienced up to 50 minutes of degraded performance in all of our data centers between 16:10 and 17:00 UTC. This was caused by an upgrade to our Kubernetes clusters (how the Cronofy platform is hosted) from version 1.20 to 1.21. This involved upgrading several components, one of which, CoreDNS, was the source of this incident. CoreDNS was being upgraded from version 1.8.3 to 1.8.4, as this is the AWS recommended version to use with Kubernetes 1.21 hosted on Amazon's Elastic Kubernetes Service. Upgrading these components is usually a zero-downtime operation and so was being performed during working hours. Reverting the update to components, including CoreDNS, resolved the issue.

This would have presented as interactions with the Cronofy platform and calendar synchronization operations taking longer than usual. For example, the 99th percentile of Cronofy API response times is usually around 0.5 seconds, while during the incident it increased to around 5 seconds. Calendar synchronization operations were delayed by up to 30 minutes during the incident.

Our investigations following the incident have identified that CoreDNS version 1.8.4 included a regression in behavior from 1.8.3 which caused the high level of errors within our clusters, leading to the performance degradation. We are improving our processes around such infrastructure changes to avoid similar incidents in future.

# Timeline

_All times UTC on Wednesday, 13th July 2022 and approximate for clarity_

**16:10** Upgrade of components including CoreDNS started across all data centers.
**16:15** Upgrade completed.
**16:16** First alert received relating to the US data center. Manual checks show that the application was responding.
**16:18** Second alert received for degraded background worker performance in CA and DE data centers. Investigations show that CPU utilization is high on all servers, in all Kubernetes clusters. Additional servers were provisioned automatically and then more added manually.
**16:19** Multiple alerts being received from all data centers.
**16:31** This incident was opened on our status page informing customers of the issue. We decided to roll back the component upgrade.
**16:45** As the components including CoreDNS were rolled back in each data center, errors dropped to normal levels and performance improved.
**16:47** Rollback completed. The backlog of background work was being processed.
**17:00** The backlog of background work was cleared.
**17:05** Incident status changed to monitoring.
**17:49** Incident closed.

# Actions

Although there wasn’t an outage, we certainly want to prevent this from happening again in the future. This led us to ask three questions:

1. Why was this not picked up in our test environment?
2. What could we have done to identify the root cause sooner?
3. How could the impact of the change be reduced?

## Why was this not picked up in our test environment?

Although this was tested in our test environment, the time between finishing the testing and deploying this to the production environments was too short. This meant that we missed that there was performance degradation introduced. We are going to review the test plan for such infrastructure changes in our test environment. This will include a soaking period, which will see us wait a set amount of time between implementing new changes in our test environment and rolling them out to the production environments.

## What could we have done to identify the root cause sooner?

Previous Kubernetes upgrades had been straightforward, which led to over-confidence. Multiple infrastructure components were changed at once, and so we were unable to easily identify which component was responsible. In future, we will split infrastructure component upgrades into multiple phases to help identify the cause of problems if they occur.

## How could the impact of the change be reduced?

As mentioned above, previous Kubernetes upgrades had been straightforward, which led to over-confidence. We rolled out the component updates, including CoreDNS, to all environments in a short amount of time, and it wasn’t until they had all been completed that we started to receive alerts. To prevent this from happening in the future for such changes we are going to have a phased rollout to our production environments. This will mean such an issue will only impact some environments rather than all of them, reducing the impact and aiding a faster resolution.

# Further questions?

If you have any further questions, please contact us at [[email protected]](mailto:[email protected])
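As a rough illustration of the kind of per-cluster check a phased rollout implies, the sketch below uses the official Kubernetes Python client to read the CoreDNS deployment image in each cluster, so the version can be confirmed in one environment (and left to soak) before the next is upgraded. This is not Cronofy's tooling; the kubeconfig context names are placeholders.

```python
# Sketch: report the CoreDNS image running in each cluster so a phased
# rollout can be verified one environment at a time.
from kubernetes import client, config  # assumes the `kubernetes` package is installed

# Placeholder kubeconfig context names, one per environment/data center.
CONTEXTS = ["test-cluster", "us-cluster", "de-cluster", "ca-cluster"]

def coredns_image(context: str) -> str:
    """Return the container image of the coredns deployment in kube-system."""
    config.load_kube_config(context=context)
    apps = client.AppsV1Api()
    deployment = apps.read_namespaced_deployment("coredns", "kube-system")
    return deployment.spec.template.spec.containers[0].image

if __name__ == "__main__":
    for ctx in CONTEXTS:
        print(f"{ctx}: {coredns_image(ctx)}")
```

Combined with the soaking period described above, a check like this helps keep a regression such as the CoreDNS 1.8.3 to 1.8.4 one confined to a single environment while it is being detected.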
Status: Postmortem
Impact: Minor | Started At: July 13, 2022, 4:31 p.m.