Get notified about any outages, downtime or incidents for Dyspatch and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Dyspatch.
OutLogger tracks the status of these components for Dyspatch:
Component | Status |
---|---|
API | Active |
Dashboard | Active |
Device Previews | Active |
Preview Emails | Active |
SparkPost Transmissions API - USA | Active |
View the latest incidents for Dyspatch and check for official updates:
Description: This incident has been resolved.
Status: Resolved
Impact: None | Started At: Nov. 14, 2024, 3:38 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: Sept. 11, 2024, 8:40 p.m.
Description:

# **Post Mortem** - **April 8 2024 Dyspatch Outage**

## **Intro**

On April 8, 2024, Dyspatch was unavailable between the hours of 12:30 PM and 01:00 AM Pacific time due to an issue that occurred during a routine upgrade of Dyspatch's infrastructure. This post mortem aims to analyze the root causes of the outage, assess its impact on our services, and outline steps Dyspatch is taking to prevent similar incidents in the future.

## **Timeline (Pacific Time)**

- **11:35** - We begin the upgrade.
- **12:10** - The production cluster intermittently returns 503s for users. Dyspatch's services cannot communicate with each other.
- **12:17** - We attempt to roll back the changes.
- **12:30** - We identify the problem: the internal authentication mechanism our services use to communicate securely is out of sync across services.
- **12:30 - 17:30** - We try several strategies to bring production online.
- **17:30** - To avoid further impact to our production environment, work begins on our staging environment.
- **18:17** - We identify that previous changes were made to our staging environment without getting applied to our production environment.
- **21:16** - Staging is online. We begin applying the changes from our staging environment to our production environment.
- **00:56** - Dyspatch is available again.

## **Why did this happen? What did we learn?**

During the outage we ran into several challenges trying to restore service. We discovered that a previous update to a critical component of our infrastructure was applied only to our staging environment. It was quickly determined that the issue was an authentication misalignment between Dyspatch's services, which meant that our various services could not communicate with each other. We learned that we did not have a way to generate new credentials without taking the services that manage our cluster offline.

After we determined that critical services had to be taken offline, we switched to testing on our staging environment to prevent data loss in our production environment. Ultimately, a difference between our production and staging environments had knock-on effects on our ability to roll back and recover quickly.

## **What are we doing about it?**

There are several actions we intend to take to prevent similar issues from happening:

1. We immediately aligned our staging and production environments to ensure that any infrastructure testing done in staging will be the same when applied to our production environment. The root cause of this outage came from a difference in environments and this ensures that we can be confident when testing required infrastructure changes.
2. We plan to invest in tooling to help us automatically catch and audit any drift between our environments. Catching the difference beforehand would have prevented this incident.
3. We are investing in tooling and processes to help us rebuild our cluster more reliably and quickly. We had to spend time migrating changes from our staging environment to our production environment when trying to restore Dyspatch.

## **Summary**

Finally, we want to apologize. We know Dyspatch is important for supporting our customers' communications. Your patience and support mean a great deal to us and we appreciate everyone who reached out to our team. As with any operational issue, we will spend time in the coming days and weeks to understand the details of the event and make the improvements mentioned above to our infrastructure and processes.
Status: Postmortem
Impact: Major | Started At: April 8, 2024, 7:18 p.m.
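The post mortem above names environment drift as the root cause and mentions plans for tooling to catch it. The following is only a minimal sketch of what such a check could look like, not Dyspatch's actual tooling: the `find_drift` function, the `MISSING` marker, and the component names and versions are all hypothetical and invented for illustration.

```python
from typing import Any

# Marker used when a key exists in only one environment (hypothetical convention).
MISSING = "<missing>"


def find_drift(
    staging: dict[str, Any], production: dict[str, Any]
) -> dict[str, tuple[Any, Any]]:
    """Return every key whose value differs between the two environments.

    Each entry maps the key to a (staging_value, production_value) pair.
    A key absent from one side is reported with MISSING on that side.
    """
    drift: dict[str, tuple[Any, Any]] = {}
    # Walk the union of keys so one-sided entries are not overlooked.
    for key in sorted(staging.keys() | production.keys()):
        s_val = staging.get(key, MISSING)
        p_val = production.get(key, MISSING)
        if s_val != p_val:
            drift[key] = (s_val, p_val)
    return drift


# Hypothetical component versions, invented for illustration only.
staging = {"auth-service": "2.4.0", "gateway": "1.9.2", "scheduler": "3.1.0"}
production = {"auth-service": "2.3.1", "gateway": "1.9.2"}

for name, (s_val, p_val) in find_drift(staging, production).items():
    print(f"{name}: staging={s_val} production={p_val}")
```

A check of this kind would typically run before an upgrade and block it when the result is non-empty, so a change applied to only one environment is surfaced before it can affect a rollback.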
Description: Mobile previews in the email builder were not working. Desktop previews and email editing in general were unaffected. A fix has been implemented and deployed.
Status: Resolved
Impact: None | Started At: Nov. 30, 2023, 9:30 a.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: Nov. 24, 2023, 3:28 a.m.