Last checked: 6 minutes ago
Get notified about any outages, downtime or incidents for Simon Data and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Simon Data.
OutLogger tracks the status of these components for Simon Data:
Component | Status |
---|---|
Simon Audience API | Active |
Simon Data Pipes | Active |
Simon Data Syncs | Active |
Simon Event Trigger Flows | Active |
Simon Web App | Active |
3rd Party Email | Active |
Iterable Global API | Active |
SendGrid API | Active |
3rd Party Push | Active |
Airship Mobile Messaging | Active |
AWS SNS | Active |
3rd Party SMS | Active |
Twilio SMS Delivery Notifications & Status Callbacks | Active |
Simon Mail | Active |
Simon Mail (message sending) | Active |
Simon Mail (subscription management) | Active |
View the latest incidents for Simon Data and check for official updates:
Description: This incident has been resolved and the Drag-and-Drop Editor is fully operational.
Status: Resolved
Impact: Minor | Started At: Aug. 26, 2022, 3:53 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: Aug. 5, 2022, 2:50 p.m.
Description:

# **Overview**

Last week on July 27th, Amazon Web Services (AWS) was performing routine maintenance on a piece of infrastructure that Simon uses to power segmentation. During the maintenance window, AWS' routine failed, which left parts of Simon's segmentation functionality temporarily unavailable. This affected several Simon Data platform interfaces as well as on-time delivery of downstream flows & journeys. The attached file documents the steps we took to react to and remediate the issue while in parallel escalating the emergency to AWS, who ultimately was able to identify the source of the issue and fix it. AWS acknowledged that this bug was introduced by AWS and that the remediation they implemented on our infrastructure is a permanent solution.

# **Executive Summary**

In the early morning EST of July 27th 2022, a maintenance routine on an Amazon-managed service that Simon uses to power segmentation left that service unavailable. Simon was immediately paged and our on-call support reacted. Unfortunately, the most direct and fastest-acting remediations were not possible due to persistent hardware failure from Amazon, requiring Simon to fall back on reconstructing data on new services. In the end, Amazon rectified the bug in their maintenance program, and the remediation took effect before Simon finished reconstructing data on new services, rendering that effort unnecessary.

No data or messages were lost, but both data refreshes and active or planned campaigns were delayed while Simon performed an emergency migration to a functional database. In addition, the core segmentation product, and the products that depend upon it (like unified contact view and selecting sample contacts in content), were unavailable until the migration completed.

# **Root Cause**

Simon Data splits the data our platform uses by kind into different databases, and keeps business logic in other databases. Each database comes equipped with multiple, redundant nodes. We routinely exercise node failover logic during upgrades, maintenance, and failures. Maintenance on active Amazon Web Services (AWS) databases that Simon Data uses is frequent, and unplanned outages are rare because maintenance processes with AWS are typically predictable.

On July 27th, 2022, AWS' maintenance routine introduced a subtle bug that prevented its maintenance from finishing. As a result, the segmentation database was being reported as still under maintenance. The Simon Data on-call team was alerted and immediately reacted; however, contacting our technical support at AWS took longer than expected, and the Simon on-call team received only one downstream page.

While the Simon team performed an emergency data reconstruction / migration to healthy infrastructure, the Simon team escalated through multiple teams at AWS. Despite the parallel escalations, it took longer than normal to have AWS view this incident as an emergency situation instead of an unhealthy situation. Once it was recognized as an emergency situation, AWS resolved it quickly, and this happened before Simon's internal reconstruction / migration completed.

AWS has provided their post mortem to Simon Data: "_temporary tables were inadvertently created [by AWS] during a maintenance period that caused a corruption of metadata and prevented the cluster from powering on_"

# **Impact Analysis**

All customers using one of our specific segmentation databases saw interruptions in service in Unified Contact View (UCV), Segmentation, and content loading tools in Simon.

# **Remediation Plan**

## **Quicker Time to Detection & Alerting**

Simon Data has reviewed the process we use for publishing incidents to our status page. We have cut out the manual steps that delayed the alert to our customers longer than expected.

## **Quicker Time to Resolution & Recovery**

Simon Data has revisited the design and usage of our segmentation databases and added a new process to validate that maintenance has been conducted correctly. If maintenance doesn't finish as intended, an immediate migration to another segmentation database will occur to stem interruption of service for Simon Data customers.
Status: Postmortem
Impact: Major | Started At: July 27, 2022, 11:39 a.m.
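The remediation plan in the July 27 postmortem above describes a check that maintenance finished as intended, with migration to another segmentation database if it did not. The following is only an illustrative sketch of that kind of decision rule, not Simon Data's actual implementation: the `MaintenanceStatus` record, the `decide` function, and the 15-minute grace period are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    NONE = "none"        # maintenance finished; nothing to do
    WAIT = "wait"        # still inside the allowed window
    MIGRATE = "migrate"  # overran the window; move to another database


@dataclass
class MaintenanceStatus:
    # Hypothetical record; the field names are assumptions for illustration.
    started_at: datetime
    expected_duration: timedelta
    completed: bool


def decide(status: MaintenanceStatus, now: datetime,
           grace: timedelta = timedelta(minutes=15)) -> Action:
    """Pick an action for a database that has entered maintenance.

    The grace period is an assumed tolerance, not a documented value.
    """
    if status.completed:
        return Action.NONE
    deadline = status.started_at + status.expected_duration + grace
    # Only trigger migration once the deadline has strictly passed.
    return Action.MIGRATE if now > deadline else Action.WAIT
```

A scheduler could call `decide` periodically with the current time and start a migration the first time it returns `Action.MIGRATE`. Keeping the rule a pure function of the status and the clock makes it easy to test without touching any real database.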
Description: This issue is now resolved. The root cause was a buildup of queued tasks, which slowed site performance. We have addressed the performance impact and are also working on improvements to increase our processing capacity. Thank you for your patience as we worked to resolve this!
Status: Resolved
Impact: None | Started At: June 16, 2022, 7:25 p.m.
Description: This incident has been resolved and all Event Trigger Flows (ETFs) are now in full recovery.
Status: Resolved
Impact: Major | Started At: June 11, 2022, 5:18 a.m.