Get notified about any outages, downtime or incidents for Kustomer and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Kustomer.
OutLogger tracks the status of these components for Kustomer:
Component | Status |
---|---|
Regional Incident | Active |
Prod1 (US) | Active |
Analytics | Active |
API | Active |
Bulk Jobs | Active |
Channel - Chat | Active |
Channel - Email | Active |
Channel - Facebook | Active |
Channel - Instagram | Active |
Channel - SMS | Active |
Channel - Twitter | Active |
Channel - WhatsApp | Active |
CSAT | Active |
Events / Audit Log | Active |
Exports | Active |
Knowledge base | Active |
Kustomer Voice | Active |
Notifications | Active |
Registration | Active |
Search | Active |
Tracking | Active |
Web Client | Active |
Web/Email/Form Hooks | Active |
Workflow | Active |
Prod2 (EU) | Active |
Analytics | Active |
API | Active |
Bulk Jobs | Active |
Channel - Chat | Active |
Channel - Email | Active |
Channel - Facebook | Active |
Channel - Instagram | Active |
Channel - SMS | Active |
Channel - Twitter | Active |
Channel - WhatsApp | Active |
CSAT | Active |
Events / Audit Log | Active |
Exports | Active |
Knowledge base | Active |
Kustomer Voice | Active |
Notifications | Active |
Registration | Active |
Search | Active |
Tracking | Active |
Web Client | Active |
Web/Email/Form Hooks | Active |
Workflow | Active |
Third Party | Active |
OpenAI | Active |
PubNub | Active |
View the latest incidents for Kustomer and check for official updates:
Description:

# **Summary**

On July 27th 2023 at 11:32am ET, several components of the Kustomer platform became unavailable for organizations hosted in our Prod1 (US) instance. This was caused by the indirect removal of records within a single table in the database cluster that holds customer data, due to an incorrectly applied automation script. The Kustomer engineering team was immediately notified and began work on restoring normal operation.

At 2:40pm, the customers database was restored from a snapshot but was still operating with degraded performance. The platform began operating at full performance at 6:31pm ET, and the team shifted to re-enabling all integrations and restoring data that was unavailable due to the outage. All integrations were enabled by 9:02pm, ending 9 hours and 18 minutes of system impact with full functionality being restored.

By 9:35am on July 28th, all customer records created during the outage were restored, and by 10:10pm on July 28th all backend data integrations from other systems and automations that had been unable to run during the outage were re-run to restore data. Over the weekend, the Kustomer team continued to monitor the health of the platform and identified and resolved several smaller data issues impacting a small subset of customers created during the original outage. These were fully resolved by 12:30pm on July 31st.

Updates to customer records on July 27th between 10:20am and 11:40am may have been impacted. Data from client-side integrations, such as Amazon Connect, could not be fully restored for the period of the incident.

# **Root Cause**

Our team performed a routine database migration to expand the capacity of our customers database with zero downtime, which completed on July 21st. As part of the cleanup process initiated on July 27th to remove the older database table, a step in the process was not completed, and as a result, a subsequent step to clean up the old database table resulted in deletion transactions being replicated to the new cluster. This rendered timelines and customer records inaccessible until the data was restored. The database backup restoration was delayed due to a series of challenges, including issues with our database vendor’s restoration processes.

# **Timeline**

* 07/27 11:29am ET - Customer records became inaccessible, resulting in error messages in the Kustomer platform.
* 07/27 11:32am - The issue is reported to the Kustomer engineering team and they begin investigating.
* 07/27 12:08pm - The problem is identified and the team begins working on initiating a database restore. The initial restore begins 14 minutes later but stalls.
* 07/27 12:30pm - Kustomer engineers initiate discussions with our database vendor to diagnose the problems with the restore operation.
* 07/27 2:41pm - The restored data becomes partially available, but the team encounters additional vendor-related challenges during the restore, which result in further delays.
* 07/27 6:31pm - Database full restore completes and the platform begins operating normally, with the exception of 404 errors when referencing customers created during the outage and prior to the restore.
* 07/27 9:02pm - Kustomer engineers validate that the platform is operating normally, processing automations and incoming data. At this point, with the incident resolved, the team begins to focus on monitoring to ensure the system continues to operate properly and starts working through data repair.
* 07/28 9:30am - Customer records that were created during the outage are recreated in the system, and Kustomer engineers continue data repair efforts.
* 07/28 12:00pm - The Kustomer platform experiences high latency and error rate for a 10 minute period due to high load from data restoration efforts.
* 07/28 ~5:00pm - Searches experienced a period of high latency and occasional errors due to an unrelated incident. Kustomer will be publishing a separate post-mortem for this event.
* 07/28 10:10pm - All data records and automations fully restored.
* 07/29 10:14pm - Kustomer engineers finalize repairs to duplicate customer records created as part of the initial cleanup process.

# **Lessons/Improvements**

* **Database restore functionality and disaster recovery process creation** - The database restore took significantly longer than necessary due to a number of issues related to vendor-specific configurations and limitations. We are working closely with our database vendor to investigate and implement alternative database restore functionality and disaster recovery processes with a goal of significantly minimizing time to restore.
* **Implement technical controls as additional layers of protection in our data migration process** - We are working to automate more of our database migration processes to encode safety checks and minimize the possibility of human error.
* **Close monitoring gaps** - It took a few minutes to be notified of issues with the platform. We are addressing some gaps in our monitoring that will allow us to assess impact to systems faster in the case of future incidents.
* **Strengthen Documentation** - Although our processes were well documented, there is room to improve documentation further. We are updating our documentation and adding training material for the engineering team on best practices for restoring data after an incident without interrupting service.
* **Resiliency and Data Recovery** - Client-side integrations do not have the same level of guarantees as our standard backend channel & application integrations. We are looking at ways to improve our Amazon Connect Integration to allow for greater resiliency and data recovery in the case of service interruptions.
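
The root cause describes a destructive cleanup step that ran although an earlier step had not been completed, and the improvements mention encoding safety checks into the migration process. The post-mortem does not say how Kustomer implements such checks; the following is only a minimal sketch of one possible guard, assuming a runbook modeled as an ordered list of steps. The `Runbook` class, `StepOrderError`, and the step names are all hypothetical and do not come from the post-mortem.

```python
class StepOrderError(RuntimeError):
    """Raised when a step is attempted before its predecessors are done."""


class Runbook:
    """Hypothetical ordered checklist: a step may only be marked complete
    once every step listed before it has been completed."""

    def __init__(self, steps: list[str]) -> None:
        self.steps = steps
        self.done: set[str] = set()

    def complete(self, step: str) -> None:
        idx = self.steps.index(step)  # ValueError for unknown steps
        missing = [s for s in self.steps[:idx] if s not in self.done]
        if missing:
            raise StepOrderError(
                f"cannot run {step!r}; incomplete earlier steps: {missing}"
            )
        self.done.add(step)


# Step names are illustrative assumptions, not Kustomer's actual procedure.
runbook = Runbook(["verify_new_cluster", "stop_replication", "drop_old_table"])
runbook.complete("verify_new_cluster")
try:
    runbook.complete("drop_old_table")  # skipped "stop_replication"
except StepOrderError as err:
    print(f"blocked: {err}")
```

In this sketch the destructive step is refused until the skipped step has been recorded, which is the kind of ordering error the root cause describes.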
Status: Postmortem
Impact: Critical | Started At: July 27, 2023, 3:44 p.m.
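
The post-mortem gives several candidate start times for the impact window, so the stated 9 hours and 18 minutes can be checked with simple arithmetic. The sketch below measures the time from each candidate to the 9:02pm point at which all integrations were enabled. It rests on two assumptions that the page does not confirm: that the "Started At" value of 3:44 p.m. is in UTC, and that ET on that date was UTC-4, which together would put the start at 11:44am ET. The helper name `elapsed` is illustrative.

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M%p"


def elapsed(start: str, end: str) -> str:
    """Return the time between two same-zone timestamps as 'Hh MMm'."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


END = "07/27/2023 9:02pm"  # all integrations enabled (from the Summary)

# Records became inaccessible (Timeline)
print(elapsed("07/27/2023 11:29am", END))  # 9h 33m
# Components reported unavailable (Summary)
print(elapsed("07/27/2023 11:32am", END))  # 9h 30m
# "Started At" 3:44 p.m., ASSUMED to be UTC, converted to ET
print(elapsed("07/27/2023 11:44am", END))  # 9h 18m
```

Under those assumptions, only the status-page start time reproduces the 9 hours and 18 minutes quoted in the Summary.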
Description: An incident affecting Search & Reporting in POD1 has been resolved. Elastic Cloud, a third-party provider, has implemented a fix on its end, and the Kustomer team has reindexed platform data where needed. Please reach out to Kustomer's support team with any questions or concerns.
Status: Resolved
Impact: Minor | Started At: July 22, 2023, 5:09 p.m.
Description: Kustomer has resolved an event affecting Knowledge Base Forms in Prod 2. After monitoring, our team has found that affected areas are fully restored. Please reach out to support at [email protected] if you have additional questions or concerns.
Status: Resolved
Impact: Minor | Started At: July 20, 2023, 4:44 p.m.
Description: Kustomer is redriving all messages to WhatsApp shortly. Please reach out to support at [email protected] if you have additional questions or concerns.
Status: Resolved
Impact: Minor | Started At: July 19, 2023, 8:32 p.m.