Last checked: 7 minutes ago
Get notified about any outages, downtime or incidents for Kustomer and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Kustomer.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for Kustomer:
Component | Status
---|---
Regional Incident | Active |
Prod1 (US) | Active |
Analytics | Active |
API | Active |
Bulk Jobs | Active |
Channel - Chat | Active |
Channel - Email | Active |
Channel - Facebook | Active |
Channel - Instagram | Active |
Channel - SMS | Active |
Channel - Twitter | Active |
Channel - WhatsApp | Active |
CSAT | Active |
Events / Audit Log | Active |
Exports | Active |
Knowledge base | Active |
Kustomer Voice | Active |
Notifications | Active |
Registration | Active |
Search | Active |
Tracking | Active |
Web Client | Active |
Web/Email/Form Hooks | Active |
Workflow | Active |
Prod2 (EU) | Active |
Analytics | Active |
API | Active |
Bulk Jobs | Active |
Channel - Chat | Active |
Channel - Email | Active |
Channel - Facebook | Active |
Channel - Instagram | Active |
Channel - SMS | Active |
Channel - Twitter | Active |
Channel - WhatsApp | Active |
CSAT | Active |
Events / Audit Log | Active |
Exports | Active |
Knowledge base | Active |
Kustomer Voice | Active |
Notifications | Active |
Registration | Active |
Search | Active |
Tracking | Active |
Web Client | Active |
Web/Email/Form Hooks | Active |
Workflow | Active |
Third Party | Active |
OpenAI | Active |
PubNub | Active |
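If you prefer to check these components programmatically rather than on the page above, many vendor status pages (including those built on Atlassian Statuspage) expose a public JSON API. The sketch below is a minimal example of polling such an endpoint and flagging anything that is not operational; the host name and the assumption that Kustomer's status page follows the Statuspage v2 component format are ours, not something stated on this page.

```python
# Minimal sketch: poll a Statuspage-style "components" endpoint and print
# any component that is not fully operational. The URL is an assumption --
# substitute the vendor's actual status page host if it differs.
import json
import urllib.request

STATUS_URL = "https://status.kustomer.com/api/v2/components.json"  # assumed host

def fetch_component_statuses(url: str = STATUS_URL) -> list[dict]:
    """Return the list of components reported by the status page."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return payload.get("components", [])

if __name__ == "__main__":
    for component in fetch_component_statuses():
        if component.get("status") != "operational":
            print(f"{component['name']}: {component['status']}")
```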
View the latest incidents for Kustomer and check for official updates:
Description: After careful monitoring, our team has found that all affected areas are fully restored, and the issue with knowledge base domains has been resolved. Please reach out to support at [email protected] if you have additional questions or concerns.
Status: Resolved
Impact: Minor | Started At: Sept. 9, 2022, 4:58 p.m.
Description:
# **Summary**
At 9:41 AM ET on August 16th, a deployment in Kustomer’s workflow service caused an issue that prevented the service from reading a subset of queued workflow events. These events blocked the service from processing events to trigger workflows at the typical rate until the service was successfully rolled back to a previous version at 10:30 AM. This issue caused delays in processing inbound messages for Kustomer instances on PROD 1. Outbound messaging was not affected. While our primary alerting triggered as expected and our on-call engineering team began to investigate the issue, our secondary alerting failed, and our support team was not notified. This combination delayed the team’s ability to revert to our backup communication plan designed to address incidents like these so that we can notify our customer base and respond to incoming inquiries. The root cause of this issue has since been fixed, but the team continues to focus on process, automation, monitoring, and technical improvements that could prevent similar issues from occurring in the future.
# **Root Cause**
When releasing a change in our workflow service, an upgrade to a library dependency in the service also occurred. This version had an unknown compatibility issue that only presented itself when the workflow service tried to pull “large-format” events from the event queue. This eventually led the service to stall, significantly impacting the throughput of events through the service. In our testing of the service prior to release, we did not encounter a sufficient volume of these large events to replicate the issue until it hit our production services, where these infrequent events were enough to block the queue.
# **Timeline**
* 08/16 9:41 am – Workflow service deployment is fully rolled out to production; latency begins to rise
* 08/16 9:53 am – An alert is triggered and our Kustomer engineering team begins to investigate the issue
* 08/16 10:21 am – Support is alerted of the issue and triggers the incident response process
* 08/16 10:30 am – Workflow service deployment is rolled back; events begin to process at normal speed
* 08/16 10:30 am - 11:00 am – Kustomer team adjusts and monitors the workflow service to catch up and process the remaining backlog of events
# **Lessons & Improvements (Completed and Planned)**
## **Addressing the Root Cause + Safeguards for the Future**
* Patch the library at the root of the problem and pin it to a specific working version so that it cannot be upgraded without an explicit decision to do so. _**[DONE]**_
* Enforce stricter / fixed versioning in all other Kustomer libraries so that they do not risk similar upgrade issues in any service.
* Explore quick paths to a staged/canary deployment system for our backend worker services so that issues can be detected sooner and with minimal impact – getting earlier signals to avoid system-wide incidents like this one.
* Invest in automated test suites focused on “large-format” events to ensure that issues in processing them are caught before any production traffic is impacted in any service.
## **Monitoring and Incident Response Process**
* Audit all monitors in place for the workflow service (and similar services) with a goal of improving on the notification time for this incident.
* Adjust incident response training and documentation to enforce secondary alerting.
* Revisit our external communication process to identify areas where we can improve our speed to notification when we are alerted of issues.
Status: Postmortem
Impact: Major | Started At: Aug. 16, 2022, 2:30 p.m.
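One of the planned improvements in the postmortem above is an automated test suite around “large-format” events. Purely as an illustration, here is a minimal sketch of what such a regression test could look like, assuming a hypothetical `process_event` handler (the real workflow-service code is not public) and using an oversized synthetic payload with a fixed processing time budget:

```python
# Hypothetical sketch of a "large-format" event regression test:
# build an oversized synthetic payload and assert that the (assumed)
# event handler finishes within a fixed time budget instead of stalling.
import json
import time

def process_event(event: dict) -> None:
    """Placeholder for the real workflow-service event handler."""
    json.dumps(event)  # stand-in work; the actual handler is not public

def make_large_event(size_kb: int = 512) -> dict:
    """Create a synthetic event payload much larger than typical traffic."""
    return {"type": "workflow.triggered", "payload": "x" * (size_kb * 1024)}

def test_large_format_event_completes_quickly():
    event = make_large_event()
    start = time.monotonic()
    process_event(event)
    assert time.monotonic() - start < 1.0, "large-format event stalled the handler"
```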
Description:
# **Summary**
On August 10th, 2022, at 3:20 AM EDT, the Kustomer team started to receive transient alerts for a high error rate in the Instagram service that was caused by errors in our requests to the Instagram API. After discovering the root cause, the incident was resolved at approximately 2:20 PM EDT.
# **Impact**
Users of the application across organizations would have been unable to send messages via Instagram that specifically included a link inside the message body from 3:20 AM - 2:20 PM EDT. Outbound Instagram messages that did not contain a link would still have been successful during this time.
# **Next Steps**
The engineering team will be updating the runbook for handling potential Instagram API issues to increase the speed with which we identify the root cause of any potential issues.
Status: Postmortem
Impact: Minor | Started At: Aug. 10, 2022, 1:16 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.