Outage and incident data over the last 30 days for Kustomer.
OutLogger tracks the status of these components for Kustomer:
| Component | Status |
|---|---|
| Regional Incident | Active |
| **Prod1 (US)** | Active |
| Analytics | Active |
| API | Active |
| Bulk Jobs | Active |
| Channel - Chat | Active |
| Channel - Email | Active |
| Channel - Facebook | Active |
| Channel - Instagram | Active |
| Channel - SMS | Active |
| Channel - Twitter | Active |
| Channel - WhatsApp | Active |
| CSAT | Active |
| Events / Audit Log | Active |
| Exports | Active |
| Knowledge base | Active |
| Kustomer Voice | Active |
| Notifications | Active |
| Registration | Active |
| Search | Active |
| Tracking | Active |
| Web Client | Active |
| Web/Email/Form Hooks | Active |
| Workflow | Active |
| **Prod2 (EU)** | Active |
| Analytics | Active |
| API | Active |
| Bulk Jobs | Active |
| Channel - Chat | Active |
| Channel - Email | Active |
| Channel - Facebook | Active |
| Channel - Instagram | Active |
| Channel - SMS | Active |
| Channel - Twitter | Active |
| Channel - WhatsApp | Active |
| CSAT | Active |
| Events / Audit Log | Active |
| Exports | Active |
| Knowledge base | Active |
| Kustomer Voice | Active |
| Notifications | Active |
| Registration | Active |
| Search | Active |
| Tracking | Active |
| Web Client | Active |
| Web/Email/Form Hooks | Active |
| Workflow | Active |
| **Third Party** | Active |
| OpenAI | Active |
| PubNub | Active |
View the latest incidents for Kustomer and check for official updates:
Description:

# Post Mortem: Prod1 Workflow Latency On May 2nd 2024

**Summary**

On May 2, 2024, customers on the Prod1 cluster experienced elevated latency from workflows, leading to delays in data appearing, updating, and being routed in the platform.

**Root Cause**

An internal change to reconfigure how we distributed automation traffic across Kustomer servers caused a service to become unresponsive due to excessive load, leading to a failure to automatically scale. Kustomer engineers were needed to manually scale that service and related services.

**Timeline**

* May 2, 2024, 2:29 PM EDT - A configuration change was introduced into the system, shifting additional traffic onto a core service
* May 2, 2024, 2:37 PM EDT - The on-call engineer was alerted to increased latency on the core service
* May 2, 2024, 3:00 PM EDT - The root cause was identified and engineers began manually scaling systems
* May 2, 2024, 3:15 PM EDT - The core service was healthy and began catching up on the backlog of events
* May 2, 2024, 3:32 PM EDT - The system fully caught up on the backlog of workflow events. After ensuring stability, engineers began redriving a small number of workflow events that had failed due to latency
* May 2, 2024, 4:00 PM EDT - All events were redriven and system health was normal

**Lessons/Improvements**

* **Release Process:** We identified a potential improvement in how we review and release sensitive changes and will be introducing a new process to provide additional redundancy and oversight when making changes that significantly increase traffic to a service.
* **Scaling Adjustments:** We identified some inefficiencies in how the related services scale and have implemented improvements to prevent a recurrence of this pattern.
Status: Postmortem
Impact: Major | Started At: May 2, 2024, 7:31 p.m.
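The recovery step in the timeline above, redriving the workflow events that failed during the latency spike, follows a common replay pattern: failed events are re-submitted in small, throttled batches so the recovering service is not pushed back into overload, and anything that fails again is set aside for review. The sketch below is a minimal, purely illustrative version of that pattern; the function names, batch size, and pause are hypothetical, and it is not Kustomer's actual implementation.

```python
import time
from collections import deque

# Purely illustrative sketch (not Kustomer's implementation): replay workflow
# events that failed during a latency spike in small, throttled batches so the
# recovering service is not pushed back into overload. All names, batch sizes,
# and delays here are hypothetical.

def redrive(failed_events, process, batch_size=50, pause_seconds=1.0):
    """Replay failed events in batches; collect any that fail again for review."""
    queue = deque(failed_events)
    still_failing = []
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        for event in batch:
            try:
                process(event)  # hand the event back to the (hypothetical) workflow engine
            except Exception:
                still_failing.append(event)  # keep for a later attempt or manual review
        time.sleep(pause_seconds)  # throttle so the backlog drains gradually
    return still_failing

if __name__ == "__main__":
    def process(event):
        # Toy stand-in for reprocessing; fails for one event on purpose.
        if event["id"] == 3:
            raise RuntimeError("transient failure")

    events = [{"id": i} for i in range(5)]
    leftover = redrive(events, process, batch_size=2, pause_seconds=0.1)
    print("events needing manual review:", leftover)
```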
Description:

**Summary**

On April 15, 2024, customers on the Prod2 cluster experienced elevated latency and error rates on multiple features of the Kustomer product.

**Root Cause**

A bulk operation resulted in an extremely high number of events within the system in a very short period of time, and the system was initially unable to scale fast enough to handle the load, resulting in a 2-hour period of instability.

**Timeline (Apr 15, 2024)**

* 6:28 AM EDT - Our on-call engineers were alerted to an incident of high error rates in the platform, kicking off an investigation
* 7:51 AM EDT - Kustomer's support team began receiving reports of a portion of agents being unable to access the platform
* 8:58 AM EDT - The bulk operation that caused the issue was disabled by our engineers
* 9:55 AM EDT - Latency recovered and error rates decreased to pre-incident levels
* 12:08 PM EDT - All related services fully recovered

**Lessons/Improvements**

* **Bulk Jobs:** We identified a bug in our bulk job logic that could lead to larger-than-expected jobs running, and also identified some opportunities for improvement in how we rate limit bulk jobs and isolate them from the rest of the system.
  * We have fixed a bug in our bulk operations that caused the original bulk job to update many more records than expected.
  * We are actively evaluating improvements to the rate limiting of bulk operations and plan to implement changes in the coming weeks.
* **Scaling:** We identified some inefficiencies in our scaling strategies related to recent changes in platform usage.
  * We have made short-term improvements to our scaling policies to increase platform stability as we investigate longer-term solutions.
  * We are actively planning changes to isolate automation traffic in our APIs from web user traffic to prevent automations from destabilizing our web interface, and we plan to implement these changes in the coming weeks.
* **Monitoring:** We are evaluating options to improve visibility of automation activity to make it easier to identify automations that are disproportionately impacting the system.
Status: Postmortem
Impact: Critical | Started At: April 15, 2024, 12:20 p.m.
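The bulk-jobs improvement mentioned above, rate limiting bulk operations so a single large job cannot flood the event pipeline, is commonly built on a token bucket: each record consumes a token, and tokens refill at a fixed rate, which caps sustained throughput while still allowing small bursts. The sketch below is a minimal, purely illustrative version under hypothetical limits; it is not Kustomer's actual mechanism.

```python
import time

# Purely illustrative sketch (not Kustomer's implementation): cap the rate at
# which a bulk job emits work using a token bucket, so one large job cannot
# flood the event pipeline. The rate and capacity below are hypothetical.

class TokenBucket:
    def __init__(self, rate_per_second: float, capacity: int):
        self.rate = rate_per_second      # tokens added per second
        self.capacity = capacity         # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def acquire(self, n: int = 1) -> None:
        """Block until n tokens are available, refilling the bucket over time."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.rate)  # sleep only as long as needed

def run_bulk_update(records, apply_update, bucket):
    """Apply a bulk update one record at a time, never exceeding the bucket's rate."""
    for record in records:
        bucket.acquire()
        apply_update(record)

if __name__ == "__main__":
    bucket = TokenBucket(rate_per_second=100, capacity=20)  # hypothetical limits
    start = time.monotonic()
    run_bulk_update(range(200), lambda record: None, bucket)
    print(f"200 records processed in {time.monotonic() - start:.2f}s at a capped rate")
```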
Description: Kustomer has redriven affected messages, and Meta has resolved the incident on their status page for the component affecting WhatsApp within Kustomer: https://metastatus.com/whatsapp-business-api. For any questions or concerns, please reach out to [email protected]
Status: Resolved
Impact: Minor | Started At: April 3, 2024, 7:37 p.m.