Outage and incident data over the last 30 days for Kixie.
Outlogger tracks the status of these components for Kixie:
Component | Status |
---|---|
Call & SMS Functionality | Active |
Event API | Active |
Cadences | Active |
Cadence Functionality | Active |
Manage Cadences | Active |
Dispositions | Active |
Manage Dispositions | Active |
Hubspot | Performance Issues |
Hubspot C2C | Active |
Hubspot Call Logging | Performance Issues |
Hubspot SMS Logging | Performance Issues |
IVRs | Active |
IVR Functionality | Active |
Manage IVRs | Active |
Pipedrive | Active |
Pipedrive C2C | Active |
Pipedrive Call Logging | Active |
Pipedrive SMS Logging | Active |
Powerlists | Active |
Manage Powerlists | Active |
Powerlist Functionality | Active |
Queues | Active |
Manage Queues | Active |
Queue Functionality | Active |
Reporting | Active |
Agent Reports | Active |
Agent Summary | Active |
Business Reports | Active |
Call History | Active |
Dispositions Reporting | Active |
Inbound Summary | Active |
Queues Reporting | Active |
SMS History | Active |
SMS Reports | Active |
Time Saved Metrics | Active |
Ring Groups | Active |
Manage Ring Groups | Active |
Ring Group Functionality | Active |
Salesforce | Active |
Salesforce C2C | Active |
Salesforce Call Logging | Active |
Salesforce SMS Logging | Active |
Teams | Active |
Manage Teams | Active |
Zoho | Active |
Zoho C2C | Active |
Zoho Call Logging | Active |
Zoho SMS Logging | Active |
View the latest incidents for Kixie and check for official updates:
Description: We have identified the cause of the server issues as mass automation originating from a single account, and we have implemented processes and infrastructure improvements to throttle and prevent these issues in the future.
Status: Postmortem
Impact: Major | Started At: Oct. 12, 2020, 6:59 p.m.
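The remediation described in this update is an account-level throttle on automated traffic. As a minimal sketch only, assuming a token-bucket limiter with illustrative rates and function names (not Kixie's actual infrastructure), it might look like this:

```python
import time
from collections import defaultdict

# Assumed per-account limits for illustration; Kixie's real values are not published.
RATE = 5.0    # requests refilled per second, per account
BURST = 20.0  # maximum burst allowance per account

_buckets = defaultdict(lambda: {"tokens": BURST, "updated": time.monotonic()})

def allow_automation_request(account_id: str) -> bool:
    """Consume one token for the account; return False when the bucket is empty."""
    bucket = _buckets[account_id]
    now = time.monotonic()
    # Refill tokens for the time elapsed since the last request, capped at BURST.
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["updated"]) * RATE)
    bucket["updated"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False  # throttle: reject or defer the request instead of letting it hit the servers
```

A request rejected here would be deferred or dropped, so a single account running a mass automation cannot saturate shared capacity.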
Description:

# 10/6/2020 Incident

| AAR Owner | Keith Muenze |
| --- | --- |
| Incident | 10/06/2020 |
| Priority | P0 |
| Affected Services | All Services |

## Executive Summary

Outbound and inbound calling services were interrupted on 10/06/2020 by an unusually high influx of inbound calls to a single telephone number operated by one of Kixie's clients. Inbound calls to this number were configured to retry a group every 6 seconds. The volume of calls delivered to the number by an automated script eventually overloaded Kixie's servers.

## AAR Report

| Instructions | Report |
| --- | --- |
| **Leadup** List the sequence of events that led to the incident. | Inbound calls can be routed to groups, and a group can call itself, creating a never-ending loop. This process has been replaced by our queuing system, but some clients still use the old groups process. We helped a client use this process to launch an automation, which caused our system to be used in an unexpected manner. |
| **Fault** Describe how the change that was implemented didn't work as expected. If available, include relevant data visualizations. | The automation caused a single groups process to be called 10,000 times per minute. This is well above the typical volume of executions for our servers, which caused the servers to overload and begin queuing activities. |
| **Impact** Describe how internal and external users were impacted during the incident. Include how many support cases were raised. | All services were unavailable. |
| **Detection** Report when the team detected the incident and how they knew it was happening. Describe how the team could've improved time to detection. | We detected the incident when a CPU alert from New Relic notified our team. |
| **Response** Report who responded to the incident and describe what they did at what times. Include any delays or obstacles to responding. | Keith Muenze responded to the emergency. He identified the problem after reviewing the New Relic, EC2, and RDS logs. The New Relic logs showed all servers operating at 100% usage. He reviewed the Performance Insights logs in RDS to identify SQL that might be causing waits and CPU usage; no issues were found with the database. He then reviewed the New Relic transactions log for any specific increase in requests. After some research, he determined that the increase in volume originated in our inbound call processing, and later narrowed it to a specific business and group. At that point, inbound calls to the group were immediately cancelled and service returned to normal. |
| **Recovery** Report how the user impact was mitigated and when the incident was deemed resolved. Describe how the team could've improved time to mitigation. | Recovery could have been accelerated if New Relic provided reporting that shows the velocity increase of executions by function or script. Kixie could also write its own reporting to cover some of this using CloudWatch logs from the servers. That level of reporting would help identify the root cause quickly. New Relic has this data but does not surface it in an immediately digestible report. |
| **Timeline** Detail the incident timeline using UTC to standardize for timezones. Include lead-up events, post-impact events, and any decisions or changes made. | |
| **Five whys root cause identification** Run a 5-whys analysis to understand the true causes of the incident. | |
| **Blameless root cause** Note the final root cause and describe what needs to change, without placing blame, to prevent this class of incident from recurring. | The root cause of the outage was a lack of per-business governance for inbound calls to a given group. We added thresholds for inbound group calls to prevent this type of system abuse in the future. |
Status: Postmortem
Impact: Minor | Started At: Oct. 6, 2020, 5:28 p.m.
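The blameless root cause above points to per-business thresholds on inbound calls to a group. As a rough illustration only (the limit, window, and function names are assumptions, not Kixie's actual implementation), a sliding-window threshold could look like this:

```python
import time
from collections import defaultdict, deque

# Illustrative limits only; the real thresholds are not published by Kixie.
MAX_CALLS = 600       # maximum inbound calls per (business, group) pair...
WINDOW_SECONDS = 60   # ...within this sliding window

_call_times = defaultdict(deque)  # (business_id, group_id) -> recent call timestamps

def accept_inbound_call(business_id: str, group_id: str) -> bool:
    """Return True if the call may be routed to the group, False if throttled."""
    key = (business_id, group_id)
    now = time.monotonic()
    window = _call_times[key]
    # Discard timestamps that have aged out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_CALLS:
        return False  # over the per-business threshold: stop here instead of re-entering the group
    window.append(now)
    return True
```

A rejected call is dropped or diverted rather than routed back into the group, which is what breaks the self-referencing retry loop described in the leadup.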
Description: Incident has been resolved and all Kixie services are fully operational. We will continue to monitor the situation.
Status: Resolved
Impact: Minor | Started At: Oct. 6, 2020, 3:31 p.m.