Last checked: 4 minutes ago
Get notified about any outages, downtime or incidents for Kentik SaaS US Cluster and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Kentik SaaS US Cluster.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
Outlogger tracks the status of these components for Kentik SaaS US Cluster:
Component | Status |
---|---|
Alerting and Mitigation Services | Active |
Flow Ingest | Active |
NMS | Active |
Notifications | Active |
Query | Active |
REST API | Active |
Web Portal | Active |
BGP | Active |
BGP Monitoring and Alerting | Active |
BGP Peering and Enrichment | Active |
Cloud Ingest | Active |
AWS Ingest | Active |
Azure Ingest | Active |
GCP Ingest | Active |
Synthetics | Active |
Synthetics Alerting | Active |
Synthetics Ingest | Active |
View the latest incidents for Kentik SaaS US Cluster and check for official updates:
Description: **ROOT CAUSE** Our inbound proxy/load balancer was configured with a small concurrent connection pool for these API paths. At approximately 13:30 UTC, a few high-volume paths were brought online that filled this pool and caused requests to queue to the point where we could not catch up. This caused periodic 503 and 429 responses from our API.

**RESOLUTION** At approximately 17:45 UTC, the connection pool size was increased to address this issue. We have also raised the severity of the internal alerts monitoring these metrics so that we can more quickly identify and resolve similar events in the future.
Status: Postmortem
Impact: Major | Started At: Oct. 20, 2022, 1:30 p.m.
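The description above does not name the proxy software or its exact settings, so the following is only a hedged sketch of the failure mode in Go (hypothetical backend URL and pool size, not Kentik's actual proxy or configuration): a reverse proxy that caps in-flight requests with a fixed-size pool and sheds load once the pool is full, which is the behavior a too-small connection pool produces when high-volume paths are brought online.

```go
// Hypothetical sketch of the failure mode described above: an inbound
// reverse proxy with a fixed concurrent-connection pool. When the pool is
// exhausted, excess requests are rejected with 503 (a proxy might instead
// throttle with 429), matching the periodic 503/429 responses reported.
// Illustrative only; this is not Kentik's proxy or configuration.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

const maxConcurrent = 64 // assumed pool size, too small for the newly added high-volume paths

func main() {
	backend, err := url.Parse("http://api-backend.internal:8080") // hypothetical backend
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(backend)

	// Buffered channel used as a counting semaphore for in-flight requests.
	pool := make(chan struct{}, maxConcurrent)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		select {
		case pool <- struct{}{}: // slot available: forward the request
			defer func() { <-pool }()
			proxy.ServeHTTP(w, r)
		default: // pool exhausted: shed load instead of queueing indefinitely
			http.Error(w, "service unavailable", http.StatusServiceUnavailable)
		}
	})

	log.Fatal(http.ListenAndServe(":8000", nil))
}
```

Raising `maxConcurrent` (the fix applied at 17:45 UTC) relieves the immediate pressure; the alerting change mentioned in the resolution is what shortens detection the next time the pool runs hot.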
Description: **ROOT CAUSE** This incident was part of a series of incidents caused by bottlenecking in a load balancing system we placed in front of our query engine on 2022-09-01. This load balancer is shared across many of our underlying services, so many upstream Kentik portal pages were affected in different ways. The bottlenecking only occurred during peak query usage, at which time the load balancer would begin hitting its global connection limits.

**RESOLUTION** Because this issue only occurred during our peak query times, it took us much longer than desired to identify the pattern and isolate a root cause. Each business day starting 2022-09-06, we would see elevated response times around the same time of day, but no obvious culprits based on metrics, logs, or traces. For the first few days, Kentik Engineering teams identified potential performance bottlenecks in various software services based on trace data, rolled out patches, and saw improved response times. While these changes did improve the performance of various services, the improvements observed immediately after patch deployments were false positives: the patches rolled out during off-peak hours, while the root issue coincided with our query peak.

After hitting our query peaks from 2022-09-06 through 2022-09-08, we began to see the pattern emerge, but still could not clearly point at a root cause. The biggest blocker was that our load balancer was not reporting the bottlenecking in any fashion. In fact, when a Kentik Portal user loaded a page whose request went through this load balancer, we would see nominal response times reported by the load balancer but elevated response times reported by the web server. This led us to believe there was a performance issue on our web servers and to focus much of our efforts there for the first few days. In addition to software improvements, the team allocated 66% more hardware capacity for our web servers, hoping this would buy us headroom to identify the true root cause, but to no avail.

It was only after looking back at macro trends several days into the incident, and seeing a very slight decrease in overall responsiveness and increased error rates coinciding with our load balancer changes, that we began to investigate it as a potential root cause. Our load balancer employs several concurrency limits, and the addition of query load caused us to hit these limits during query peaks. We could clearly see this in concurrent connection metrics, but we did not have monitoring for this scenario, nor did the load balancer log or otherwise indicate it was occurring. It would queue requests and silently incur delays while reporting nominal request and response times in its latency metrics.

On 2022-09-15, Kentik Engineering removed the query load from this load balancer, and performance returned to consistently nominal levels. However, doing this rollback in conjunction with rapidly deploying new hardware for the web portal caused different bottlenecks in our query system during query peaks, which were exactly the bottlenecks we had been anticipating and trying to get ahead of by putting the load balancer in play in the first place. On 2022-09-21, Kentik Engineering was able to get all affected systems into a nominal state in terms of query performance and overall latency.

**FOLLOW UP** The team is now focused on adding several layers of observability to our platform in order to improve our ability to respond to these types of incidents. In addition to more thorough monitoring of all components of our infrastructure, we are focused on identifying performance issues more proactively. During Q4 2022, our team will be working towards:

* Adding more tracing to the Kentik Portal itself in order to get more visibility into browser-side/browser-observed performance
* Leveraging Kentik Synthetics to actively monitor performance of key workflows in the Kentik Portal
* Increasing our usage of Kentik Host Monitoring to more quickly identify performance issues via Kentik Alerting

Please contact your Customer Success team or [[email protected]](mailto:[email protected]) if you have any further questions or concerns.
Status: Postmortem
Impact: Minor | Started At: Sept. 21, 2022, 6:15 p.m.
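The central lesson of the postmortem above is that the load balancer queued requests silently: its own latency metrics stayed nominal while callers waited. The source does not say which load balancer or monitoring stack is involved, so the snippet below is only an illustrative sketch in Go (hypothetical limit and threshold) of the missing signal: middleware that tracks in-flight requests against the configured concurrency limit and logs a warning once utilization approaches it.

```go
// Hypothetical sketch (not Kentik's implementation): HTTP middleware that
// tracks in-flight requests against a known concurrency limit and logs a
// warning when utilization gets close to it. The postmortem notes that the
// load balancer hit its limits silently; surfacing saturation as a metric
// or log line is the missing signal being illustrated here.
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

const concurrencyLimit = 512 // assumed global connection limit

var inFlight int64

func saturationMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		n := atomic.AddInt64(&inFlight, 1)
		defer atomic.AddInt64(&inFlight, -1)

		// Warn at 80% utilization so the on-call sees queueing before
		// requests start backing up, rather than inferring it later from
		// mismatched upstream/downstream latency numbers.
		if float64(n) >= 0.8*float64(concurrencyLimit) {
			log.Printf("WARN: %d/%d concurrent requests in flight (%.0f%%)",
				n, concurrencyLimit, 100*float64(n)/float64(concurrencyLimit))
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8000", saturationMiddleware(mux)))
}
```

In a real deployment this would be exported as a gauge and alerted on rather than logged, but the design point is the same: saturation has to be observable at the component doing the queueing, not inferred later from mismatched upstream and downstream latency numbers.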
Description: **ROOT CAUSE** This incident was part of a series of incidents caused by bottlenecking in a load balancing system we placed in front of our query engine on 2022-09-01. This load balancer is shared across many of our underlying services, so many upstream Kentik portal pages were affected in different ways. The bottlenecking only occurred during peak query usage, at which time the load balancer would begin hitting its global connection limits.

**RESOLUTION** Because this issue only occurred during our peak query times, it took us much longer than desired to identify the pattern and isolate a root cause. Each business day starting 2022-09-06, we would see elevated response times around the same time of day, but no obvious culprits based on metrics, logs, or traces. For the first few days, Kentik Engineering teams identified potential performance bottlenecks in various software services based on trace data, rolled out patches, and saw improved response times. While these changes did improve the performance of various services, the improvements observed immediately after patch deployments were false positives: the patches rolled out during off-peak hours, while the root issue coincided with our query peak.

After hitting our query peaks from 2022-09-06 through 2022-09-08, we began to see the pattern emerge, but still could not clearly point at a root cause. The biggest blocker was that our load balancer was not reporting the bottlenecking in any fashion. In fact, when a Kentik Portal user loaded a page whose request went through this load balancer, we would see nominal response times reported by the load balancer but elevated response times reported by the web server. This led us to believe there was a performance issue on our web servers and to focus much of our efforts there for the first few days. In addition to software improvements, the team allocated 66% more hardware capacity for our web servers, hoping this would buy us headroom to identify the true root cause, but to no avail.

It was only after looking back at macro trends several days into the incident, and seeing a very slight decrease in overall responsiveness and increased error rates coinciding with our load balancer changes, that we began to investigate it as a potential root cause. Our load balancer employs several concurrency limits, and the addition of query load caused us to hit these limits during query peaks. We could clearly see this in concurrent connection metrics, but we did not have monitoring for this scenario, nor did the load balancer log or otherwise indicate it was occurring. It would queue requests and silently incur delays while reporting nominal request and response times in its latency metrics.

On 2022-09-15, Kentik Engineering removed the query load from this load balancer, and performance returned to consistently nominal levels. However, doing this rollback in conjunction with rapidly deploying new hardware for the web portal caused different bottlenecks in our query system during query peaks, which were exactly the bottlenecks we had been anticipating and trying to get ahead of by putting the load balancer in play in the first place. On 2022-09-21, Kentik Engineering was able to get all affected systems into a nominal state in terms of query performance and overall latency.

**FOLLOW UP** The team is now focused on adding several layers of observability to our platform in order to improve our ability to respond to these types of incidents. In addition to more thorough monitoring of all components of our infrastructure, we are focused on identifying performance issues more proactively. During Q4 2022, our team will be working towards:

* Adding more tracing to the Kentik Portal itself in order to get more visibility into browser-side/browser-observed performance
* Leveraging Kentik Synthetics to actively monitor performance of key workflows in the Kentik Portal
* Increasing our usage of Kentik Host Monitoring to more quickly identify performance issues via Kentik Alerting

Please contact your Customer Success team or [[email protected]](mailto:[email protected]) if you have any further questions or concerns.
Status: Postmortem
Impact: Minor | Started At: Sept. 14, 2022, 5 p.m.
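One of the follow-up items above is to use Kentik Synthetics to actively monitor key portal workflows, precisely because server-side latency looked nominal while users saw slow pages. As a rough, hypothetical illustration of that idea (not Kentik Synthetics itself; the URL, latency budget, and interval are made up), a minimal scripted check that times an end-to-end request and flags it when it exceeds a budget could look like this:

```go
// Hypothetical synthetic-style check (not Kentik Synthetics): periodically
// time an end-to-end request to a key workflow and flag it when it exceeds
// a latency budget. Measuring from the client side catches queueing that a
// load balancer's own latency metrics may not report.
package main

import (
	"log"
	"net/http"
	"time"
)

const (
	target        = "https://portal.example.com/query" // hypothetical workflow URL
	latencyBudget = 2 * time.Second
	interval      = 60 * time.Second
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	for {
		start := time.Now()
		resp, err := client.Get(target)
		elapsed := time.Since(start)

		switch {
		case err != nil:
			log.Printf("ALERT: check failed: %v", err)
		case resp.StatusCode >= 500:
			resp.Body.Close()
			log.Printf("ALERT: status %d after %s", resp.StatusCode, elapsed)
		case elapsed > latencyBudget:
			resp.Body.Close()
			log.Printf("WARN: %s took %s (budget %s)", target, elapsed, latencyBudget)
		default:
			resp.Body.Close()
			log.Printf("ok: %s in %s", target, elapsed)
		}

		time.Sleep(interval)
	}
}
```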
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.