Get notified about any outages, downtime or incidents for Frontegg and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Frontegg.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Frontegg:
Component | Status
---|---
Audit logs | Active |
Entitlements | Active |
Machine to machine authentication | Active |
Management portal | Active |
Reporting | Active |
SSO & SAML authentication | Active |
User authentication | Active |
Webhooks infrastructure | Active
View the latest incidents for Frontegg and check for official updates:
Description:

## **Executive summary:**

On June 3rd at 12:06 GMT, the Frontegg team received an indication from our monitoring system of increased latency for refresh token requests (average greater than 750 ms) in our US region. Starting at 12:12 GMT, the first customer reached out to Frontegg noting request timeouts. At 12:13 GMT, we updated our status page and officially began the investigation. As a preliminary measure, the team took a number of different mitigation actions in an attempt to remedy the situation as quickly as possible. After seeing no improvement, at 12:30 GMT the team began a full cross-regional disaster recovery protocol. At 12:40 GMT we also began a same-region disaster recovery protocol (starting a new same-region cluster) as part of the escalation to ensure a successful recovery. At 13:25 GMT we began to divert traffic to the new same-region cluster, and by 13:30 GMT we saw traffic to Frontegg stabilize. Upon further investigation, we discovered the root cause to be a networking issue inside our main cluster, which caused a chain reaction affecting the general latency of the cluster. We are also working with our cloud provider to gather additional details on the event from their side.

## **Effect:**

From 12:06 GMT to 13:30 GMT on June 3rd, Frontegg accounts hosted in our US region experienced substantially increased latency on a significant portion of identity-related requests. Many requests timed out, leaving users unable to log in or refresh their tokens. Access to the Frontegg Portal was also partially blocked by this issue.

## **Mitigation and resolution:**

Once the Frontegg team received the initial alert about refresh-token latency, we began an investigation into our traffic, request latency, workload, hanging requests, and database latency. After inconclusive results, the team initiated several mitigation efforts:

* At 12:14 GMT, we scaled up our cluster workloads.
* At 12:30 GMT, the team began a full cross-regional disaster recovery protocol.
* At 12:40 GMT, we also began a same-region disaster recovery protocol (starting a new same-region cluster) as part of the escalation.
* By 13:00 GMT, we increased the number of Kafka brokers as an additional mitigation measure.

After a preliminary check on the new same-region cluster, we began diverting traffic to it. By 13:30 GMT we saw traffic to this cluster stabilize and moved the incident to monitoring. We continued to monitor traffic for the next hour before resolving the incident.

## **Preventive steps:**

* We are adding a same-region hot failover cluster for quick mitigation of P0 issues.
* We are adding finer-grained rate limits on all routes within the system to further protect cluster health.
* We are working closely with our cloud provider to gather additional information on the event in order to improve the predictability of future events.

At Frontegg, we take any downtime incident very seriously. We understand that Frontegg is an essential service, and when we are down, our customers are down. To prevent further incidents, Frontegg is focusing all efforts on a zero-downtime delivery model. We apologize for any issues caused by this incident.
Status: Postmortem
Impact: Major | Started At: June 3, 2024, 12:13 p.m.
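The postmortem above cites one concrete alert condition: average refresh-token latency above 750 ms. As a rough, self-contained sketch of that kind of check (Frontegg's actual monitoring stack is not described here; the rolling window size, sample source, and alert hook below are assumptions for illustration only), a rolling-average latency monitor could look like this:

```python
"""Illustrative sketch only: a rolling-average latency check similar in spirit
to the alert described in the postmortem (refresh-token latency averaging above
750 ms). Window size, sampling, and alerting are assumptions, not Frontegg's
actual monitoring configuration."""

from collections import deque


class LatencyMonitor:
    """Tracks recent request latencies and flags when the rolling average
    crosses a threshold (750 ms, per the incident report)."""

    def __init__(self, threshold_ms: float = 750.0, window: int = 1000):
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)  # most recent latency samples (ms)

    def record(self, latency_ms: float) -> bool:
        """Record one request latency; return True if the alert condition holds."""
        self.samples.append(latency_ms)
        average = sum(self.samples) / len(self.samples)
        return average > self.threshold_ms


# Example: feed latencies from a (hypothetical) metrics stream.
monitor = LatencyMonitor()
for latency in (200.0, 900.0, 1200.0, 1500.0):
    if monitor.record(latency):
        print("ALERT: average refresh-token latency exceeds 750 ms")
```

In practice a check like this would run inside the metrics pipeline rather than in application code, but it shows how a single threshold on a rolling average can surface the kind of latency degradation described in the incident.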
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: May 28, 2024, 1:27 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: None | Started At: May 28, 2024, 10:33 a.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.