Get notified about any outages, downtime or incidents for SecureAuth and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for SecureAuth.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for SecureAuth:
Component | Status |
---|---|
Operator Alerts | Active |
CIAM | Active |
Application | Active |
AU Region | Active |
EU Region | Active |
Push | Active |
SMS | Active |
US Region | Active |
Voice | Active |
Passwordless | Active |
Application | Active |
Device Trust | Active |
Mobile App | Active |
Push | Active |
SMS | Active |
Voice | Active |
SaaS/Full Cloud Components | Active |
SaaS/Full Cloud Identity Platform | Active |
SecureAuth Connector | Active |
SecureAuth Cloud Services | Active |
3rd Party Mobile Carrier | Active |
Enhanced Geolocation Resolution Service - US1 | Active |
Enhanced Geolocation Resolution Service - US2 | Active |
Fraud Service - US1 | Active |
Fraud Service - US2 | Active |
Geolocation Resolution Service - US1 | Active |
Geolocation Resolution Service - US2 | Active |
Nexmo Voice API | Active |
Push-to-Accept Service - US1 | Active |
Push-to-Accept Service - US2 | Active |
SMS Service - US1 | Active |
SMS Service - US2 | Active |
Telephony Extension/DTMF Service - US1 | Active |
Telephony Extension/DTMF Service - US2 | Active |
Telephony Provider SMS API | Active |
Telephony Service - US1 | Active |
Telephony Service - US2 | Active |
Threat Service - US1 | Active |
Threat Service - US2 | Active |
X.509 Certificate Service (SHA2) - US1 | Active |
X.509 Certificate Service (SHA2) - US2 | Active |
SecureAuth IdP Frontend Services | Active |
SecureAuth Application Templates | Active |
SecureAuth Web Admin | Active |
SecureAuth Polaris Services | Active |
FIDO Service | Active |
Mobile Services | Active |
Polaris Base Infrastructure | Active |
SaaS IdP Broker | Active |
SecureAuth Titan Services | Active |
Dashboard Service | Active |
Device Enrollment Service | Active |
Fraud Service | Active |
IP Blocking Service | Active |
IP Intel Service | Active |
IP Reporting Service | Active |
Licensing Service | Active |
Link-to-Accept Service | Active |
OAuth Service | Active |
Push-to-Accept Service | Active |
SMS Service | Active |
Symbol-to-Accept Service | Active |
Telephony Service | Active |
Titan Proxy Services (SA IdP 9.3 and older only) | Active |
Transaction Service | Active |
User Risk Scoring Service | Active |
User Risk Service | Active |
User Stats Service | Active |
Workforce | Active |
Certificate Enrollment | Active |
Cloud IdP | Active |
Dashboard | Active |
FIDO WebAuthn | Active |
Kerberos Authentication | Active |
Link-to-Accept Service | Active |
Mobile App | Active |
Push | Active |
SMS | Active |
Voice | Active |
View the latest incidents for SecureAuth and check for official updates:
Description: **RCA – EKS Outage - 09022023**

**Leadership Response:** We apologize for the inconvenience and the difficulty your teams faced as you gave up your time with friends and family to communicate and resolve this incident with your internal users and customers. This is not an event we take lightly, and we had all hands on deck to resolve the issue as quickly and efficiently as possible. Your experience with SecureAuth is very important to us. We value our partnership and the trust you continually put into our solution to protect your teams and your customers. We will continue to strive for excellence and make any changes necessary to deliver the stability and security that you require to be successful.

**Incident Summary:** During a planned maintenance window between 06:00 and 12:00 UTC on September 2, 2023, a majority of SA IdP tenants on the EKS cluster failed, resulting in an outage. All customers (cloud deployments and hybrid deployments that use cloud services) were down or degraded during the incident.

As a routine update of backend services was being performed, a networking component plugin failed its update, which caused the Vault service, which stores application keys, to fail. Since most production pods rely on the Vault service to obtain secrets, none of the production pods could come online. To resolve this, we reinstalled the networking component successfully, which allowed Vault to communicate with the rest of the system.

Once this issue was resolved, there was an influx of backlogged communication to the backend database as all the production pods came back online. This overloaded the database connection pool, causing additional bandwidth issues that impacted response times. To accelerate recovery of the entire environment, we temporarily reduced the number of active pods, which allowed the system to process the backlog. At approximately 15:50 UTC, all services were restored, and a postmortem of the incident began.

**Root Cause:**

* HashiCorp Vault failed to start after the EKS cluster update. This required manual intervention for the VPC-CNI and CoreDNS add-ons. Many other cloud services depend on Vault in order to start.
* Once Vault was operational, thousands of pods attempted to come online at once, and many of them needed to connect to one or more databases. The database servers became overwhelmed, preventing all services from coming back online.
* A bug was discovered in the AMI used for production EKS worker nodes that causes auto-scaling of deployment replicas to grow to maximum capacity. This bug over-reports how much CPU each pod is using. This, in turn, generated about three times the normal pod count, further complicating recovery.

**Resolution:**

* Updated the VPC-CNI
* Restarted CoreDNS and Vault
* Throttled replication events to prevent overload on the databases
* Deployed a workaround to the auto-scale bug to prevent overruns on connections

**Corrective Actions:**

* Instrumenting the upgrade process to detect CNI and Vault failures
* Deploying future upgrades in isolated pod clusters to reduce impact to customers
* Deploying a full resolution for the auto-scale bug
* Completing any outstanding EKS upgrade tasks
* Implementing a new communication protocol to inform customers of future incidents in a more timely and comprehensive manner
Status: Postmortem
Impact: Critical | Started At: Sept. 2, 2023, 11:50 a.m.
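
The recovery pattern described in the RCA above (many pods reconnecting to the databases at once, then being throttled) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not SecureAuth's actual mechanism: the names `start_pod`, `recover`, and `simulate_recovery`, the pod count, and the connection limit are all hypothetical. It shows how a semaphore caps concurrent database connections while random jitter spreads out startup attempts.

```python
import asyncio
import random


async def start_pod(db_slots, stats, rng):
    """Simulate one pod starting up and opening a database connection."""
    # Hypothetical jitter: spread attempts out instead of firing them all at once.
    await asyncio.sleep(rng.uniform(0, 0.01))
    # Wait for a free connection slot; this is the throttle.
    async with db_slots:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.001)  # stand-in for the database handshake
        stats["active"] -= 1
    stats["started"] += 1


async def recover(pod_count, max_concurrent, seed=0):
    """Start pod_count simulated pods with at most max_concurrent connecting at once."""
    rng = random.Random(seed)
    db_slots = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak": 0, "started": 0}
    await asyncio.gather(
        *(start_pod(db_slots, stats, rng) for _ in range(pod_count))
    )
    return stats


def simulate_recovery(pod_count=200, max_concurrent=10):
    """Synchronous wrapper; the default values are illustrative only."""
    return asyncio.run(recover(pod_count, max_concurrent))


if __name__ == "__main__":
    print(simulate_recovery())
```

In this sketch every simulated pod eventually starts, but the number of simultaneous connections never exceeds `max_concurrent`, which is the general idea behind reducing active pods so that a backlog can drain.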
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: July 3, 2023, 9:09 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: June 30, 2023, 2:42 p.m.