Outage and incident data over the last 30 days for SimpliGov.
Outlogger tracks the status of these components for SimpliGov:
Component | Status
---|---
SendGrid API v3 | Active
**Preproduction** | Active
&nbsp;&nbsp;API | Active
&nbsp;&nbsp;Authorization | Active
&nbsp;&nbsp;Email Interaction | Active
&nbsp;&nbsp;eSignature | Active
&nbsp;&nbsp;Export | Active
&nbsp;&nbsp;File Conversion | Active
&nbsp;&nbsp;Metaquery | Active
&nbsp;&nbsp;Portal | Active
&nbsp;&nbsp;SimpliSign | Active
&nbsp;&nbsp;Submission | Active
**Production** | Active
&nbsp;&nbsp;API | Active
&nbsp;&nbsp;Authorization | Active
&nbsp;&nbsp;Email Interaction | Active
&nbsp;&nbsp;eSignature | Active
&nbsp;&nbsp;Export | Active
&nbsp;&nbsp;File Conversion | Active
&nbsp;&nbsp;Metaquery | Active
&nbsp;&nbsp;Portal | Active
&nbsp;&nbsp;SimpliSign | Active
&nbsp;&nbsp;Submission | Active
**Staging** | Active
&nbsp;&nbsp;API | Active
&nbsp;&nbsp;Authorization | Active
&nbsp;&nbsp;Email Interaction | Active
&nbsp;&nbsp;eSignature | Active
&nbsp;&nbsp;Export | Active
&nbsp;&nbsp;File Conversion | Active
&nbsp;&nbsp;Metaquery | Active
&nbsp;&nbsp;Portal | Active
&nbsp;&nbsp;SimpliSign | Active
&nbsp;&nbsp;Submission | Active
**Training** | Active
&nbsp;&nbsp;API | Active
&nbsp;&nbsp;Authorization | Active
&nbsp;&nbsp;Email Interaction | Active
&nbsp;&nbsp;eSignature | Active
&nbsp;&nbsp;Export | Active
&nbsp;&nbsp;File Conversion | Active
&nbsp;&nbsp;Metaquery | Active
&nbsp;&nbsp;Portal | Active
&nbsp;&nbsp;SimpliSign | Active
&nbsp;&nbsp;Submission | Active
View the latest incidents for SimpliGov and check for official updates:
Description: **Preliminary Root Cause:** SimpliGov uses Azure Service Bus as its message queuing system for operations such as submissions and dashboard updates. Azure Government support informed SimpliGov that they performed updates to Azure Service Bus infrastructure, and that between 9:00 and 10:00 PST on 21-10-2021 SimpliGov was identified as a customer using Service Bus in USGov Arizona that experienced increased error rates. SimpliGov received "Connection reset by peer" error messages from the Azure Service Bus service, which resulted in partial connection loss to the service. Azure support states that the primary cause of the issue on their side was that a subset of backend instances experienced unexpectedly high utilization due to a platform upgrade in US Gov Arizona. The Azure Government product engineering group allocated more bandwidth to the host nodes, which brought the instances back to a healthy state. From the SimpliGov side, this mitigating action allowed our services to connect to Azure Service Bus as expected. During this period, most intake submissions should have worked as expected, but some users would have experienced slower than expected dashboard synchronization times and potential duplicates, depending on their workflow configuration, as submissions were retried when initial connections to Service Bus failed.

**Mitigation:** In the immediate term, SimpliGov support employed manual dashboard synchronization processes to ensure that any incoming records were reflected as soon as possible. Azure support resolved the Service Bus issue on their end, which allowed SimpliGov processing and dashboard synchronization services to function as expected. Going forward, SimpliGov has implemented several new features within our upcoming Dec 2021/Jan 2022 Production release allowing for better failover and fault tolerance in such scenarios.
SimpliGov will be hosted on Azure Kubernetes Service instead of Azure Service Fabric, a split Service Bus queue architecture will be enabled, and additional retry policies have been added to handle scenarios where Azure services return transient errors. These three items, in addition to smaller individual fixes and improvements, should reduce the propensity for such incidents going forward.

**Next Steps:** We apologize for the impact to affected customers. SimpliGov will continue to monitor the situation with Azure Service Bus and configure specific alerts relating to Service Bus and "Connection reset by peer" messages. We will be deploying additional updates and architectural changes as part of our upcoming production release scheduled for Jan 2022 to further improve fault tolerance. All customer records processed throughout the incident period should be consistent with their expected statuses. Customers do not need to take any reconciliatory actions in their production tenants unless directly notified to do so by the SimpliGov team.

---

_Additional RCA from Microsoft Azure Government support:_

**Incident Summary:** Between 10/21 17:07 UTC and 10/21 19:14 UTC, Event Hub, Service Bus, and Relay service customers may have intermittently experienced increased latency or timeouts on runtime operations.

**Root Cause:** On 10/21/21 at 17:07 UTC, an upgrade was performed on clusters servicing Service Bus, Event Hub, and Relay services. During the upgrade, as TCP connections were disconnected from each gateway machine, there was a noticeable increase in TCP connection requests as existing TCP connections were terminated, resulting in high CPU utilization by the LSASS process that handles TLS/SSL handshakes.
As the LSASS process hit the CPU threshold set by the service teams, processing of TLS/SSL handshakes was slowed, causing increased latency and delay in serving send and receive requests made by client applications. The issue was mitigated on 10/21/21 at 19:14 UTC after the team scaled out the cluster and added machines to allocate more resources to process incoming requests.

**Next Steps:** Microsoft apologizes for the impact to affected customers. We have scaled out our gateway machines to increase capacity and are taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

* Enhancements to monitoring related to LSASS CPU utilization
* Performance improvements to the throttle/delay response related to invalid tokens/SAS
* Adjustment of diagnostic steps to include tracing namespace/audience when clients present invalid tokens/SAS
Status: Postmortem
Impact: Minor | Started At: Oct. 21, 2021, 7:55 p.m.
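The "additional retry policies" for transient Service Bus errors mentioned in the postmortem above can be sketched as follows. This is a minimal, hypothetical example of retrying an operation that raises "Connection reset by peer" errors with exponential backoff; `send_with_retry` and its parameters are illustrative, not SimpliGov's actual implementation:

```python
import time

def send_with_retry(send, max_attempts=4, base_delay=0.1):
    """Call send(); on a transient connection reset, back off and retry.

    A transient "Connection reset by peer" surfaces in Python as
    ConnectionResetError. Delays grow exponentially: base, 2x, 4x, ...
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectionResetError:
            if attempt == max_attempts:
                raise  # error persisted; surface it to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))  # back off, retry
```

A policy like this lets intake submissions eventually succeed during a brief broker disruption, at the cost of possible duplicates if the first attempt was actually accepted — which matches the duplicate behavior the postmortem describes.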
Description: **Preliminary Root Cause:** As a result of an upgrade activity initiated by the Azure Government team, Service Fabric clusters used to provide the production services failed to retrieve node statuses from the underlying virtual machine scale sets. The upgrade pushed by the Azure Government team was expected to apply the latest fabric updates to the cluster on each node sequentially, applying the update on a single seed node per fault domain within the Azure Service Fabric cluster. In cases of upgrade failure or node failure, Azure Service Fabric is expected to move services from affected nodes to active or "live" nodes. After applying the upgrade, a seed node required for the Service Fabric clusters failed to report its status to the Service Fabric management services, and the Service Fabric cluster could not service any requests for SimpliGov's Portal, Authentication, API, MetaQuery Sync, and Submission services from this node. SimpliGov customers identified the downtime event when they tried to access the SimpliGov portal via web browser and received a 502/504 status code page, indicating that the Azure Application Gateway used for load balancing requests to SimpliGov Service Fabric clusters was not receiving appropriate health status responses to allow it to service requests to customers.

**Mitigation:** Most services self-healed once the Service Fabric failover manager failed over services from the affected node to an active node. However, some dashboard synchronization services were affected after the healing event. In the immediate term, SimpliGov restarted the affected node containing dashboard synchronization services in the production virtual machine scale set, triggering move procedures for services originally hosted on the affected node. At 12:30 PM PST, the move activity completed and normal service was restored.
In addition to working to restore services as soon as possible, SimpliGov contacted the Azure Government support team to request assistance with the issue and a full root cause analysis on why the upgrades failed to apply correctly through each fault domain as expected. Information was also requested on how the upgrade caused all nodes in the underlying virtual machine scale sets to fail to report appropriate statuses to the Service Fabric management services.

**Next Steps:** We apologize for the impact to affected customers. Azure Government will provide additional details on why the upgrade process did not apply in the expected manner, why it caused the nodes to fail to report to the Service Fabric management service, and why the node failure did not trigger the expected move processes without manual interaction from SimpliGov. All customer records processed throughout the incident period should be consistent with their expected statuses. Customers do not need to take any reconciliatory actions in their production tenants unless directly notified to do so by the SimpliGov team.
Status: Postmortem
Impact: Minor | Started At: Oct. 20, 2021, 7:32 p.m.
Description: **Executive Summary:** Between 5:19 AM PST and 7:54 AM PST, SimpliGov preproduction was unavailable, causing users to receive pages stating "502" errors. As a result of this outage, users of affected customer websites were unable to submit new forms or work with existing forms in the preproduction environment. The cause of this partial outage was that periodic cluster updates failed to apply correctly and automated rollback processes on the cluster stalled. At 7:54 AM PST, normal service resumed for the preproduction environment. After working with our hosting provider (Azure Government), it was determined that the best course of action was to redeploy the cluster and the services running on it. As a result of redeploying the cluster, the IP address for preproduction changed from 52.244.80.86 to 52.244.79.177. Customers should replace the old IP address (52.244.80.86) with the new IP address (52.244.79.177) if they are whitelisting communication from SimpliGov preproduction to internal systems. We apologize for the impact to affected customers. A more detailed summary of events can be seen below.

**Detailed Summary:** Preliminary Root Cause: As a result of a long-running upgrade activity on our preproduction Service Fabric cluster, Service Fabric initiated automatic rollback procedures to revert the updates applied and move back to the last successfully applied version. The rollback activity ran for much longer than expected and was deemed "stalled". In this scenario, the Service Fabric cluster could not service any requests for SimpliGov's Portal, Authentication, API, MetaQuery Sync, and Submission services and remained in a stalled status.
SimpliGov users trying to access the preproduction environment via web browser received 502 status code pages, indicating that the Azure Application Gateway used for load balancing requests to SimpliGov Service Fabric clusters was not receiving appropriate health status responses to allow it to service requests to customers. Likewise, API calls to preproduction received a 502 status code.

Mitigation: After discussing available options with the Azure Government support team, SimpliGov redeployed the affected cluster and subsequently redeployed all affected preproduction services. In addition to working to restore services as soon as possible, the Azure Government support team is working to identify why the upgrade and rollback processes failed. As a result of this mitigation strategy, customers should update any IP whitelists they maintain for non-production environments by replacing preproduction's old IP address (52.244.80.86) with the new preproduction IP address (52.244.79.177).

Next Steps: We apologize for the impact to affected customers. Azure Government will provide additional details on why the upgrade and rollback processes did not work as expected, causing preproduction to become unavailable. Note that as this event occurred on preproduction with submission, API, and portal services unavailable, all customer records processed throughout the downtime event should be consistent with their expected statuses. Customers do not need to take any reconciliatory actions in their preproduction tenants.
Status: Postmortem
Impact: Major | Started At: Aug. 5, 2021, 12:19 p.m.
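For customers who track the preproduction allow-list in code or configuration, the IP change described above is a one-for-one replacement. A minimal sketch — the list and the helper name are illustrative; apply the same substitution in whatever firewall or proxy configuration you actually maintain:

```python
# IP addresses from the incident notice above.
OLD_IP = "52.244.80.86"   # retired preproduction address
NEW_IP = "52.244.79.177"  # current preproduction address

def update_allow_list(allow_list):
    """Return a copy of the allow-list with the old preproduction IP
    replaced by the new one; all other entries are left untouched."""
    return [NEW_IP if ip == OLD_IP else ip for ip in allow_list]
```

For example, `update_allow_list(["52.244.80.86", "10.0.0.1"])` yields `["52.244.79.177", "10.0.0.1"]`.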
Description: **Preliminary Root Cause:** As a result of an upgrade activity initiated by the Azure Government team, Service Fabric clusters used to provide the production services failed to retrieve node statuses from the underlying virtual machine scale sets. The upgrade pushed by the Azure Government team was expected to apply the latest fabric updates to the cluster on each node sequentially, applying the update on a single seed node per fault domain within the Azure Service Fabric cluster. In cases of upgrade failure or node failure, Azure Service Fabric is expected to move services from affected nodes to active or "live" nodes. After applying the upgrade, a seed node required for the Service Fabric clusters failed to report its status to the Service Fabric management services, and the Service Fabric cluster could not service any requests for SimpliGov's Portal, Authentication, API, MetaQuery Sync, and Submission services from this node. SimpliGov customers identified the downtime event when they tried to access the SimpliGov portal via web browser and received a 502 status code page, indicating that the Azure Application Gateway used for load balancing requests to SimpliGov Service Fabric clusters was not receiving appropriate health status responses to allow it to service requests to customers.

**Mitigation:** In the immediate term, SimpliGov restarted the affected node in the production virtual machine scale set, triggering move procedures for services originally hosted on the affected node. At 8:44 AM PST, the move activity completed and normal service was restored. In addition to working to restore services as soon as possible, SimpliGov contacted the Azure Government support team to request assistance with the issue and a full root cause analysis on why the upgrades failed to apply correctly through each fault domain as expected.
Information was also requested on how the upgrade caused all nodes in the underlying virtual machine scale sets to fail to report appropriate statuses to the Service Fabric management services. Azure Government support also confirmed the successful application of upgrades after the actions taken by SimpliGov, and the movement of services from the affected node to other live nodes was confirmed.

**Next Steps:** We apologize for the impact to affected customers. Azure Government will provide additional details on why the upgrade process did not apply in the expected manner, why it caused the nodes to fail to report to the Service Fabric management service, and why the node failure did not trigger the expected move processes without manual interaction from SimpliGov. Note that as this event occurred on production with submission, API, and portal services unavailable, all customer records processed throughout the downtime event should be consistent with their expected statuses. Customers do not need to take any reconciliatory actions in their production tenants.
Status: Postmortem
Impact: Major | Started At: Aug. 4, 2021, 3:30 p.m.
Description: The incident has been resolved.
Status: Resolved
Impact: Minor | Started At: Aug. 3, 2021, 6:57 a.m.