Outage and incident data over the last 30 days for ServiceChannel.
OutLogger tracks the status of these components for ServiceChannel:
Component | Status |
---|---|
WorkForce | Active |
Analytics | Active |
Analytics Dashboard | Active |
Analytics Download | Active |
Data Direct | Active |
API | Active |
API Response | Active |
Authentication | Active |
Budget Insights | Active |
SendXML | Active |
SFTP | Active |
Universal Connector | Active |
Mobile Applications | Active |
SC Mobile | Active |
SC Provider | Active |
Provider Automation | Active |
Fixxbook | Active |
Invoice Manager | Active |
IVR | Active |
Login | Active |
Proposal Manager | Active |
Work Order Manager | Active |
Service Automation | Active |
Asset Manager | Active |
Compliance Manager | Active |
Dashboard | Active |
Inventory Manager | Active |
Invoice Manager | Active |
Locations List | Active |
Login | Active |
Maps | Active |
Project Tracker | Active |
Proposal Manager | Active |
Supply Manager | Active |
Weather | Active |
Work Order Manager | Active |
Service Center | Active |
Email - servicechannel.com | Active |
Email - servicechannel.net | Active |
Phone - Inbound | Active |
Phone - Outbound | Active |
Third Party Components | Active |
Avalara Tax Calculation Service | Active |
Rackspace - Inbound Email | Active |
Twilio REST API | Active |
Zendesk | Active |
View the latest incidents for ServiceChannel and check for official updates:
Description: **Cloud Provider Network Outage - Incident Report**

**Date of Incident:** 01/25/2023
**Time/Date Incident Started:** 01/25/2023, 3:30 am EST
**Time/Date Stability Restored:** 01/25/2023, 6:30 am EST
**Time/Date Incident Resolved:** 01/25/2023, 6:30 am EST
**Users Impacted:** All
**Frequency:** Intermittent
**Impact:** Major

**Incident description:** Azure networking errors across multiple regions.

**Summary of Impact:** Between 07:05 UTC and 09:45 UTC on 25 January 2023, customers experienced networking connectivity issues, manifesting as network latency and/or timeouts when attempting to connect to Azure resources in public Azure regions, as well as to other Microsoft services including M365 and PowerBI.

**Preliminary Root Cause Analysis:** Microsoft Azure, our primary cloud service provider, determined that a change made to the Microsoft Wide Area Network (WAN) impacted connectivity between customers on the internet and Azure, connectivity between services within regions, and ExpressRoute connections. This was a global outage for all Microsoft Azure customers. The ServiceChannel SRE team determined that users outside North America encountered issues reaching our EU-hosted applications, while users located in the EU encountered issues connecting to the US-hosted applications. After our cloud provider rolled back the WAN changes, network access between regions was restored for all ServiceChannel users.

**Actions Taken:**
1. SRE team investigated triggered platform alerts for our European datacenter.
2. Reviewed the status page for our hosting partner.

**Mitigation Measures:** Our cloud provider identified a recent change to the WAN as the underlying cause and has rolled back that change. They have also offered the following mitigations to prevent recurrence:
1. Blocking highly impactful commands from being executed on network devices (Completed)
2. Requiring that all command execution on network devices follow safe change guidelines (Estimated completion: February 2023)

Cloud Provider RCA (requires Microsoft Azure account): [https://app.azure.com/h/VSG1-B90/05a585](https://app.azure.com/h/VSG1-B90/05a585)
Status: Postmortem
Impact: Major | Started At: Jan. 25, 2023, 9:25 a.m.
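For readers who want a concrete picture of the kind of cross-region check involved in an incident like the one above, here is a minimal reachability-probe sketch. It is illustrative only: the endpoint URLs, health paths, and timeout are placeholders, not ServiceChannel's actual hosts or monitoring configuration.

```python
# Hypothetical cross-region reachability probe; endpoint URLs are placeholders,
# not ServiceChannel's real hostnames.
import time
import urllib.error
import urllib.request

REGION_ENDPOINTS = {
    "us": "https://us.example-app.invalid/health",  # placeholder
    "eu": "https://eu.example-app.invalid/health",  # placeholder
}

TIMEOUT_SECONDS = 5


def probe(url: str) -> dict:
    """Issue a single GET and record latency or the failure reason."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            return {"ok": resp.status == 200, "latency_s": time.monotonic() - start}
    except (urllib.error.URLError, TimeoutError) as exc:
        return {"ok": False, "latency_s": time.monotonic() - start, "error": str(exc)}


def check_all_regions() -> None:
    """Print a one-line verdict per region, e.g. from a scheduled job."""
    for region, url in REGION_ENDPOINTS.items():
        result = probe(url)
        status = "OK" if result["ok"] else f"FAIL ({result.get('error', 'non-200')})"
        print(f"{region}: {status} in {result['latency_s']:.2f}s")


if __name__ == "__main__":
    check_all_regions()
```

Run from vantage points in each region, a probe like this would surface the latency and timeout pattern described in the summary of impact above.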
Description: **Execute or Insert permission was denied against DB objects errors - Incident Report**

**Date of Incident:** 01/09/2023
**Time/Date Incident Started:** 01/09/2023, 10:19 am EDT
**Time/Date Stability Restored:** 01/09/2023, 12:30 pm EDT
**Time/Date Incident Resolved:** 01/09/2023, 12:30 pm EDT
**Users Impacted:** Few
**Frequency:** Intermittent
**Impact:** Major

**Incident description:** A small number of users encountered “Execute permission was denied against Database objects” or random time-out errors.

**Root Cause Analysis:** The DBA (Database Administration) and SRE (Site Reliability Engineering) teams responded to reports of random errors and timeout issues submitted to the ServiceChannel support teams. While conducting a deep dive into application logs, the SRE team identified a pattern of errors all being generated against a single instance of the serviceclick pool. Furthermore, the application logs showed that this instance came online exactly when the errors started registering. The new instance had been added automatically by previously defined scale-out rules, which take into account the existing demands on the system and remove instances when they are no longer required. The SRE team pulled the logs for the bad instance, opened a cloud provider support case, and shortly after manually removed the unhealthy instance.

**Actions Taken:**
1. Database team attempted to resolve the execute permission errors by granting the required permissions on the tables.
2. SRE team reviewed logs and found a specific instance that was generating all of the errors.
3. SRE team pulled logs for the unhealthy node and opened a support case with our cloud provider to assist with the investigation.
4. SRE team stopped the unhealthy instance via the cloud provider's REST API.
5. SRE team engaged the engineering team to perform a deep dive on logging, health checks, database configuration, and credentials storage.

**Mitigation Measures:**
1. Added alerts that fire on insert permission errors and name the specific instance.
2. The engineering team will add more logging timestamps to ensure proper timestamps are tied to the application.
3. The engineering team will review web application instance health checks to ensure they are working as intended.
4. Cloud provider support will help explain why a single instance exhibited behavior different from all other nodes in the pool.
Status: Postmortem
Impact: Minor | Started At: Jan. 9, 2023, 2:55 p.m.
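The root cause analysis above hinges on spotting that the errors all traced back to a single pool instance. Below is a minimal sketch of that kind of log aggregation; the log record shape, error marker, and dominance threshold are assumptions for illustration, not ServiceChannel's actual logging schema or alerting rules.

```python
# Illustrative only: record fields, marker string, and threshold are assumed.
from collections import Counter

PERMISSION_ERROR_MARKER = "Execute permission was denied"
DOMINANCE_THRESHOLD = 0.8  # flag an instance producing >80% of matching errors


def find_suspect_instance(log_records: list[dict]) -> str | None:
    """Return the instance ID behind most permission errors, if one dominates."""
    errors_by_instance = Counter(
        rec["instance_id"]
        for rec in log_records
        if PERMISSION_ERROR_MARKER in rec.get("message", "")
    )
    total = sum(errors_by_instance.values())
    if total == 0:
        return None
    instance, count = errors_by_instance.most_common(1)[0]
    return instance if count / total >= DOMINANCE_THRESHOLD else None


if __name__ == "__main__":
    sample = [
        {"instance_id": "pool-node-7", "message": "Execute permission was denied against DB objects"},
        {"instance_id": "pool-node-7", "message": "Execute permission was denied against DB objects"},
        {"instance_id": "pool-node-2", "message": "request completed"},
    ]
    print(find_suspect_instance(sample))  # -> pool-node-7
```

An alert built on this kind of grouping is what mitigation item 1 above describes: it fires on the permission errors and names the offending instance.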
Description: **Database VM (Virtual Machine) Failures - Incident Report**

**Date of Incident:** 01/06/2023
**Time/Date Incident Started:** 01/06/2023, 10:19 am EDT
**Time/Date Stability Restored:** 01/06/2023, 12:30 pm EDT
**Time/Date Incident Resolved:** 01/06/2023, 12:30 pm EDT
**Users Impacted:** Few
**Frequency:** Intermittent
**Impact:** Major

**Incident description:** Automated alerting for database virtual machines triggered suddenly, which led to failed health checks and VMs being marked as unhealthy. This degradation resulted in performance issues for ServiceChannel platform users.

**Root Cause Analysis:** Early in the troubleshooting process, the SRE (Site Reliability Engineering) and DBA (Database Administration) teams identified that one of the VM instances had suffered a loss of network connectivity, which resulted in the instance being marked as unhealthy. The SRE team proceeded with redeploying this VM, which served as one of the replica virtual machines for the database cluster. Fifteen minutes into the redeploy, the SRE team determined that a second replica server had registered as unhealthy and decided to redeploy that virtual machine as well. The redeploy process involves migrating the virtual machines onto new host hardware. Once the redeployment was completed, the DBA team ensured the replica servers were fully in sync and that load was balanced properly between the servers.

**Actions Taken:**
1. Investigated triggered alerts and identified degraded virtual machines.
2. SRE team triggered a VM redeploy of both replica database servers onto new underlying hardware.

**Mitigation Measures:**
1. SRE team opened an Azure support case for additional assistance with investigating the root cause of the virtual machine failures.
2. SRE and DBA teams have started efforts to enhance high availability and disaster recovery for existing and future database server implementations.
Status: Postmortem
Impact: Major | Started At: Jan. 6, 2023, 4:11 p.m.
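The incident above turns on health checks marking database VMs unhealthy. As a rough illustration of how consecutive-failure health checking typically works, here is a small sketch; the probe callable, threshold, and return values are assumed for the example and do not reflect ServiceChannel's actual monitoring.

```python
# Simplified consecutive-failure health check; thresholds are illustrative.
from typing import Callable

FAILURE_THRESHOLD = 3  # consecutive failures before a node is marked unhealthy


def evaluate_node(probe: Callable[[], bool], max_checks: int = 10) -> str:
    """Run up to max_checks probes; report 'unhealthy' after 3 straight failures."""
    consecutive_failures = 0
    for _ in range(max_checks):
        if probe():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURE_THRESHOLD:
                return "unhealthy"
    return "healthy"


if __name__ == "__main__":
    # A probe that always fails stands in for a replica that lost network connectivity.
    print(evaluate_node(lambda: False))  # -> unhealthy
```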
Description: The provider has indicated that resolution steps have been implemented and we expect that all Contact Center services are operational. We consider this issue resolved.
Status: Resolved
Impact: Minor | Started At: Jan. 4, 2023, 7:48 p.m.
Description: **Database Replication Latency and Messaging Queue Bug Resulting in Delays Processing System Messages**

**Date of Incident:** 12/23/2022
**Time/Date Incident Started:** 12/23/2022, 02:27 pm EST
**Time/Date Stability Restored:** 12/24/2022, 03:20 am EST
**Time/Date Incident Resolved:** 12/24/2022, 10:15 am EST
**Users Impacted:** Some clients
**Frequency:** Intermittent
**Impact:** Major

**Incident description:**
1. Database replication latency resulted in event transactions where some messages were blocked, which led to processing delays.
2. A small number of bad messages that exceeded body size limits were unable to properly age out of the system, and the retry process created duplicates of these bad messages.

**Root Cause Analysis:** The SRE team, along with the DBA team, had just completed responding to a production database issue that resulted in data replication latency. This type of event typically does not cause a production outage, as replication quickly catches up once the replication servers are in a healthy state. One side effect of replication latency is a backlog of system event messages, which triggered internal monitors for queue message thresholds. This issue typically resolves itself once the backlog of messages is processed by the system. The SRE team observed that messages continued to remain unprocessed and restarted the worker services responsible for processing the system event messages. When this did not resolve the issue, the SRE team reviewed logs and engaged our software engineering teams. After conducting a joint deep dive, we were able to confirm that new messages arriving in the queue were being processed successfully. However, our software engineers identified a previously undiscovered bug on the emitter side of the events system: if a “WorkOrderCreated” event had a body size larger than 256KB, the message was rejected, which crashed the queueing service and left these specific events marked as not processed; from that point the emitter would start to create duplicate events that were also not processed, causing a loop effect. By early morning, the teams were able to identify and mark the affected messages for deletion, which allowed the duplicate messages to slowly age out of the system.

**Actions Taken:**
1. Monitored the FIFO queue and restarted notification services.
2. Increased instance counts for the Windows services responsible for the HttpEndpointNotificationHandler.
3. Restarted application servers.
4. Monitored logs and confirmed that new system event messages were being processed.
5. Created test messages to confirm statuses were being updated properly.
6. Identified and deleted messages that exceeded body size limits.
7. Increased workers to process outstanding events.
8. Identified messages that exceeded the 256KB body size limit.
9. Marked duplicate messages for deletion.
10. Disabled retry attempts for duplicate messages.

**Mitigation Measures:**
1. The engineering team identified a bug with the message size limitation and will add proper validation for the message size limit.
2. The engineering team will improve worker agent scaling to handle increased message loads.
Status: Postmortem
Impact: Minor | Started At: Dec. 30, 2022, 3:07 a.m.
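The mitigation above calls for proper validation of the message size limit on the emitter side. A minimal sketch of what such a pre-enqueue check could look like is below; the event shape, queue client, and helper names are hypothetical, while the 256KB figure comes from the incident report.

```python
# Sketch of emitter-side size validation; the event shape and queue client are
# stand-ins, not ServiceChannel's actual messaging code.
import json

MAX_BODY_BYTES = 256 * 1024  # 256 KB queue body limit described in the report


class MessageTooLargeError(ValueError):
    """Raised instead of handing an oversized event to the queue."""


def publish_event(queue_client, event: dict) -> None:
    """Serialize the event and reject it up front if it exceeds the body limit."""
    body = json.dumps(event).encode("utf-8")
    if len(body) > MAX_BODY_BYTES:
        # Rejecting here avoids the crash/retry loop that duplicated bad messages.
        raise MessageTooLargeError(
            f"event body is {len(body)} bytes, limit is {MAX_BODY_BYTES}"
        )
    queue_client.send(body)


if __name__ == "__main__":
    class _FakeQueue:
        def send(self, body: bytes) -> None:
            print(f"enqueued {len(body)} bytes")

    publish_event(_FakeQueue(), {"type": "WorkOrderCreated", "payload": "x" * 100})
```

Validating before publish keeps oversized events out of the queue entirely, so the downstream service never hits the rejection path that triggered the duplicate-message loop.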