Get notified about any outages, downtime, or incidents for ServiceChannel and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for ServiceChannel.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now

Outlogger tracks the status of these components for ServiceChannel:
Component | Status |
---|---|
WorkForce | Active |
Analytics | Active |
Analytics Dashboard | Active |
Analytics Download | Active |
Data Direct | Active |
API | Active |
API Response | Active |
Authentication | Active |
Budget Insights | Active |
SendXML | Active |
SFTP | Active |
Universal Connector | Active |
Mobile Applications | Active |
SC Mobile | Active |
SC Provider | Active |
Provider Automation | Active |
Fixxbook | Active |
Invoice Manager | Active |
IVR | Active |
Login | Active |
Proposal Manager | Active |
Work Order Manager | Active |
Service Automation | Active |
Asset Manager | Active |
Compliance Manager | Active |
Dashboard | Active |
Inventory Manager | Active |
Invoice Manager | Active |
Locations List | Active |
Login | Active |
Maps | Active |
Project Tracker | Active |
Proposal Manager | Active |
Supply Manager | Active |
Weather | Active |
Work Order Manager | Active |
Service Center | Active |
Email - servicechannel.com | Active |
Email - servicechannel.net | Active |
Phone - Inbound | Active |
Phone - Outbound | Active |
Third Party Components | Active |
Avalara Tax Calculation Service | Active |
Rackspace - Inbound Email | Active |
Twilio REST API | Active |
Zendesk | Active |
View the latest incidents for ServiceChannel and check for official updates:
Description: Please see the general postmortem at [https://status.servicechannel.com/incidents/cvp26brsbwl8](https://status.servicechannel.com/incidents/cvp26brsbwl8) for a comprehensive description of work to remediate platform performance issues.
Status: Postmortem
Impact: Critical | Started At: June 16, 2023, 1:21 p.m.
Description: Please see the general postmortem at [https://status.servicechannel.com/incidents/cvp26brsbwl8](https://status.servicechannel.com/incidents/cvp26brsbwl8) for a comprehensive description of work to remediate platform performance issues.
Status: Postmortem
Impact: Minor | Started At: June 12, 2023, 3:16 p.m.
Description: **Intermittent Performance Issues Under Normal Production Load - Incident Report**

**Date of Incident:** 06/06/2023 - 06/30/2023
**Time/Date Incident Started:** 06/06/2023, 06:37 am EDT
**Time/Date Stability Restored:** 06/28/2023, 10:00 pm EDT
**Time/Date Incident Resolved:** 06/30/2023, 12:30 pm EDT
**Users Impacted:** Many
**Frequency:** Intermittent
**Impact:** Major

**Incident Description:** Support highlighted a system slowdown and degraded performance, impacting the Dashboard, Work Order operations, and Invoice reports. Despite dedicated and consistent remediation efforts, the performance problems persisted over several weeks before we fully resolved them.

**Root Cause Analysis:** A series of interrelated issues, each typically manageable on its own, collectively led to significant performance degradation during periods of increased production load. We initially struggled to identify the root causes because the symptoms appeared around the same time as unrelated infrastructure changes. The key issues included:

* High numbers of Redis cache timeout events.
* SQL timeouts in the application.
* Multiple app server node failures requiring manual restarts.
* Overuse of API calls due to a faulty third-party integration.

**Redis Cache Timeouts:** We initially suspected the Redis cache timeouts were caused by an upgrade from Redis v4 to Redis v6. However, after the timeouts persisted following a reversion to Redis v4, we discarded this theory. We traced the timeouts to a combination of connection thread exhaustion and misconfigured Redis connection timeout values. The application lacked a fallback mechanism for Redis object retrieval, so cache timeouts caused failures instead of graceful data retrieval from the persistence layer (a sketch of this fallback pattern follows this incident entry).

**Application SQL Timeouts:** Unpredictable application behavior stemmed from intermittent periods of SQL timeouts on application server nodes. The distribution of these errors across all server nodes indicated the problem was not in application code. Our SRE and Application Engineering teams, working with our DBA team, traced the SQL timeout errors to long-running SQL queries on the database cluster.

**Application Node Failures:** During this period, an unusually high number of application nodes failed, marked by increased response duration, maximum CPU utilization, and high memory usage. The SRE team discovered the issue stemmed from the routing algorithm, which was set to "LeastConnections". This algorithm caused heavily loaded nodes to get locked into a high-load state, requiring manual intervention.

**Excessive API Calls:** A Service Provider reported an unusually large number of Work Order schedule changes in a Work Order assigned to their organization. We traced these changes, which triggered a nuisance cycle of Work Order Notes and Notifications, to a faulty Subscriber-built integration.

**Actions Taken:** During the investigation, our SRE and Application Engineering teams established a protocol for daily joint monitoring conferences. Key events are available in Appendix A. Key activities included:

1. SRE team monitoring logs for performance issue symptoms.
2. SRE team restarting web instances showing elevated response times.
3. DBA team investigating database anomalies.
4. Application Engineering team reviewing Redis configurations.
5. SRE team deactivating the faulty integration, modifying throttle limits, and engaging the responsible Subscriber.

**Mitigation Measures:**

1. Redis Cache Timeouts: The Application Engineering team has implemented a shorter timeout threshold and a fallback mechanism for Redis.
2. Redis Cache Timeouts: The Application Engineering and SRE teams are separating certain Redis application caches for better future performance.
3. Redis Cache Timeouts: The SRE team has scaled up production Redis cache cluster nodes.
4. Application SQL Timeouts: The team is systematically modifying stored procedures to improve concurrency through better transaction isolation, thereby eliminating read blocking and the resulting SQL timeouts.
5. Application SQL Timeouts: The Application Engineering team is implementing a systematic review of the transaction isolation levels used in stored procedures executed from code.
6. Application SQL Timeouts: When required, the DBA team will schedule queries that use Serializable transaction isolation during quiescent platform periods.
7. Application SQL Timeouts: The DBA team has identified several stored procedures for future optimization.
8. Application SQL Timeouts: The SRE team has implemented monitors to alert the DBA team about increases in SQL timeouts.
9. Application Node Failures: The SRE team adjusted application configurations for optimal load balancing by switching from "LeastConnections" to the "LeastResponseTime" algorithm, allowing nodes handling heavy tasks to finish before receiving additional work.
10. Application Node Failures: The SRE team added monitors to identify application nodes trending toward failure.
11. Application Node Failures: The Application Engineering and SRE teams are improving internal health checks for deployed applications.
12. Application Node Failures: The SRE team is developing functionality to automatically reboot failing application nodes.
13. Excessive API Calls: The SRE team disabled the faulty Subscriber integration, communicated the issue to the Subscriber, and tightened the API throttle limit for the impacting integration.
14. Excessive API Calls: The SRE team will monitor API usage trends more closely.
15. Excessive API Calls: The Architecture team will investigate alternative backpressure techniques for better platform scaling.
16. Excessive API Calls: Our teams are considering a formal process to evaluate and certify third-party integrations before implementation.
Status: Postmortem
Impact: Major | Started At: June 6, 2023, 10:37 a.m.
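For readers unfamiliar with the fallback pattern referenced in the postmortem above, the following is a minimal sketch of a cache read with a short timeout that degrades gracefully to the persistence layer. It assumes a Python application using the redis-py client; the host name, timeout values, key format, and the `load_from_database()` helper are illustrative placeholders, not ServiceChannel's actual implementation.

```python
# Sketch: cache read with a short timeout and a database fallback.
# Assumes redis-py; names and values below are hypothetical.
import json
import logging

import redis

# Short socket/connect timeouts so a struggling cache node fails fast
# instead of holding application threads.
cache = redis.Redis(
    host="cache.internal.example",  # hypothetical host
    socket_timeout=0.25,            # seconds; fail fast on slow reads
    socket_connect_timeout=0.25,
)


def load_from_database(work_order_id: int) -> dict:
    """Hypothetical persistence-layer lookup used when the cache is unavailable."""
    # ... query against the primary data store would go here ...
    return {"id": work_order_id, "status": "OPEN"}


def get_work_order(work_order_id: int) -> dict:
    key = f"work_order:{work_order_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except (redis.exceptions.TimeoutError, redis.exceptions.ConnectionError) as exc:
        # Degrade gracefully: log and fall through to the database
        # rather than failing the request.
        logging.warning("Redis unavailable (%s); falling back to database", exc)

    record = load_from_database(work_order_id)
    try:
        cache.set(key, json.dumps(record), ex=300)  # best-effort repopulation
    except redis.exceptions.RedisError:
        pass  # cache write failures must not break the request path
    return record
```

With this pattern, a cache slowdown or outage surfaces as extra database load and warning logs rather than failed requests, which is the intent of pairing a shorter Redis timeout threshold with a persistence-layer fallback as described in the mitigation measures.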
Description: This incident has been resolved. All services are working as expected.
Status: Resolved
Impact: Critical | Started At: May 8, 2023, 3:17 p.m.
Description: **Infrastructure/hardware instability - Incident Report**

**Date of Incident:** 05/01/2023
**Time/Date Incident Started:** 05/01/2023, 5:00 pm EDT
**Time/Date Stability Restored:** 05/01/2023, 11:48 pm EDT
**Time/Date Incident Resolved:** 05/01/2023, 11:48 pm EDT
**Users Impacted:** All
**Frequency:** Intermittent
**Impact:** Major

**Incident Description:** Third-party vendor infrastructure/hardware instability.

**Root Cause Analysis:** A third-party vendor infrastructure issue affected performance and system availability for the underlying data storage layer servicing platform resources.

**Actions Taken:**
1. Investigated system-generated alerts and identified the affected platform functionality.
2. SRE and DBA teams initiated a platform infrastructure redeployment, forcing the new infrastructure to spin up on unaffected infrastructure/hardware.

**Mitigation Measures:**
1. Continue the ongoing investigation into the root causes of the infrastructure issue within our cloud hosting provider.
2. Continue to implement high availability improvements to prepare the platform to respond better to unexpected hardware issues that are beyond our control.
Status: Postmortem
Impact: Major | Started At: May 2, 2023, 1:21 a.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.