Last checked: 6 minutes ago
Get notified about any outages, downtime, or incidents for ServiceChannel and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for ServiceChannel.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for ServiceChannel:
Component | Status |
---|---|
WorkForce | Active |
Analytics | Active |
Analytics Dashboard | Active |
Analytics Download | Active |
Data Direct | Active |
API | Active |
API Response | Active |
Authentication | Active |
Budget Insights | Active |
SendXML | Active |
SFTP | Active |
Universal Connector | Active |
Mobile Applications | Active |
SC Mobile | Active |
SC Provider | Active |
Provider Automation | Active |
Fixxbook | Active |
Invoice Manager | Active |
IVR | Active |
Login | Active |
Proposal Manager | Active |
Work Order Manager | Active |
Service Automation | Active |
Asset Manager | Active |
Compliance Manager | Active |
Dashboard | Active |
Inventory Manager | Active |
Invoice Manager | Active |
Locations List | Active |
Login | Active |
Maps | Active |
Project Tracker | Active |
Proposal Manager | Active |
Supply Manager | Active |
Weather | Active |
Work Order Manager | Active |
Service Center | Active |
Email - servicechannel.com | Active |
Email - servicechannel.net | Active |
Phone - Inbound | Active |
Phone - Outbound | Active |
Third Party Components | Active |
Avalara Tax Calculation Service | Active |
Rackspace - Inbound Email | Active |
Twilio REST API | Active |
Zendesk | Active |
View the latest incidents for ServiceChannel and check for official updates:
Description: **US Production App Rollback Incident Report** **Date of Incident:** 08/09/2023 **Time/Date Incident Started:** 08/09/2023, 10:00 pm EDT **Time/Date Stability Restored:** 08/10/2023, 12:00 am EDT **Time/Date Incident Resolved:** 08/10/2023, 12:00 am EDT **Users Impacted:** All **Frequency:** Continuous **Impact:** Major **Incident description:** On 8/9/23, the production release of the US application code was rolled back after smoke testing and synthetic monitors detected errors on the ServiceChannel platform. **Root Cause Analysis:** Upon investigation, it was determined that the cause of the issue could be traced back to a recent update in the platform session cookie. This update resulted in a malfunction of the Component module due to the module specifying an incorrect Redis store for session data. **Actions Taken:** 1. In response to the incident, the team promptly executed a rollback of the application services code to the previous functional version. After the rollback, the stability of the web platform was confirmed through both smoke testing and synthetic monitors. 2. To address the underlying problem, the Redis connection strings for the component modules were updated. The US Production release was re-deployed on 8/10/23 at 10 PM EDT with the correct configuration applied. **Mitigation Measures:** To prevent similar incidents in the future, the following mitigation measures will be implemented: 1. Ensuring Environment Consistency: A concerted effort will be made to better align production and non-production configurations. 2. Governance of Production Changes: To maintain greater control over potentially disruptive production changes, any changes that, due to scale considerations, can only be applied to the Production environment will require explicit approval from senior management before implementation. 3. Monitoring Production-Only Variables: We will implement automated monitoring to regularly check for the presence of "Production Only" configuration values. This practice will provide an additional layer of oversight and help prevent inadvertent changes.
Status: Postmortem
Impact: None | Started At: Aug. 10, 2023, 2 a.m.
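The root cause in the incident above (a module pointing at the wrong Redis store for session data) and the stated mitigation of monitoring "Production Only" configuration values lend themselves to a simple automated configuration check. The sketch below illustrates one way such a check could look; the config file layout, the `session_redis` key, and the expected endpoint value are assumptions for illustration, not ServiceChannel's actual configuration.

```python
# Hypothetical sketch: verify that each module's session-store Redis
# connection string matches the expected production endpoint, in the spirit
# of the "Monitoring Production-Only Variables" mitigation above.
# The config layout and endpoint value are assumed, not ServiceChannel's.
import json
import sys

EXPECTED_SESSION_REDIS = "prod-session-redis.internal:6379"  # assumed value

def check_session_store(config_path: str) -> list[str]:
    """Return modules whose session Redis endpoint differs from the expected one."""
    with open(config_path) as fh:
        config = json.load(fh)

    mismatches = []
    for module, settings in config.get("modules", {}).items():
        endpoint = settings.get("session_redis")
        if endpoint != EXPECTED_SESSION_REDIS:
            mismatches.append(f"{module}: {endpoint!r}")
    return mismatches

if __name__ == "__main__":
    problems = check_session_store(sys.argv[1] if len(sys.argv) > 1 else "modules.json")
    if problems:
        print("Session store mismatches found:")
        print("\n".join(problems))
        sys.exit(1)
    print("All modules point at the expected production session store.")
```

A check like this could run on a schedule or as a deployment gate, so a misconfigured session store is flagged before a release rather than during post-release smoke testing.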
Description: **Incident Report: Secondary Read Replica Unavailability and Application Degradation** **Date of Incident:** 08/04/2023 **Time/Date Incident Started:** 08/04/2023, 6:51 AM EDT **Time/Date Stability Restored:** 08/04/2023, 10:00 AM EDT **Time/Date Incident Resolved:** 08/04/2023, 10:45 AM EDT **Users Impacted:** All users **Frequency:** Sustained **Impact:** Major **Incident description:** On August 4th at 6:51 am EDT, a significant incident occurred as the secondary read replica became unavailable. This led to an increased load on the DB system, resulting in intermittent slowness that adversely affected a large number of users. The degraded application experience raised concerns and triggered immediate investigation and response. **Root Cause Analysis:** The incident was promptly addressed by the ServiceChannel SRE (Site Reliability Engineering) and DBA (Database Admin) teams following an automated alert triggered by an unhealthy state in the AG replication. Upon thorough investigation, the DEVOPS team meticulously reviewed all logs associated with August 4th within the AG replication timeframe. Their efforts unveiled a configuration modification of the system firewalls that coincided with a triggered restart of the database system. The SRE team effectively pinpointed this change within our configuration management systems, which inadvertently pushed through a firewall policy modification. Consequently, the modified database firewall settings obstructed traffic flow to the replica servers, initiating the incident. **Actions Taken:** 1. Immediate Alert Response: The DBA team swiftly responded by reviewing and promptly acknowledging the monitoring alerts associated with the impacted segment of the application. This proactive step ensured that the issue was promptly recognized and addressed. 2. Redeployment and Restart: In a concerted effort to restore system stability, the DBA team executed the strategic redeployment and thorough restart of both primary and secondary database replicas. This rigorous approach aimed to rectify the root cause of the incident and mitigate its impact on performance and availability. 3. Persistent Challenges: Despite the initial actions, the immediate system performance and availability concerns persisted, requiring a deeper investigation to uncover the underlying factors contributing to the incident's persistence. 4. Configuration Management Insights: A comprehensive analysis of our configuration management system logs revealed a crucial breakthrough. This investigation shed light on the unexpected enablement of system firewalls, which had previously gone unnoticed. This realization marked a pivotal turning point in our efforts to restore normalcy. 5. Rapid Firewall Disablement: Armed with the newfound understanding, the necessary steps were taken to promptly disable the system firewalls that were impeding traffic flow. This decisive action facilitated the gradual return of the system to its intended state, marking a definitive resolution to the incident. **Mitigation Measures:** In light of this incident, several proactive steps have been taken to mitigate the risk of similar occurrences: 1. Enhanced Monitoring: A robust monitoring system will be implemented to vigilantly track data-enabled functionality changes (functionality feature switches). This enhanced monitoring will promptly detect anomalies and potential performance issues, allowing for swift intervention. 2. Playbook Updates: The DBA and DEVOPS teams' troubleshooting playbook will be meticulously updated to incorporate the lessons learned from this incident. These revisions will streamline response procedures and ensure quicker, more effective resolution. 3. Code Review Process: The code review process has been revamped to include a meticulous assessment of dependencies in any configuration changes. This will mitigate unforeseen interactions and potential disruptions. 4. Conditional Logic Refinement: The SRE team has improved the conditional logic governing firewall settings, ensuring that they are enabled only when explicitly defined. This refinement adds an additional layer of control and security. 5. Continuous Enhancement: Our commitment to improvement remains steadfast. The ongoing development of tests and alerting systems will be a top priority, further enhancing our ability to detect and respond to data and configuration changes.
Status: Postmortem
Impact: Critical | Started At: Aug. 4, 2023, 1:19 p.m.
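Mitigation item 4 in the incident above describes refining conditional logic so that firewalls are "enabled only when explicitly defined." A minimal sketch of that kind of guard is shown below; the variable name `db_firewall_enabled` and the `apply_firewall_policy` helper are hypothetical stand-ins for whatever the configuration management system actually uses, not ServiceChannel's real code.

```python
# Hypothetical sketch of an "enabled only when explicitly defined" guard for
# a firewall setting in a configuration management run. Variable names and
# helpers are assumptions for illustration.
from typing import Mapping

def should_enable_firewall(host_vars: Mapping[str, object]) -> bool:
    """Enable the host firewall only when the variable is explicitly True.

    A missing, false, or ambiguous value (e.g. 1, "yes") leaves the firewall
    untouched, so an accidental push cannot silently turn it on.
    """
    return host_vars.get("db_firewall_enabled") is True  # strict opt-in

def apply_firewall_policy(host: str, host_vars: Mapping[str, object]) -> str:
    if should_enable_firewall(host_vars):
        return f"{host}: firewall policy applied"
    return f"{host}: firewall left unchanged (not explicitly enabled)"

if __name__ == "__main__":
    print(apply_firewall_policy("db-replica-1", {}))                            # unchanged
    print(apply_firewall_policy("db-replica-2", {"db_firewall_enabled": 1}))    # unchanged: not strictly True
    print(apply_firewall_policy("db-primary", {"db_firewall_enabled": True}))   # applied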
Description: **Date of Incident:** 07/10/2023 **Time/Date Incident Started:** 07/10/2023, 1:36 PM EDT **Time/Date Stability Restored:** 07/10/2023, 2:27 PM EDT **Time/Date Incident Resolved:** 07/10/2023, 2:53 PM EDT **Users Impacted:** All users **Frequency:** Sustained **Impact:** Major **Incident description:** On July 10th at approximately 1:36 pm EDT, customers encountered significant slowness after logging into the platform. The slowness impacted a large number of users, leading to a suboptimal experience. **Root Cause Analysis:** The ServiceChannel SRE (Site Reliability Engineering) and DBA (Database Admin) teams responded to an automated alert triggered by high CPU usage on database read replicas. Upon investigation, the DBA team identified a new module and functionality that was executing excessively long queries against the read replicas. This new module was recently enabled for internal vendor logins. **Actions Taken:** 1. The SRE and DBA teams promptly reviewed and acknowledged monitoring alerts related to the affected part of the application. 2. The DBA and engineering teams collaborated to identify the root cause of the high loads, which was traced back to the newly enabled functionality for internal vendor logins. 3. To mitigate the issue, the DBA and engineering teams disabled the problematic functionality through a functionality feature switch. **Mitigation Measures:** 1. Improved monitoring of data-enabled functionality (functionality feature switches) to quickly detect anomalies and potential performance issues. 2. Implementation of a more aggressive graceful degradation approach, selectively disabling problematic functionality when high loads are detected to prevent widespread impact. 3. Continuous improvement of stress tests in lower environments to enhance the discovery of similar performance-related issues.
Status: Postmortem
Impact: Major | Started At: July 10, 2023, 5:46 p.m.
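The incident above was contained by turning off a functionality feature switch, and the mitigation measures call for a more aggressive graceful-degradation approach when high load is detected. The sketch below shows one way that could look; the flag name, CPU threshold, and in-memory `FeatureFlags` store are assumptions for illustration, not ServiceChannel's actual feature-switch system.

```python
# Hypothetical sketch of load-triggered graceful degradation: disable a
# feature switch when every read replica is above a CPU threshold.
# Flag name, threshold, and the FeatureFlags store are assumed.
from dataclasses import dataclass, field

CPU_THRESHOLD_PCT = 85.0  # assumed degradation threshold

@dataclass
class FeatureFlags:
    """Minimal in-memory stand-in for a feature-switch store."""
    flags: dict[str, bool] = field(default_factory=dict)

    def disable(self, name: str) -> None:
        self.flags[name] = False

    def is_enabled(self, name: str) -> bool:
        return self.flags.get(name, False)

def degrade_if_overloaded(replica_cpu_pct: list[float], flags: FeatureFlags,
                          flag_name: str = "vendor_login_reports") -> bool:
    """Disable the given switch if all read replicas exceed the CPU threshold.

    Returns True when degradation was applied, False otherwise.
    """
    if replica_cpu_pct and min(replica_cpu_pct) > CPU_THRESHOLD_PCT:
        flags.disable(flag_name)
        return True
    return False

if __name__ == "__main__":
    flags = FeatureFlags({"vendor_login_reports": True})
    applied = degrade_if_overloaded([92.4, 88.1], flags)
    print("degraded:", applied, "| flag enabled:", flags.is_enabled("vendor_login_reports"))
```

In practice the replica CPU readings would come from the monitoring system described in the mitigation measures, and the disabled switch would be re-enabled manually once the load issue is understood.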
Description: **Date of Incident:** 07/04/2023 **Time/Date Incident Started:** 07/04/2023, 10:42 am EDT **Time/Date Stability Restored:** 07/04/2023, 10:51 am EDT **Time/Date Incident Resolved:** 07/04/2023, 12:48 pm EDT **Users Impacted:** All **Frequency:** Continuous **Impact:** Critical **Incident description:** A hardware fault affecting the server in the primary database cluster caused a brief loss of availability of the Primary Database Replica, and subsequent platform downtime, while the cluster healed itself. **Root Cause Analysis:** According to our cloud hosting partner, the server acting as the listener and primary node in the production database cluster suffered a critical hardware fault and went offline. A transient network issue introduced a brief delay in the failover mechanism, but all affected services recovered within a few minutes. **Actions Taken:** 1. Restarted the affected service to bring the failed node back online. 2. Monitored the impacted platform components to ensure application recovery. **Mitigation Measures:** 1. Redeployment of the impacted virtual machine took place during the 7/8/2023 planned maintenance window. 2. Continue the investigation with our cloud service provider to improve cluster recovery even during transient network events.
Status: Postmortem
Impact: Critical | Started At: July 4, 2023, 2:42 p.m.
Description: Please see the general postmortem at [https://status.servicechannel.com/incidents/cvp26brsbwl8](https://status.servicechannel.com/incidents/cvp26brsbwl8) for a comprehensive description of work to remediate platform performance issues.
Status: Postmortem
Impact: Critical | Started At: June 19, 2023, 9:04 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.