Outage and incident data over the last 30 days for Firstup.
View the latest incidents for Firstup and check for official updates:
Description:
## Summary:
On Wednesday, November 13th, 2024, starting at around 6:14 AM PT, we received reports of some users experiencing general slowness and decreased responsiveness while navigating Firstup's Studio platform. Additional reports of some published campaign emails not being delivered, and of error messages being returned while navigating within the Employee Experience, followed shortly thereafter.
## Severity:
Sev2
## Scope:
The scope of this service degradation included all Firstup customers utilizing the Studio and Employee Experience (EE) endpoints.
## Impact:
For approximately 2.5 hours after the onset of the incident, users were unable to efficiently and consistently navigate Studio and the EE. Symptoms included an extended loading spinner in Studio and missing shortcuts in EE, with some of the errors noted below being returned:
* Oops, an error occurred! Sorry about that.
* TypeError: toMoment(…).format is not a function
Concurrently, campaign emails were not being delivered consistently to their intended recipients, and user sync files were not being actively processed. Most campaign emails were delivered over the course of the next 3 hours, with a few rare exceptions that required manual intervention and customer coordination. User sync was not restored until the following day, when it was discovered that the process was not actively running. Total incident duration, excluding the user sync process, was just over 6 hours.
## Root Cause:
The root cause has been attributed to a code change released during the Platform Software Release maintenance window the night before, which caused a slow-running query from one of our back-end services to run for too long and consume an exceptionally large amount of database resources, including network connections and CPU processing cycles. This, coupled with the normal increase in database requests from our customer base during our platform utilization peak hours starting at around 5:54 AM PT, caused available database connections to be exhausted and CPU overutilization conditions in the database. As a result, new connections to the database could not be established until existing connections were closed and made available for new requests from platform services such as Studio and EE. The backend service responsible for campaign email delivery was also subject to this condition and could not process email deliveries as expected. Similarly, user sync file processing was delayed beyond normal limits.
## Mitigation:
The various symptoms exhibited during the incident were mitigated in phases. The most significant service impact was mitigated after a hotfix was released at 8:29 AM PT to halt the aforementioned slow-running query, relieving resource pressure on the database and allowing customer-facing service requests from Studio and EE to successfully re-establish connections with the database. Additional resources were also spun up to process the email delivery queue backlog that had been growing during the incident, which started draining at 12:02 PM PT. Almost all campaign emails were confirmed delivered by 1:10 PM PT, with the exception of ~150 email messages that had experienced an internal error but were later cleared and delivered.
## Recurrence Prevention:
The following actions have been taken or have been identified as follow-up actions to commit to as part of the formal RCA (Root Cause Assessment) process:
* Moved the slow-running query to run after hours (outside customer peak hours).
* Isolate the campaign email delivery back-end service in its own database cluster to avoid general database impact on email deliveries.
* Enhance our software change management policies to release risky backend changes behind a feature flag for a more controlled release.
* Update post-incident service verification to ensure that user sync processing has been fully restored and remains functional.
Status: Postmortem
Impact: None | Started At: Nov. 13, 2024, 3:07 p.m.
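The prevention items for the incident above mention gating risky backend changes behind a feature flag and keeping the slow-running query from monopolizing the database during peak hours. Below is a minimal, hypothetical sketch of that combination: the flag check, the `run_audit_query` name, and the SQL are illustrative assumptions, not Firstup's actual code; only the `psycopg2` statement-timeout mechanism is a real library feature.

```python
# Hypothetical sketch: gate a risky query behind a feature flag and cap its
# runtime with a server-side statement timeout so a regression cannot hold
# database connections or CPU during peak hours.
import os
import psycopg2


def is_enabled(flag: str) -> bool:
    """Toy feature-flag check; a real system would query a flag service."""
    return os.environ.get(f"FLAG_{flag.upper()}", "off") == "on"


def run_audit_query(dsn: str) -> None:
    if not is_enabled("nightly_audit_query"):
        return  # flag off: the risky change ships but stays inert

    # statement_timeout cancels the query server-side if it runs too long,
    # releasing its connection back to the pool instead of exhausting it.
    conn = psycopg2.connect(dsn, options="-c statement_timeout=30000")  # 30 s
    try:
        with conn, conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM campaigns WHERE archived = false;")
            print(cur.fetchone())
    finally:
        conn.close()
```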
Description:
**Summary:** On Monday, November 4th, 2024, starting at around 7:53 AM PT, we received reports that some published campaign emails were not being delivered to their intended audiences. While some emails were delivered as expected, others were either delayed or appeared to be stuck. All emails were handed off in a timely manner from the Firstup platform to the third-party email provider. The problem worsened over the course of the next several hours: email throughput appeared to be highly restricted, while the backlog of the email delivery queue not only continued to grow but also did not drain in the chronological order in which messages were initially queued. Through a joint troubleshooting call with the third-party email provider, it was determined that a large volume of email delivery errors starting at around 7:11 AM PT had put the entire pool of our email delivery IP addresses in a state of reduced performance. After jointly reviewing the highest-volume errors with the third-party email provider, the sender IPs were restored to a fully functioning state, and the entire email backlog was fully drained by 3:30 PM PT.
**Severity:** Sev2
**Scope:** The scope of this service degradation was restricted to customers who use Firstup campaign email delivery as a channel, as well as any other non-campaign email content sent from the Firstup platform, such as password reset request emails. Push notifications, assistant notifications, and web or mobile experience channels were unaffected and remained fully functional.
**Impact:** Within an individual campaign sent to email as a channel, some emails may not have been delivered as expected, while others wound up stuck in a "processing" state on the third-party email delivery platform. During the incident (7hrs 58mins), some of the emails in the "processing" state were successfully delivered but heavily delayed, while others remained stuck. Observed email throughput was reduced to approximately 46k messages per hour from a theoretical maximum of 30k messages per second. The total outstanding backlog prior to mitigation was over a million email messages.
**Root Cause:** The root cause has been attributed to an elevated level of email delivery errors that triggered a protection mechanism on the third-party email provider platform. This reduced throughput for the entire pool of our sender IP addresses to the point where mostly retries of deferrals from earlier delivery errors were being processed, and very few queued emails were delivered. Essentially the queue-processing equivalent of running in place.
**Mitigation:** Analysis showed that the top contributors to email delivery errors were correlated to a single misconfigured email security endpoint; all addresses associated with that endpoint were force-unsubscribed until it could be correctly configured, to avoid any further email delivery errors contributing to the underlying log jam. Roughly 80k email errors were attributed to that endpoint in just a couple of hours. Through a joint incident bridge with the third-party email provider, Firstup demonstrated that the deferral rates did not account for the overall email backlog queue. A data pipeline engineer was paged and verified that the sender IPs had been relegated to a lower-performing state that was contributing to a circular problem. Backend system changes were made at 3:09 PM PT on the third-party platform to restore the prior state of the sender IPs, and the entire email message backlog subsequently drained fully in less than 25 minutes.
**Recurrence Prevention:** The following actions have been taken or have been identified as follow-up actions to commit to as part of the formal RCA (Root Cause Assessment) process:
* Email addresses contributing to elevated error rates will be bulk-unsubscribed from the platform (or otherwise quarantined) until underlying conditions can be corrected.
* Coordinate with the third-party provider to better understand the characteristics of the platform safety mechanism, including why it triggered, how to avoid it entirely, and how to improve joint monitoring and prevent elevated error rates from affecting overall delivery.
* Implement any reasonable recommendations from the third-party RFO.
Status: Postmortem
Impact: None | Started At: Nov. 4, 2024, 4:40 p.m.
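The first prevention item for the incident above, bulk-unsubscribing addresses that generate elevated error rates, could look roughly like the sketch below. Everything here is an assumption for illustration: the error-log record shape, the error threshold, and the `suppress()` callback stand in for whatever the third-party provider's real suppression mechanism is.

```python
# Hypothetical sketch of the "bulk quarantine" follow-up: group recent
# delivery errors by recipient domain and suppress addresses behind a
# misbehaving endpoint until it is fixed.
from collections import Counter
from typing import Callable, Iterable

ERROR_THRESHOLD = 10_000  # errors per domain before we quarantine (illustrative)


def domains_to_quarantine(errors: Iterable[dict]) -> list[str]:
    """errors: dicts like {"recipient": "user@example.com", "code": "4.7.1"}."""
    counts = Counter(e["recipient"].split("@")[-1].lower() for e in errors)
    return [domain for domain, n in counts.items() if n >= ERROR_THRESHOLD]


def quarantine(errors: list[dict], suppress: Callable[[str], None]) -> None:
    """suppress(address) is a placeholder for an unsubscribe/suppression call."""
    bad_domains = set(domains_to_quarantine(errors))
    for e in errors:
        address = e["recipient"]
        if address.split("@")[-1].lower() in bad_domains:
            suppress(address)  # remove from active sends until the endpoint is fixed
```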
Description:
**Summary:** On Monday, October 14th, 2024, starting at 9:41 AM PDT, we received reports of Studio users receiving the error message "We're sorry, but something went wrong" while attempting to log into Studio via Single Sign-On (SSO). Following a correlation of customer reports and initial troubleshooting, a platform service disruption incident was declared at 11:08 AM PDT and published on our Status Page at 11:11 AM PDT.
**Severity:** Sev2
**Scope:** The scope of this service disruption was restricted to Studio users on the US platform attempting to log into Studio via SSO. Users who were already logged in before the incident, or who used other authentication methods to log into Studio, were unaffected.
**Impact:** Users could not log into Studio via SSO for the duration of this incident (1hr 37mins).
**Root Cause:** The root cause of this incident was attributed to an unexpected hardware failure on the AWS Redis cluster, which triggered a failover event at 9:35 AM PDT. The failover disrupted the authentication flow in the Identity and Access Management (IAM) Redis service, which did not re-establish connections to the failover cluster, leading to the SSO login error.
**Mitigation:** To mitigate this incident, the IAM service was restarted at 11:12 AM PDT to refresh the connections to the failover Redis cluster, which restored the Studio SSO login service.
**Recurrence Prevention:** To prevent this incident from recurring, we will perform the following:
* Introduce self-healing for IAM to automatically reconnect to Redis following failover events. This enhancement will be released in our upcoming Scheduled Software Release maintenance window on November 12th, 2024.
* Perform a gap analysis of the existing IAM monitoring and alerting dashboard.
Status: Postmortem
Impact: None | Started At: Oct. 14, 2024, 6:11 p.m.
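The self-healing behavior described above amounts to noticing a dead Redis connection after a failover and rebuilding the client rather than erroring out. A minimal sketch of that idea, using the open-source redis-py client, is shown below; the wrapper class, endpoint, and retry policy are illustrative assumptions, not the actual IAM implementation.

```python
# Minimal sketch of a self-healing Redis client: on a ConnectionError
# (e.g. after a failover), rebuild the connection and retry the call once.
import redis


class SelfHealingRedis:
    def __init__(self, host: str, port: int = 6379):
        self._host, self._port = host, port
        self._client = self._connect()

    def _connect(self) -> redis.Redis:
        # health_check_interval makes the client ping periodically, so a
        # failed-over primary is noticed without waiting for a user request.
        return redis.Redis(host=self._host, port=self._port,
                           socket_timeout=2, health_check_interval=30)

    def get(self, key: str):
        try:
            return self._client.get(key)
        except redis.exceptions.ConnectionError:
            # The failover dropped the connection: reconnect and retry once.
            self._client = self._connect()
            return self._client.get(key)
```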
Description:
## Summary:
On September 30th, 2024, beginning at approximately 1:24 PM PDT (20:24 UTC), we started receiving reports of Shortcuts intermittently being unavailable and the Assistant returning an error in the Employee Experience. A platform incident was declared at 2:36 PM PDT (21:36 UTC) after initial investigations revealed the issue to be platform-wide.
## Severity:
Sev2
## Scope:
Any user on the US platform accessing the Web or Mobile Experiences intermittently experienced missing Shortcuts and/or received an error message while accessing the Assistant. A refresh of the Employee Experience page occasionally restored these endpoints. All other services in the Employee Experience remained available and functional.
## Impact:
The Shortcuts and Assistant endpoints in the Employee Experience were intermittently unavailable during the incident.
## Root Cause:
The root cause was determined to be an uncharacteristically high number of new user integrations introduced within a short period of time, which exacerbated a newly uncovered, non-optimized content caching behavior. This caused downstream latency and increased error rates in the web service responsible for rendering Shortcuts and the Assistant notification page.
## Mitigation:
The immediate impact was mitigated by restarting the Employee Experience integrations API, and services were restored by 2:42 PM PDT (21:42 UTC). While investigations into the root cause continued, the incident recurred the following day, October 1st, 2024, at 12:54 PM PDT (19:54 UTC). The Employee Experience integrations API and the dependent Employee Experience user-integrations request processing service (Pythia) were restarted, restoring the Shortcuts and Assistant endpoints by 1:46 PM PDT (20:46 UTC). Cache resources for Pythia were increased to mitigate the observed latency.
## Recurrence Prevention:
To prevent this incident from recurring, our engineering incident response team:
* Has developed a fix to optimize how user-integrations requests use the cache, reducing memory consumption and eliminating latency. This fix will be released during our scheduled Software Release maintenance window on October 15th, 2024.
* Will be adding a monitoring and alerting dashboard for the Employee Experience user-integrations request processing service (Pythia).
Status: Postmortem
Impact: None | Started At: Sept. 30, 2024, 9:36 p.m.
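The caching fix described above is about keeping the user-integrations cache bounded so a burst of new integrations cannot balloon memory or add latency to the Shortcuts/Assistant rendering path. A hypothetical sketch of one way to do that is below; the `cachetools` TTL cache is a real library, but the cache sizes, the `get_user_integrations` name, and the `fetch_integrations` upstream call are assumptions for illustration.

```python
# Hypothetical sketch: bound the user-integrations cache by size and TTL so
# stale entries are evicted instead of accumulating.
from cachetools import TTLCache

# At most 50k entries, each kept for 5 minutes (illustrative limits).
_cache: TTLCache = TTLCache(maxsize=50_000, ttl=300)


def get_user_integrations(user_id: str, fetch_integrations) -> list[dict]:
    """fetch_integrations(user_id) is a placeholder for the upstream call."""
    try:
        return _cache[user_id]          # cache hit: no upstream latency
    except KeyError:
        integrations = fetch_integrations(user_id)
        _cache[user_id] = integrations  # oldest/expired entries are evicted
        return integrations
```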
Description:
**Summary:** On September 16th, 2024, starting at around 11:00 AM PDT, we started receiving customer reports that the Web and Mobile Experiences endpoints were unavailable. Following a correlation of these reports and system monitors, a platform incident was declared at 11:14 AM PDT.
**Severity:** Sev1
**Scope:** Any user on the US platform attempting to access the Web and Mobile Experiences intermittently received an error message, and the Employee Experience failed to load.
**Impact:** The core Web and Mobile Experiences platform endpoints were intermittently unavailable for the duration of the incident (1hr 38mins).
**Root Cause:** The root cause was determined to be an exhaustion of the available database connections due to a sudden burst of user engagement activity correlated to a small number of high-visibility campaigns. At 10:50 AM PDT, a dependent back-end service entered a crash-loop back-off state because its database connection requests were being refused, and it returned the error message to end users.
**Mitigation:** The immediate problem was mitigated by fully redeploying the Employee Experience microservice after initial attempts at more surgical, standardized mitigation maneuvers proved ineffective. Those earlier maneuvers focused on reducing database load by temporarily disabling platform features and functionality that make heavy use of database transactions, which reduced overall error rates but did not eliminate customer impact. Web and Mobile Experience availability was restored by 12:28 PM PDT.
**Recurrence Prevention:** To prevent this incident from recurring, our engineering incident response team has:
* Increased the available database connections by 40% to account for any unforeseen spikes in platform traffic.
* Added circuit breakers that intercept abnormal increases in platform traffic, thereby maintaining platform endpoint availability.
* Added an additional incident mitigation maneuver to disable campaign reactions, so that a full-service redeploy would not be required to restore platform availability.
Status: Postmortem
Impact: None | Started At: Sept. 16, 2024, 6:14 p.m.
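The circuit-breaker follow-up above follows a standard pattern: after repeated failures (here, refused database connections), stop sending new requests for a cool-down period so the connection pool can recover instead of crash-looping. The sketch below is a generic, minimal version of that pattern; the thresholds and the `CircuitBreaker` class are illustrative, not Firstup's implementation.

```python
# Minimal circuit-breaker sketch: trip open after repeated failures,
# shed load while open, then allow a trial request after a cool-down.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: shedding load")
            self.opened_at = None  # half-open: allow one trial request
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```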