Outage and incident data over the last 30 days for Firstup.
View the latest incidents for Firstup and check for official updates:
Description:

## Summary
On September 30th, 2024, beginning at approximately 1:24 PM PDT (20:24 UTC), we started receiving reports of Shortcuts intermittently being unavailable and the Assistant returning an error in the Employee Experience. A platform incident was declared at 2:36 PM PDT (21:36 UTC) after initial investigation revealed the issue to be platform-wide.

## Severity
Sev2

## Scope
Any user on the US platform accessing the Web or Mobile Experiences intermittently experienced missing Shortcuts and/or received an error message while accessing the Assistant. Refreshing the Employee Experience page occasionally restored these endpoints. All other services in the Employee Experience remained available and functional.

## Impact
The Shortcuts and Assistant endpoints in the Employee Experience were intermittently unavailable during the incident.

## Root Cause
The root cause was an uncharacteristically high number of new user integrations introduced within a short period, which exacerbated a newly uncovered, non-optimized content-caching behavior. This caused downstream latency and increased error rates in the web service responsible for rendering Shortcuts and the Assistant notification page.

## Mitigation
The immediate impact was mitigated by restarting the Employee Experience integrations API, and services were restored by 2:42 PM PDT (21:42 UTC). While root-cause investigation continued, the incident recurred the following day, October 1st, 2024, at 12:54 PM PDT (19:54 UTC). The Employee Experience integrations API and the dependent Employee Experience user-integrations request processing service (Pythia) were restarted, restoring the Shortcuts and Assistant endpoints by 1:46 PM PDT (20:46 UTC). Cache resources for Pythia were increased to mitigate the observed latency.

## Recurrence Prevention
To prevent this incident from recurring, our engineering incident response team:

* Has developed a fix to optimize how user-integrations requests use the cache, reducing memory consumption and eliminating latency. This fix will be released during our scheduled software release maintenance window on October 15th, 2024.
* Will add a monitoring and alerting dashboard for the Employee Experience user-integrations request processing service (Pythia).
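Firstup has not published the fix itself, but the failure mode described above, a burst of new user integrations growing an unbounded cache until latency and errors cascade, is commonly addressed with a cache that bounds both entry lifetime and entry count. The sketch below is a generic illustration of that pattern, not Firstup's actual code; the class name and limits are assumptions.

```python
import time
from collections import OrderedDict

class BoundedTTLCache:
    """Illustrative LRU cache with a TTL and a hard entry cap, so a burst
    of new keys (e.g. user integrations) cannot grow memory without bound.
    Not Firstup's implementation; a generic sketch of the pattern."""

    def __init__(self, max_entries=10_000, ttl_seconds=300):
        self._store = OrderedDict()  # key -> (expires_at, value)
        self.max_entries = max_entries
        self.ttl = ttl_seconds

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]      # expired: drop the stale entry
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = (time.monotonic() + self.ttl, value)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```

With a cap in place, a sudden wave of new integrations evicts old entries instead of exhausting memory, which directly targets the "reduce memory consumption" goal in the recurrence-prevention list.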
Status: Postmortem
Impact: None | Started At: Sept. 30, 2024, 9:36 p.m.
Description:

**Summary:** On September 16th, 2024, starting at around 11:00 AM PDT, we began receiving customer reports that the Web and Mobile Experiences endpoints were unavailable. After correlating these reports with system monitors, a platform incident was declared at 11:14 AM PDT.

**Severity:** Sev1

**Scope:** Any user on the US platform attempting to access the Web and Mobile Experiences intermittently received an error message, and the Employee Experience failed to load.

**Impact:** The core Web and Mobile Experiences platform endpoints were intermittently unavailable for the duration of the incident (1 hr 38 mins).

**Root Cause:** The root cause was exhaustion of the available database connections due to a sudden burst of user engagement activity correlated with a small number of high-visibility campaigns. At 10:50 AM PDT, a dependent back-end service entered a crash-loop back-off state because its database connection requests were being refused, and it returned the error message to end users.

**Mitigation:** The immediate problem was mitigated by fully redeploying the Employee Experience microservice after more surgical, standardized mitigation maneuvers proved ineffective. Those earlier maneuvers focused on reducing database load by temporarily disabling platform features that make heavy use of database transactions, which reduced overall error rates but did not eliminate customer impact. Web and Mobile Experience availability was restored by 12:28 PM PDT.

**Recurrence Prevention:** To prevent this incident from recurring, our engineering incident response team has:

* Increased the available database connections by 40% to absorb unforeseen spikes in platform traffic.
* Added circuit breakers that intercept abnormal increases in platform traffic, thereby maintaining platform endpoint availability.
* Added an incident mitigation maneuver to disable campaign reactions, so a full-service redeploy is not required to restore platform availability.
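The circuit-breaker remediation mentioned above follows a well-known pattern: after repeated failures, stop forwarding requests for a cooldown period so the overloaded database can recover instead of drowning in refused connection attempts. The following is a minimal generic sketch of that pattern; the class, thresholds, and exception types are assumptions, not Firstup's implementation.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures the
    circuit opens and calls are rejected immediately for `reset_after`
    seconds, shedding load from the failing dependency. Illustrative only."""

    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # set while the circuit is open

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: request rejected")
            # Cooldown elapsed: half-open, allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

In the incident described here, a breaker in front of the database would have converted the crash-loop back-off cascade into fast, bounded rejections while the connection pool recovered.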
Status: Postmortem
Impact: None | Started At: Sept. 16, 2024, 6:14 p.m.
Description:

**Summary:** On September 4th, 2024, starting at 4:27 AM EDT, we received reports of Studio users being unable to view or edit campaigns in Studio. After correlating customer reports and initial troubleshooting, a platform service degradation incident was declared at 9:21 AM EDT and published on our Status Page at 9:41 AM EDT.

**Scope:** The service degradation was restricted to Studio users with multiple audiences assigned to them.

**Impact:** Studio users with multiple audiences assigned to them were unable to view or edit campaigns for the duration of this incident (12 hrs 46 mins). No scheduled campaigns were affected, and campaign viewing, editing, and publishing were not inhibited for users with no audiences or a single audience assigned.

**Root Cause:** The root cause was a regression introduced by a misconfigured platform enhancement policy change intended to improve the efficiency of how user-assigned audiences were queried. The change had been released at 12:07 AM EDT as part of the scheduled software release maintenance that same day.

**Mitigation:** A rollback of the offending policy change was completed by 12:53 PM EDT, restoring access to Studio campaigns.

**Recurrence Prevention:** To prevent this incident from recurring, we will perform the following actions before re-releasing the platform enhancement policy change:

* Review and correct any misconfiguration in the policy change code.
* Add unit test cases covering multiple audiences on the modified queries.
* Add regression test cases covering users with multiple audiences.
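The recurrence-prevention items above boil down to one testing principle: the audience query must behave identically for users with zero, one, or many assigned audiences, since the regression only surfaced in the multi-audience case. The sketch below is a hypothetical stand-in for that query plus the three regression cases; the function, data shape, and visibility policy are illustrative assumptions, not Firstup's actual logic.

```python
def campaigns_visible_to(user_audiences, campaigns):
    """Hypothetical visibility check: a user sees any campaign targeting
    at least one of their audiences; a user with no assigned audiences
    sees everything. Illustrative policy, not Firstup's actual rule."""
    if not user_audiences:
        return list(campaigns)
    wanted = set(user_audiences)
    return [c for c in campaigns if wanted & set(c["audiences"])]

campaigns = [
    {"id": 1, "audiences": ["eng"]},
    {"id": 2, "audiences": ["sales", "eng"]},
    {"id": 3, "audiences": ["hr"]},
]

# Regression cases spanning zero, one, and multiple assigned audiences,
# mirroring the test coverage the postmortem commits to adding:
assert len(campaigns_visible_to([], campaigns)) == 3
assert [c["id"] for c in campaigns_visible_to(["hr"], campaigns)] == [3]
assert [c["id"] for c in campaigns_visible_to(["eng", "hr"], campaigns)] == [1, 2, 3]
```

Had the multi-audience case been exercised in tests like these before the release, the misconfigured query change would have failed in CI rather than in production.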
Status: Postmortem
Impact: None | Started At: Sept. 4, 2024, 1:41 p.m.
Description:

**Summary:** On August 28th, 2024, at 1:03 PM EDT, system monitors alerted us to failing database health checks, and our engineering team immediately began investigating. Customer reports of core platform endpoints being unresponsive and/or returning error messages began arriving at 1:12 PM EDT, and a platform incident was declared at 1:21 PM EDT.

**Scope:** Any user on the US platform attempting to access or navigate the Web and Mobile Experience, as well as Studio, was impacted by this incident.

**Impact:** Core US platform endpoints, including the Web and Mobile Experiences and Studio, were slow to load or intermittently unavailable for the duration of the incident (48 minutes).

**Root Cause:** The root cause was a slow-running "user unread posts" query that saw a large spike in traffic after a campaign was published to a large audience. The database CPU spiked and the database stopped accepting new connection requests, causing new Web and Mobile Experience requests, as well as Studio requests, to fail and making the system appear unresponsive.

**Mitigation:** The immediate problem was mitigated by halving the number of pods submitting requests to the database, which alleviated the load and restored database responsiveness and platform endpoint availability by 1:51 PM EDT.

**Recurrence Prevention:** To prevent this incident from recurring, our engineering incident response team has optimized the offending slow-running query to perform twice as fast, reducing the database CPU it requires. We are also implementing circuit breakers in the services downstream of the database to prevent database CPU overutilization and ensure platform endpoint availability.
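The mitigation above, halving the pods that talk to the database, is a coarse way of capping concurrent load on a shared dependency. The same effect can be achieved inside each service with a semaphore that bounds in-flight queries, so a traffic spike degrades into brief queueing rather than connection exhaustion. This is a generic sketch under assumed names and limits, not Firstup's code.

```python
import threading

class ConnectionGate:
    """Caps in-flight database calls per process. When the cap is reached,
    callers block briefly instead of opening more connections, protecting
    the database during traffic spikes. Illustrative sketch only."""

    def __init__(self, max_in_flight=50):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, query_fn, *args, **kwargs):
        with self._slots:  # acquire a slot; blocks when all are in use
            return query_fn(*args, **kwargs)
```

A per-process cap like this composes with the pod count: total database pressure is roughly `pods x max_in_flight`, which makes "halve the pods" and "halve the gate" interchangeable levers during an incident.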
Status: Postmortem
Impact: None | Started At: Aug. 28, 2024, 5:21 p.m.
Description:

**Summary:** From approximately 11:08 AM to 11:38 AM PT (18:08 to 18:38 UTC) on Thursday, August 22nd, both Studio and the Web Experience were unavailable due to the release of Version 2 of Personalized Fields (PFV2), a new feature in the Q3 quarterly update that was more resource-intensive than initially planned. This caused high CPU usage, increased query latency, and database connection pool exhaustion.

**Impact:** This incident primarily affected users who attempted to access Studio services and the Web Experience between 11:08 AM and 11:38 AM PT. The issue manifested as the following errors on the frontend of the platform:

* We’re sorry, but something went wrong.
* 502 Bad Gateway.
* There was an error processing your request. Please try again.

**Root Cause:** The root cause was the release of Version 2 of Personalized Fields (PFV2), a new feature in the Q3 quarterly update that was more resource-intensive than initially planned. The feature caused a significant increase in CPU usage and query latency on the shared database cluster, as well as database connection pool exhaustion. This resulted in the Studio/Web Experience unavailability and the error messages observed by impacted users.

**Mitigation:** The immediate impact was mitigated by temporarily disabling the newly released feature that was causing excessive resource consumption. The cache Time-To-Live (TTL) was also increased from 1 minute to 3 hours to reduce load and stabilize performance. After service was restored, we tuned the platform and scaled up infrastructure outside business hours to accommodate the increased load introduced by the new feature.

**Recurrence Prevention:** To prevent a recurrence of this incident, the following actions have been or are being implemented:

* Load testing and analysis: more rigorous load testing and analysis to detect N+1 calls or latency spikes before a feature goes live.
* Infrastructure planning and caching strategy: refactor the caching for the affected feature, including pre-warming caches in batches to prevent cache-miss cascades, and optimize the infrastructure to handle increased load efficiently while caching only what is needed.
* Remove custom attributes for blocked users who have been inactive for a specific period, reducing table size and improving query performance.
* Feature flagging and gradual rollouts: future high-risk changes will be rolled out gradually, with improved resource monitoring before full deployment.
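The batched cache pre-warming mentioned in the recurrence-prevention list addresses the cache-miss cascade: if a cold cache takes all its misses at once, every miss becomes a database query and the spike recreates the outage. Warming in paced batches spreads that load out. The sketch below is a generic illustration; the function name, batch size, and pacing are assumptions, not Firstup's actual procedure.

```python
import time

def prewarm_cache(cache, keys, loader, batch_size=100, pause_seconds=0.5):
    """Warm `cache` for `keys` in fixed-size batches, pausing between
    batches so the backing database sees a steady trickle of reads
    instead of a thundering herd. Illustrative sketch only."""
    for start in range(0, len(keys), batch_size):
        for key in keys[start:start + batch_size]:
            cache[key] = loader(key)   # one backend read per key
        if start + batch_size < len(keys):
            time.sleep(pause_seconds)  # pace batches to protect the DB
```

Combined with the longer 3-hour TTL adopted during mitigation, pre-warming means the expensive loads happen on a schedule the database can absorb rather than on the first wave of user traffic.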
Status: Postmortem
Impact: None | Started At: Aug. 22, 2024, 6:27 p.m.