Outage and incident data over the last 30 days for Firstup.
View the latest incidents for Firstup and check for official updates:
Description: **Summary:** On September 4th, 2024, starting at 4:27 AM EDT, we received reports of Studio users being unable to view or edit campaigns in Studio. After correlating customer reports and completing initial troubleshooting, a platform service degradation incident was declared at 9:21 AM EDT and published on our Status Page at 9:41 AM EDT.

**Scope:** The scope of this service degradation was restricted to Studio users with multiple audiences assigned to them.

**Impact:** Studio users who had multiple audiences assigned to them were unable to view or edit campaigns for the duration of this incident (12hrs 46mins). No scheduled campaigns were affected, and campaign viewing, editing, and publishing were not inhibited for users who had no audiences or a single audience assigned.

**Root Cause:** The root cause of this incident was determined to be a regression introduced by a misconfigured platform enhancement policy change intended to improve the efficiency of how user-assigned audiences were queried. The change had been released at 12:07 AM EDT as part of the scheduled software release maintenance the same day.

**Mitigation:** A rollback of the offending policy change was completed by 12:53 PM EDT, restoring access to Studio campaigns.

**Recurrence Prevention:** To prevent this incident from recurring, we will perform the following actions before releasing the platform enhancement policy change in the future:

* Review and correct any misconfiguration in the platform enhancement policy change code.
* Add more unit test cases to cover multiple audiences on the modified queries.
* Add more regression test cases to cover users with multiple audiences.
Status: Postmortem
Impact: None | Started At: Sept. 4, 2024, 1:41 p.m.
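
The test-coverage actions listed in this postmortem can be pictured with a small sketch. Everything here is hypothetical: `Campaign`, `visible_campaigns`, the audience names, and the visibility rule (a user with no audiences is unrestricted, otherwise a campaign is visible when it targets any of the user's audiences) are assumptions made for illustration, not Firstup's actual query or data model. The point is only that the zero, single, and multiple audience cases each get their own test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Campaign:
    name: str
    audience_ids: frozenset


def visible_campaigns(user_audience_ids, campaigns):
    """Hypothetical stand-in for the audience-scoped campaign query.

    Assumed rule: no assigned audiences means unrestricted access;
    otherwise a campaign is visible if it targets any user audience.
    """
    user_audiences = frozenset(user_audience_ids)
    if not user_audiences:
        return [c.name for c in campaigns]
    return [c.name for c in campaigns if c.audience_ids & user_audiences]


CAMPAIGNS = [
    Campaign("launch", frozenset({"sales"})),
    Campaign("benefits", frozenset({"hr"})),
    Campaign("all-hands", frozenset({"sales", "engineering"})),
]


def test_no_audiences_sees_everything():
    assert visible_campaigns([], CAMPAIGNS) == ["launch", "benefits", "all-hands"]


def test_single_audience():
    assert visible_campaigns(["hr"], CAMPAIGNS) == ["benefits"]


def test_multiple_audiences_are_combined_without_duplicates():
    # "all-hands" matches both audiences but must be listed once.
    result = visible_campaigns(["sales", "engineering"], CAMPAIGNS)
    assert result == ["launch", "all-hands"]


def test_multiple_audiences_with_unknown_id():
    assert visible_campaigns(["hr", "missing"], CAMPAIGNS) == ["benefits"]
```

Functions named `test_*` are picked up by common Python test runners such as pytest; the multiple-audience cases are the ones the incident suggests were missing.
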
Description: **Summary:** On August 28th, 2024, at 1:03 PM EDT, system monitors alerted us to failing database health checks, and our engineering team immediately started investigating these alerts. Customer reports of core platform endpoints being unresponsive and/or returning error messages were received beginning at 1:12 PM EDT, and a platform incident was declared at 1:21 PM EDT.

**Scope:** Any user on the US platform attempting to access or navigate through the Web and Mobile Experience, as well as Studio, was impacted by this incident.

**Impact:** Core US platform endpoints such as Web and Mobile Experiences, as well as Studio, were slow to load or became intermittently unavailable for the duration of the incident (48 minutes).

**Root Cause:** The root cause was determined to be a slow-running query for “user unread posts” that saw a huge spike in traffic following a campaign that was published to a large audience. As a result, the database CPU spiked and the database stopped taking new connection requests. New Web and Mobile Experience requests, as well as Studio requests, then failed, and the system appeared to be unresponsive.

**Mitigation:** The immediate problem was mitigated by halving the number of pods submitting requests to the database to alleviate the load on it, which restored database responsiveness and platform endpoint availability by 1:51 PM EDT.

**Recurrence Prevention:** To prevent this incident from recurring, our engineering incident response team has optimized the offending “slow-running” query to perform 2x faster, thereby reducing the required database CPU resources. We are also working on implementing circuit breakers on the offending services downstream of the database, to prevent database CPU overutilization and to ensure platform endpoint availability.
Status: Postmortem
Impact: None | Started At: Aug. 28, 2024, 5:21 p.m.
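
The circuit breakers mentioned in this postmortem follow a well-known pattern, which a minimal Python sketch can illustrate. This is not Firstup's implementation: the class name, the threshold and timeout values, and the injected `clock` are all assumptions. After a number of consecutive failures the breaker "opens" and rejects calls immediately, so a struggling database is not sent more work; after a cool-down it lets one trial call through.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable for testing
        self._failures = 0
        self._opened_at = None

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of adding load to the backend.
            raise CircuitOpenError("circuit open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Trip on reaching the threshold, or re-trip after a failed trial call.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap the expensive query, for example `breaker.call(fetch_unread_posts, user_id)` where `fetch_unread_posts` is a hypothetical function, and serve a fallback when `CircuitOpenError` is raised.
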
Description: **Summary:** From approximately 11:08 am - 11:38 am PT (18:08 - 18:38 UTC) on Thursday, August 22nd, both Studio and Web Experience were unavailable due to the release of Version 2 of Personalized Fields (PFV2), a new feature in the Q3 quarterly update that was more resource intensive than initially planned. This caused high CPU usage, increased query latency, and database connection pool exhaustion.

**Impact:** This incident primarily affected users who attempted to access Studio services and Web Experience between 11:08 am - 11:38 am PT. The issue manifested as the following errors on the frontend of the platform:

* We’re sorry, but something went wrong.
* 502 Bad Gateway.
* There was an error processing your request. Please try again.

**Root Cause:** The root cause was determined to be the release of Version 2 of Personalized Fields (PFV2), a new feature in the Q3 quarterly update that was more resource intensive than initially planned. The feature caused a significant increase in CPU usage and query latency on the shared database cluster, as well as database connection pool exhaustion. This resulted in the Studio/Web Experience service unavailability and the error messages observed by impacted users.

**Mitigation:** The immediate impact was mitigated by temporarily disabling the newly released feature that was causing excessive resource consumption. The cache Time-To-Live (TTL) was also changed from 1 minute to 3 hours to reduce load and stabilize performance. After service was restored, we conducted platform tuning and scaled up infrastructure outside business hours to accommodate the increased load introduced by this new feature.

**Recurrence Prevention:** To prevent a recurrence of this incident, the actions below have been or are being implemented:

* Load Testing and Analysis - More rigorous load testing and analysis to detect N+1 calls or latency spikes before a feature goes live.
* Infrastructure Planning and Caching Strategy - Refactor the caching for the affected feature, including pre-warming caches in batches to prevent cache-miss cascades, and optimize the infrastructure to handle increased load efficiently whilst only caching what is needed.
* Remove custom attributes for blocked users who have been inactive for a specific period to reduce table size and improve query performance.
* Feature Flagging and Gradual Rollouts - Future high-risk changes will be rolled out gradually, with improved resource and performance monitoring in place before full deployment.
Status: Postmortem
Impact: None | Started At: Aug. 22, 2024, 6:27 p.m.
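
Two of the caching measures in this postmortem, a longer TTL and pre-warming caches in batches, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's code: `TTLCache`, `in_batches`, `prewarm`, and the `load_many` callback are hypothetical names, and the cache is a plain in-process dictionary. Loading keys in fixed-size batches keeps each backend query bounded, so a cold cache is filled steadily instead of triggering a burst of individual misses.

```python
import time
from itertools import islice


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock   # injectable for testing
        self._store = {}      # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[0] <= self._clock():
            self._store.pop(key, None)  # drop expired entries lazily
            return None                 # a miss (None values are not supported)
        return entry[1]

    def set(self, key, value):
        self._store[key] = (self._clock() + self.ttl, value)


def in_batches(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def prewarm(cache, keys, load_many, batch_size=100, pause=0.0, sleep=time.sleep):
    """Fill the cache batch by batch; returns the number of backend calls.

    load_many(list_of_keys) -> dict is a hypothetical bulk loader.
    """
    calls = 0
    for chunk in in_batches(keys, batch_size):
        for key, value in load_many(chunk).items():
            cache.set(key, value)
        calls += 1
        if pause:
            sleep(pause)  # optional spacing between batches to limit backend load
    return calls
```

With `TTLCache(ttl_seconds=3 * 60 * 60)`, entries live for the 3 hours mentioned in the mitigation; the right batch size and pause depend on what the backing database can absorb.
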
Description: A hotfix has been deployed and verified to correct the underlying behavior of the author alias reverting on campaign drafts. We are marking all components fully operational and resolving the incident.
Status: Resolved
Impact: None | Started At: Aug. 7, 2024, 4:31 p.m.
Description: Queued campaigns are now being delivered successfully, and we are monitoring the issue.
Status: Monitoring
Impact: None | Started At: July 23, 2024, 11:42 a.m.