Outage and incident data over the last 30 days for Firstup.
OutLogger tracks the status of these components for Firstup:
Component | Status |
---|---|
3rd-Party Dependencies | Active |
Identity Access Management | Active |
Image Transformation API | Active |
SendGrid API v3 | Active |
Zoom Virtual Agent | Active |
Ecosystem | Active |
Connect | Active |
Integrations | Active |
Partner API | Active |
User Sync | Active |
Platforms | Active |
EU Firstup Platform | Active |
US Firstup Platform | Active |
Products | Active |
Classic Studio | Active |
Creator Studio | Active |
Insights | Active |
Microapps | Active |
Mobile Experience | Active |
Web Experience | Active |
View the latest incidents for Firstup and check for official updates:
Description:

## Summary:
On February 28th, 2024, starting at around 1:11 PM PT (18:11 UTC), we started receiving reports that some users had not received an email from a scheduled campaign, followed by additional reports on February 29th, 2024, that some scheduled campaigns were still showing in the scheduled folder in Studio past their scheduled publish time.

## Impact:
Impact was primarily related to campaigns that were scheduled to publish between 02.28.2024 at 11:16 AM ET and 02.29.2024 at 1:06 PM ET.

## Root Cause:
The root cause was determined to be memory exhaustion in our core database on 02.28.2024 at 11:16 AM ET, which triggered an automatic database failover by the AWS infrastructure. Post-failover, dependent services that manage scheduled campaigns did not automatically reconnect to the failover database and therefore could not initiate a “publish” event for scheduled campaigns at the scheduled time.

## Mitigation:
The immediate problem was mitigated by querying the database for past-due scheduled campaigns and manually publishing them. Additionally, the services responsible for scheduled campaigns were manually restarted to establish connections to the failover database, allowing them to initiate “publish” events for scheduled campaigns as expected.

## Recurrence Prevention:
An incident response team post-mortem meeting identified the following recurrence prevention measures:
* Remove SQL comments to reduce database memory consumption.
* Increase database instance size by upgrading the Postgres version.
* Improve monitoring and alerting on database connections and memory usage using dedicated dashboards that include links to runbooks and mitigation instructions.
* Fix failover and error handling in the affected services.
Status: Postmortem
Impact: None | Started At: Feb. 29, 2024, 4:11 p.m.
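The postmortem above attributes the delayed publishes to services that kept stale connections to the old primary after the AWS-managed failover. As an illustration of the kind of reconnect handling the “fix failover and error handling” item calls for, here is a minimal sketch, assuming a Python service using psycopg2; the DSN, table, and column names are hypothetical, not Firstup's actual implementation:

```python
import time

import psycopg2
from psycopg2 import InterfaceError, OperationalError

DSN = "host=db.example.internal dbname=campaigns user=scheduler"  # hypothetical endpoint


class ReconnectingDB:
    """Re-establishes the database connection when the primary goes away,
    e.g. after a managed failover, instead of failing silently."""

    def __init__(self, dsn, retries=5, backoff=2.0):
        self.dsn = dsn
        self.retries = retries
        self.backoff = backoff
        self.conn = None

    def _connect(self):
        # A fresh connect re-resolves the database endpoint, so once the
        # failover completes it lands on the new primary.
        return psycopg2.connect(self.dsn)

    def execute(self, sql, params=None):
        last_err = None
        for attempt in range(self.retries):
            try:
                if self.conn is None or self.conn.closed:
                    self.conn = self._connect()
                with self.conn.cursor() as cur:
                    cur.execute(sql, params)
                self.conn.commit()
                return
            except (OperationalError, InterfaceError) as err:
                # Connection is stale or the server went away: drop it,
                # back off, and retry with a brand-new connection.
                last_err = err
                if self.conn is not None:
                    try:
                        self.conn.close()
                    except Exception:
                        pass
                    self.conn = None
                time.sleep(self.backoff * (attempt + 1))
        raise RuntimeError("database unreachable after retries") from last_err


# Hypothetical usage: publish any campaigns whose scheduled time has passed.
# db = ReconnectingDB(DSN)
# db.execute(
#     "UPDATE campaigns SET state = 'publishing' "
#     "WHERE state = 'scheduled' AND publish_at <= now()"
# )
```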
Description:

## Summary:
On February 15th, 2024, beginning at approximately 5:50 AM PT (13:50 UTC), we started receiving reports of several platform services being unavailable, including Microapps and Partner APIs. Errors persisted intermittently for just over an hour, primarily for these two services as well as for any new user requests that required IP address resolution through an authoritative DNS (domain name service) server.

## Impact:
The impact was primarily related to services with very low DNS TTL (time to live) thresholds and to new end-user requests that required a fresh DNS lookup. Observed error conditions included request timeouts and HTTP 500 gateway errors. Multiple services were in scope of the platform incident; availability depended on whether the service IP had been cached locally or whether the DNS request could be serviced within the reduced available capacity.

## Root Cause:
The root cause was determined to be an unexpected drop in overall DNS service capacity. A planned maintenance regressed an earlier performance improvement, reducing the number of Core DNS services running in production and thus limiting the overall available capacity for inbound DNS traffic.

## Mitigation:
The immediate problem was mitigated by restoring Core DNS capacity as soon as the discrepancy was discovered at 6:30 AM PT (14:30 UTC) by the incident response team. Remaining error rates began to improve markedly by 6:45 AM PT (14:45 UTC), and all services were confirmed to be fully stabilized by 7:15 AM PT (15:15 UTC).

## Recurrence Prevention:
A technical team postmortem meeting reviewed the change management process that allowed an errant default setting for the number of DNS nodes to be pushed to production, how to improve platform alert visibility of this condition in the future, and how to prevent unexpected loss of DNS service capacity. The following changes have since been instituted:
* An alert will now fire any time the core DNS capacity drops below the minimal viable threshold determined by Site Reliability Engineering.
* All core service nodes will now launch with an attached DNS service component automatically.
* Load testing has been performed to ensure scalability and an appropriate buffer for potential spikes and organic growth in DNS request volume.
* Infrastructure change management has been updated to ensure that any future configuration changes persist following service restarts.
Status: Postmortem
Impact: None | Started At: Feb. 15, 2024, 2:28 p.m.
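The first prevention item above is an alert whenever core DNS capacity drops below a minimal viable threshold. A minimal sketch of such a check, assuming the DNS layer runs as a standard CoreDNS Deployment on Kubernetes and using the official kubernetes Python client; the deployment name, namespace, and threshold are assumptions, not Firstup's actual configuration:

```python
from kubernetes import client, config

# Assumed values; in practice the deployment name, namespace, and minimum
# viable replica count would come from SRE-owned configuration.
DEPLOYMENT = "coredns"
NAMESPACE = "kube-system"
MIN_READY_REPLICAS = 4


def check_dns_capacity():
    """Return (ok, ready_count) so a periodic job can page when the number
    of ready DNS replicas falls below the minimal viable threshold."""
    config.load_incluster_config()  # use config.load_kube_config() when run locally
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(name=DEPLOYMENT, namespace=NAMESPACE)
    ready = dep.status.ready_replicas or 0  # ready_replicas is None when nothing is ready
    return ready >= MIN_READY_REPLICAS, ready


if __name__ == "__main__":
    ok, ready = check_dns_capacity()
    if not ok:
        # In production this would go to the paging/alerting system, not stdout.
        print(f"ALERT: only {ready} DNS replicas ready (minimum {MIN_READY_REPLICAS})")
```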
Description:
On February 12th, 2024, starting at around 9:20 AM ET, we received reports that some users were experiencing issues accessing Creator Studio. The reports included general slowness loading Studio functions, as well as system error messages such as “Failed to fetch. Try again” or “504 Bad Gateway”.

During our investigation, we identified that the number of available database connections was exhausted, so new Creator Studio requests could not establish a connection to the database. To mitigate this, a rolling redeployment of the dependent backend services that held connections to the database was performed to free up any “stuck” connections, allowing new Creator Studio requests to connect to the database.

As a recurrence prevention measure for this incident, we are working on reducing the demand for database connections in the short term via methods such as:
* Connection pooling
* Read/write splitting

In the long term, we will be looking at:
* Upgrading our database infrastructure
* Increasing our database connection limits.
Status: Postmortem
Impact: None | Started At: Feb. 12, 2024, 2:47 p.m.
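Connection pooling, the first short-term measure listed above, caps and reuses database connections so that request spikes cannot exhaust the server-side connection limit. A minimal sketch, assuming a Python backend and psycopg2's built-in ThreadedConnectionPool; the DSN, pool sizes, and query are illustrative, not Firstup's actual implementation:

```python
from contextlib import contextmanager

from psycopg2.pool import ThreadedConnectionPool

# Illustrative sizes: the pool never opens more than MAX_CONN connections,
# so this service alone cannot exhaust the database's connection limit.
MIN_CONN = 2
MAX_CONN = 20
DSN = "host=db.example.internal dbname=studio user=studio_app"  # hypothetical endpoint

pool = ThreadedConnectionPool(MIN_CONN, MAX_CONN, dsn=DSN)


@contextmanager
def db_connection():
    """Borrow a connection from the pool and always return it, even on error,
    so connections cannot end up 'stuck' the way the incident describes."""
    conn = pool.getconn()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        pool.putconn(conn)


# Hypothetical usage inside a request handler:
# with db_connection() as conn:
#     with conn.cursor() as cur:
#         cur.execute("SELECT id, title FROM campaigns WHERE state = %s", ("scheduled",))
#         rows = cur.fetchall()
```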