Get notified about any outages, downtime or incidents for Firstup and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Firstup.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Firstup:

Component | Status
---|---
View the latest incidents for Firstup and check for official updates:
Description:

## **Summary:**

On April 22nd, 2024, at 6:15 AM PT (13:15 UTC), we began receiving reports of scheduled campaigns that were delayed or had not been published at all. Two sources of the delays were identified and subsequently addressed in two separate hotfixes.

## **Impact:**

Impact was most visible in campaign reporting delivery metrics, which showed that campaigns had either not gone out at the expected time or that email deliveries arrived well after the scheduled time. Not all campaigns were affected, and actual delays ranged from several minutes up to an hour or longer in a small number of instances.

## **Root Cause:**

The root cause was determined to be a scheduled database upgrade performed on April 19th, which degraded the performance of the scheduling service. There were two underlying observable symptoms:

1. On April 22nd, the actual delivery of some emails was slower than expected because several database queries were not optimized for the new database software version deployed on April 19th. These queries ran slower after the upgrade under higher load levels than had initially been tested against.
2. The number of scheduled campaigns not executing at the precise scheduled time increased dramatically, also following the database upgrade, as a result of several newly uncovered bugs in the scheduling service itself.

## **Mitigation:**

A number of mitigation measures were put into place to address different aspects of this platform incident over the course of several days.

* The database query optimizations were deployed in a hotfix on April 22nd at 4:30 PM PT (23:30 UTC). This was specifically aimed at addressing the email delivery slowness issue.
* For customers who opened support tickets related to specific scheduled campaigns being delayed, those campaigns were manually published as part of the individual support tickets. A separate query was also run on an as-needed basis to proactively identify other campaigns in a similar state and manually publish those as well.
* A second hotfix was deployed on April 24th at 11:30 AM PT (18:30 UTC) to add an automated backstop that catches and publishes any campaigns that had been scheduled at an earlier time but had not actually started (sketched after this entry).

## **Recurrence Prevention:**

The team has committed to the following actions to fully resolve the incident and eliminate reliance on the mitigation measures currently in place.

* Create improved platform alerting for campaign delivery times to identify and address a degraded state earlier.
* Fix the remaining 3 bugs uncovered during the incident investigation and make the scheduler service itself more robust.
Status: Postmortem
Impact: None | Started At: April 22, 2024, 4:53 p.m.
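The automated backstop mentioned above amounts to a periodic job that finds campaigns whose scheduled time has passed without a publish. The following is a minimal sketch of that idea, assuming a hypothetical in-memory `Campaign` record and `publish()` helper; Firstup's actual scheduler internals are not public.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical campaign record; the real scheduler's data model is not public.
@dataclass
class Campaign:
    id: int
    scheduled_at: datetime
    status: str = "scheduled"  # "scheduled" | "published"

def publish(campaign: Campaign) -> None:
    """Stand-in for the real publish call."""
    campaign.status = "published"
    print(f"published campaign {campaign.id} (was due {campaign.scheduled_at:%H:%M})")

def backstop_publish(campaigns: list[Campaign], grace: timedelta = timedelta(minutes=5)) -> None:
    """Publish any campaign whose scheduled time passed more than `grace` ago
    but which the primary scheduler never started."""
    now = datetime.now(timezone.utc)
    for c in campaigns:
        if c.status == "scheduled" and c.scheduled_at < now - grace:
            publish(c)

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    demo = [
        Campaign(1, now - timedelta(hours=1)),   # missed by the scheduler
        Campaign(2, now + timedelta(hours=2)),   # not due yet
    ]
    backstop_publish(demo)  # in production, run periodically (e.g. every few minutes)
```

In a real deployment this check would run against the campaign database on a schedule, acting only as a safety net behind the primary scheduler.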
Description:

### **Summary:**

On Tuesday, April 16th, 2024, from approximately 9:54 AM UTC to 11:09 AM UTC, EU Studio experienced multiple service disruptions, including general slowness when loading Studio functions, issues with login, and HTTP 500 system error messages. It was identified that a number of backend services were experiencing TCP (Transmission Control Protocol) networking issues that manifested as a variety of user-visible errors and unpredictable product interactions.

### **Impact:**

Affected users were unable to log in to Studio and experienced general slowness and system error messages such as “504 Gateway Timeout” or “502 Bad Gateway” due to network errors in the backend services.

### **Root Cause:**

The root cause was determined to be an unexpected spike in traffic that caused the number of nodes (worker machines) to rapidly increase to handle the additional workload. The additional nodes exceeded the overall capacity for inbound DNS (Domain Name System) traffic, leading to DNS request timeouts.

### **Mitigation:**

The immediate problem was mitigated by increasing DNS capacity within the EU infrastructure and restarting the affected services, restoring system services and performance by 11:09 AM UTC.

### **Recurrence Prevention:**

The following changes have been implemented to prevent unexpected loss of DNS service capacity.

* An alert will now fire within the EU infrastructure any time the internal DNS capacity drops below the minimal viable threshold determined by Site Reliability Engineering.
* Load testing has been performed to ensure scalability and an appropriate buffer for potential spikes and organic growth in DNS request volume (a simple resolution probe is sketched after this entry).
Status: Postmortem
Impact: None | Started At: April 16, 2024, 10:04 a.m.
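The DNS load testing referenced above can be approximated with a concurrent resolution probe that measures lookup latency under parallel load. This sketch uses only the Python standard library and a placeholder hostname; it is not Firstup's test harness, just an illustration of the technique.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median

HOSTNAME = "example.com"  # placeholder; substitute an internal service name

def resolve_once(_: int) -> float:
    """Time a single DNS resolution, in milliseconds."""
    start = time.perf_counter()
    socket.getaddrinfo(HOSTNAME, 443)
    return (time.perf_counter() - start) * 1000

def dns_load_probe(requests: int = 200, concurrency: int = 50) -> None:
    """Issue many concurrent lookups and report latency percentiles."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(resolve_once, range(requests)))
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    print(f"median={median(latencies):.1f} ms  p95={p95:.1f} ms  max={latencies[-1]:.1f} ms")

if __name__ == "__main__":
    dns_load_probe()
```

Running such a probe at progressively higher concurrency levels gives a rough sense of how much headroom the resolvers have before timeouts appear.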
Description:

**Summary:**

On March 15th, 2024, we started receiving reports that scheduled campaigns were publishing late or not publishing at all at their scheduled time.

**Impact:**

The impact was restricted to scheduled campaigns on the FirstUp platform that were scheduled to publish on March 15th, 2024, between 1:00 AM ET (05:00 UTC) and 8:04 PM ET (March 16th, 2024 - 00:04 UTC).

**Root Cause:**

The root cause was determined to be a regression introduced by a software change to the “scheduled campaign callback service”, deployed during our scheduled software release window the previous day, which caused the callback to the “scheduling service” (used to publish a scheduled campaign at its scheduled time) to fail.

**Mitigation:**

A hotfix was deployed by 8:04 PM ET (March 16th, 2024 - 00:04 UTC) to address the software regression introduced in the campaign scheduling software. Any delayed scheduled campaigns were also manually published by the same time.

**Recurrence Prevention:**

The Incident Response Team has taken the following actions in an effort to prevent a recurrence of this incident:

* Implemented additional pre-release regression testing around the “scheduling service”.
* Documented the SQL rake task used to identify any failed/delayed scheduled campaigns in a runbook to aid in quickly mitigating any similar future incidents.
* Created monitors that alert us on the first instance of a failed/delayed scheduled campaign so we can proactively get ahead of any campaign scheduling issue(s) and prevent similar platform-wide incidents (a minimal version of this check is sketched after this entry).
Status: Postmortem
Impact: None | Started At: March 15, 2024, 8:19 p.m.
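The monitor described above reduces to a query for campaigns whose scheduled time has passed with no publish event, alerting on the very first match. Below is a minimal sketch against a hypothetical schema, using an in-memory SQLite database so it runs standalone; Firstup's actual schema and SQL rake task are not public.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical schema: one row per campaign; published_at stays NULL until publish succeeds.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE campaigns (
        id INTEGER PRIMARY KEY,
        scheduled_at TEXT NOT NULL,
        published_at TEXT
    )
""")
now = datetime.now(timezone.utc)
conn.executemany(
    "INSERT INTO campaigns (id, scheduled_at, published_at) VALUES (?, ?, ?)",
    [
        (1, (now - timedelta(minutes=30)).isoformat(), None),  # delayed!
        (2, (now - timedelta(hours=2)).isoformat(), (now - timedelta(hours=2)).isoformat()),
        (3, (now + timedelta(hours=1)).isoformat(), None),     # not due yet
    ],
)

def overdue_campaigns(cutoff: datetime) -> list[int]:
    """IDs of campaigns scheduled before `cutoff` that still have no publish event."""
    rows = conn.execute(
        "SELECT id FROM campaigns WHERE published_at IS NULL AND scheduled_at < ?",
        (cutoff.isoformat(),),
    ).fetchall()
    return [r[0] for r in rows]

# Alert on the first delayed campaign rather than waiting for a pattern to emerge.
delayed = overdue_campaigns(now - timedelta(minutes=5))
if delayed:
    print(f"ALERT: {len(delayed)} scheduled campaign(s) past due: {delayed}")
```

The same query, pointed at the production database and wired to an alerting pipeline, is what turns a silent scheduling failure into an immediate page.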
Description:

## Summary:

On February 28th, 2024, starting at around 1:11 PM PT (18:11 UTC), we started receiving reports that some users had not received an email from a scheduled campaign, followed by additional reports on February 29th, 2024, that some scheduled campaigns were still showing in the scheduled folder in Studio past their scheduled publish time.

## Impact:

Impact was primarily related to campaigns that were scheduled to publish between 02.28.2024 at 11:16 AM ET and 02.29.2024 at 1:06 PM ET.

## Root Cause:

The root cause was determined to be memory exhaustion in our core database on 02.28.2024 at 11:16 AM ET, which triggered an automatic database failover by the AWS infrastructure. Post-failover, the dependent services that manage scheduled campaigns did not automatically reconnect to the failover database and therefore could not initiate a “publish” event for scheduled campaigns at the scheduled time.

## Mitigation:

The immediate problem was mitigated by querying the database for past-due scheduled campaigns and manually publishing them. Additionally, the services responsible for scheduled campaigns were manually restarted to establish connections to the failover database, allowing them to initiate “publish” events for scheduled campaigns as expected.

## Recurrence Prevention:

An incident response team post-mortem meeting identified the following recurrence prevention measures:

* Remove SQL comments to reduce database memory consumption.
* Increase database instance size by upgrading the Postgres version.
* Improve monitoring and alerting on database connections and memory usage using dedicated dashboards that include links to runbooks and mitigation instructions.
* Fix failover and error handling in the affected services (a generic reconnect pattern is sketched after this entry).
Status: Postmortem
Impact: None | Started At: Feb. 29, 2024, 4:11 p.m.
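Fixing failover handling in a dependent service generally means wrapping database work in retry-with-reconnect logic, so that a dead connection is replaced with one pointing at the new primary. The sketch below is a generic stand-alone pattern, not Firstup's implementation; a real service would catch its driver's connection errors (for example psycopg2's OperationalError) rather than the broad exception used here, and the demo stand-ins are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def run_with_reconnect(
    connect: Callable[[], dict],
    operation: Callable[[dict], T],
    retries: int = 5,
    backoff_s: float = 1.0,
) -> T:
    """Run `operation(conn)`, reconnecting and retrying if the connection fails.

    After a database failover the old connection is dead; calling `connect()`
    again picks up the new primary instead of erroring until a manual restart.
    """
    conn = connect()
    for attempt in range(retries):
        try:
            return operation(conn)
        except Exception:                       # real code: driver-specific connection errors
            if attempt == retries - 1:
                raise
            time.sleep(backoff_s * 2 ** attempt)  # exponential backoff between attempts
            conn = connect()                      # reconnect, now pointing at the new primary
    raise RuntimeError("unreachable")

# Tiny demo with stand-ins for a real driver's connect() and query call.
if __name__ == "__main__":
    attempts = {"n": 0}

    def fake_connect() -> dict:
        return {"healthy": attempts["n"] >= 1}   # the first connection "dies" in the failover

    def fake_query(conn: dict) -> str:
        attempts["n"] += 1
        if not conn["healthy"]:
            raise ConnectionError("server closed the connection unexpectedly")
        return "ok"

    print(run_with_reconnect(fake_connect, fake_query, backoff_s=0.1))
```

The key design point is that the reconnect happens inside the retry loop, so a failover is absorbed transparently instead of requiring a manual service restart.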
Description:

## Summary:

On February 15th, 2024, beginning at approximately 5:50 AM PT (13:50 UTC), we started receiving reports of several platform services being unavailable, including Microapps and Partner APIs. Errors persisted intermittently for just over an hour, primarily for these two services as well as for any new user requests that required IP address resolution through an authoritative DNS (Domain Name System) server.

## Impact:

The impact was primarily related to services with very low TTL (time to live) thresholds for DNS and to new end-user requests that required a fresh DNS lookup. Observed error conditions included request timeouts and HTTP 500 gateway errors. Multiple services were in scope of the platform incident; availability depended on whether the service IP had been cached locally or whether the DNS request could be serviced within the reduced available capacity.

## Root Cause:

The root cause was determined to be an unexpected drop in overall DNS service capacity. An earlier planned maintenance regressed a previous performance improvement, reducing the number of Core DNS services running in production and thus limiting the overall available capacity for inbound DNS traffic.

## Mitigation:

The immediate problem was mitigated by restoring Core DNS capacity as soon as the discrepancy was discovered at 6:30 AM PT (14:30 UTC) by the incident response team. Remaining error rates began to improve markedly by 6:45 AM PT (14:45 UTC), and all services were confirmed to be fully stabilized by 7:15 AM PT (15:15 UTC).

## Recurrence Prevention:

A technical team postmortem reviewed the change management process that allowed an errant default setting for the number of DNS nodes to be pushed to production, how to improve platform alert visibility of this condition in the future, and how to prevent unexpected loss of DNS service capacity. The following changes have since been instituted:

* An alert will now fire any time the core DNS capacity drops below the minimal viable threshold determined by Site Reliability Engineering (a capacity check of this kind is sketched after this entry).
* All core service nodes will now launch with an attached DNS service component automatically.
* Load testing has been performed to ensure scalability and an appropriate buffer for potential spikes and organic growth in DNS request volume.
* Infrastructure change management has been updated to ensure that any future configuration changes persist following service restarts.
Status: Postmortem
Impact: None | Started At: Feb. 15, 2024, 2:28 p.m.
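The capacity alert referenced above can be sketched as a periodic check of how many DNS replicas are actually ready versus a minimum viable threshold. The sketch below assumes CoreDNS running as a Kubernetes Deployment named "coredns" in the "kube-system" namespace and uses the official kubernetes Python client; the deployment name, namespace, and threshold value are assumptions for illustration, not details from the incident report.

```python
from kubernetes import client, config

MIN_READY_REPLICAS = 6  # assumed "minimal viable threshold"; set per SRE capacity planning

def check_coredns_capacity(name: str = "coredns", namespace: str = "kube-system") -> None:
    """Alert if the number of ready CoreDNS replicas drops below the threshold."""
    config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(name=name, namespace=namespace)
    ready = dep.status.ready_replicas or 0
    desired = dep.spec.replicas or 0
    if ready < MIN_READY_REPLICAS:
        # In production this would page via the alerting pipeline rather than print.
        print(f"ALERT: only {ready}/{desired} CoreDNS replicas ready "
              f"(minimum viable is {MIN_READY_REPLICAS})")
    else:
        print(f"OK: {ready}/{desired} CoreDNS replicas ready")

if __name__ == "__main__":
    check_coredns_capacity()
```

Checking the ready count against an explicit floor, rather than against the configured replica count alone, is what catches the failure mode in this incident: a configuration change that silently lowered the default number of DNS nodes.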