Get notified about any outages, downtime or incidents for Firstup and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Firstup.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Firstup:

Component | Status
---|---
View the latest incidents for Firstup and check for official updates:
Description:

## **Summary:**

On April 22nd, 2024, at 6:15 AM PT (13:15 UTC), we began receiving reports of scheduled campaigns that were delayed or had not been published at all. Two sources of the delays were identified and subsequently addressed in two separate hotfixes.

## **Impact:**

Impact was most visible in campaign reporting delivery metrics, which showed that campaigns had either not gone out at the expected time or that email deliveries arrived well after the scheduled time. Not all campaigns were affected, and actual delays ranged from several minutes up to an hour or longer in a small number of instances.

## **Root Cause:**

The root cause was determined to be a scheduled database upgrade performed on April 19th, which degraded the performance of the scheduling service. There were two underlying observable symptoms:

1. On April 22nd, the actual delivery of some emails was slower than expected because several database queries were not optimized for the new database software version deployed on April 19th. These queries ran slower after the upgrade under higher load levels than had initially been tested against.
2. The number of scheduled campaigns not executing at the precise scheduled time increased dramatically, also following the database upgrade, as a result of several newly uncovered bugs in the scheduling service itself.

## **Mitigation:**

A number of mitigation measures were put into place to address different aspects of this platform incident over the course of several days.

* The database query optimizations were deployed in a hotfix on April 22nd at 4:30 PM PT (23:30 UTC). This was specifically aimed at addressing the email delivery slowness issue.
* For customers who opened support tickets related to specific scheduled campaigns being delayed, those campaigns were manually published as part of the individual support tickets. A separate query was also run on an as-needed basis to proactively identify other campaigns in a similar state and manually publish those as well.
* A second hotfix was deployed on April 24th at 11:30 AM PT (18:30 UTC) to add an automated backstop that catches and publishes any campaigns that had been scheduled at an earlier time but had not actually started (sketched after this entry).

## **Recurrence Prevention:**

The team has committed to the following actions to fully resolve the incident and eliminate reliance on the mitigation measures currently in place.

* Create improved platform alerting for campaign delivery times to identify and address a degraded state earlier.
* Fix the remaining 3 bugs uncovered during the incident investigation and make the scheduler service itself more robust.
Status: Postmortem
Impact: None | Started At: April 22, 2024, 4:53 p.m.
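The automated backstop mentioned above amounts to a periodic job that finds campaigns whose scheduled time has passed without a publish. The following is a minimal sketch of that idea, assuming a hypothetical in-memory `Campaign` record and `publish()` helper; Firstup's actual scheduler internals are not public.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical campaign record; the real scheduler's data model is not public.
@dataclass
class Campaign:
    id: int
    scheduled_at: datetime
    status: str = "scheduled"  # "scheduled" | "published"

def publish(campaign: Campaign) -> None:
    """Stand-in for the real publish call."""
    campaign.status = "published"
    print(f"published campaign {campaign.id} (was due {campaign.scheduled_at:%H:%M})")

def backstop_publish(campaigns: list[Campaign], grace: timedelta = timedelta(minutes=5)) -> None:
    """Publish any campaign whose scheduled time passed more than `grace` ago
    but which the primary scheduler never started."""
    now = datetime.now(timezone.utc)
    for c in campaigns:
        if c.status == "scheduled" and c.scheduled_at < now - grace:
            publish(c)

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    demo = [
        Campaign(1, now - timedelta(hours=1)),   # missed by the scheduler
        Campaign(2, now + timedelta(hours=2)),   # not due yet
    ]
    backstop_publish(demo)  # in production, run periodically (e.g. every few minutes)
```

In a real deployment this check would run against the campaign database on a schedule, acting only as a safety net behind the primary scheduler.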
Description:

### **Summary:**

On Tuesday, April 16th, 2024, from approximately 9:54 AM UTC to 11:09 AM UTC, EU Studio experienced multiple service disruptions, including general slowness when loading Studio functions, issues with login, and HTTP 500 system error messages. It was identified that a number of backend services were experiencing TCP (Transmission Control Protocol) networking issues that manifested as a variety of user-visible errors and unpredictable product interactions.

### **Impact:**

Affected users were unable to log in to Studio and experienced general slowness and system error messages such as “504 Gateway Timeout” or “502 Bad Gateway” due to network errors in the backend services.

### **Root Cause:**

The root cause was determined to be an unexpected spike in traffic that caused the number of nodes (worker machines) to rapidly increase to handle the additional workload. The additional nodes exceeded the overall capacity for inbound DNS (Domain Name System) traffic, leading to DNS request timeouts.

### **Mitigation:**

The immediate problem was mitigated by increasing DNS capacity within the EU infrastructure and restarting the affected services, restoring system services and performance by 11:09 AM UTC.

### **Recurrence Prevention:**

The following changes have been implemented to prevent unexpected loss of DNS service capacity.

* An alert will now fire within the EU infrastructure any time the internal DNS capacity drops below the minimal viable threshold determined by Site Reliability Engineering.
* Load testing has been performed to ensure scalability and an appropriate buffer for potential spikes and organic growth in DNS request volume (a simple resolution probe is sketched after this entry).
Status: Postmortem
Impact: None | Started At: April 16, 2024, 10:04 a.m.
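The DNS load testing referenced above can be approximated with a concurrent resolution probe that measures lookup latency under parallel load. This sketch uses only the Python standard library and a placeholder hostname; it is not Firstup's test harness, just an illustration of the technique.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median

HOSTNAME = "example.com"  # placeholder; substitute an internal service name

def resolve_once(_: int) -> float:
    """Time a single DNS resolution, in milliseconds."""
    start = time.perf_counter()
    socket.getaddrinfo(HOSTNAME, 443)
    return (time.perf_counter() - start) * 1000

def dns_load_probe(requests: int = 200, concurrency: int = 50) -> None:
    """Issue many concurrent lookups and report latency percentiles."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(resolve_once, range(requests)))
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    print(f"median={median(latencies):.1f} ms  p95={p95:.1f} ms  max={latencies[-1]:.1f} ms")

if __name__ == "__main__":
    dns_load_probe()
```

Running such a probe at progressively higher concurrency levels gives a rough sense of how much headroom the resolvers have before timeouts appear.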
Description:

**Summary:**

On March 15th, 2024, we started receiving reports that scheduled campaigns were publishing late or not publishing at all at their scheduled time.

**Impact:**

The impact was restricted to scheduled campaigns on the FirstUp platform that were scheduled to publish on March 15th, 2024, between 1:00 AM ET (05:00 UTC) and 8:04 PM ET (March 16th, 2024 - 00:04 UTC).

**Root Cause:**

The root cause was determined to be a regression introduced by a software change to the “scheduled campaign callback service”, deployed during our scheduled software release window the previous day, which caused the callback to the “scheduling service” (used to publish a scheduled campaign at its scheduled time) to fail.

**Mitigation:**

A hotfix was deployed by 8:04 PM ET (March 16th, 2024 - 00:04 UTC) to address the software regression introduced in the campaign scheduling software. Any delayed scheduled campaigns were also manually published by the same time.

**Recurrence Prevention:**

The Incident Response Team has taken the following actions in an effort to prevent a recurrence of this incident:

* Implemented additional pre-release regression testing around the “scheduling service”.
* Documented the SQL rake task used to identify any failed/delayed scheduled campaigns in a runbook to aid in quickly mitigating any similar future incidents.
* Created monitors that alert us on the first instance of a failed/delayed scheduled campaign so we can proactively get ahead of any campaign scheduling issue(s) and prevent similar platform-wide incidents (a minimal version of this check is sketched after this entry).
Status: Postmortem
Impact: None | Started At: March 15, 2024, 8:19 p.m.
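The monitor described above reduces to a query for campaigns whose scheduled time has passed with no publish event, alerting on the very first match. Below is a minimal sketch against a hypothetical schema, using an in-memory SQLite database so it runs standalone; Firstup's actual schema and SQL rake task are not public.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical schema: one row per campaign; published_at stays NULL until publish succeeds.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE campaigns (
        id INTEGER PRIMARY KEY,
        scheduled_at TEXT NOT NULL,
        published_at TEXT
    )
""")
now = datetime.now(timezone.utc)
conn.executemany(
    "INSERT INTO campaigns (id, scheduled_at, published_at) VALUES (?, ?, ?)",
    [
        (1, (now - timedelta(minutes=30)).isoformat(), None),  # delayed!
        (2, (now - timedelta(hours=2)).isoformat(), (now - timedelta(hours=2)).isoformat()),
        (3, (now + timedelta(hours=1)).isoformat(), None),     # not due yet
    ],
)

def overdue_campaigns(cutoff: datetime) -> list[int]:
    """IDs of campaigns scheduled before `cutoff` that still have no publish event."""
    rows = conn.execute(
        "SELECT id FROM campaigns WHERE published_at IS NULL AND scheduled_at < ?",
        (cutoff.isoformat(),),
    ).fetchall()
    return [r[0] for r in rows]

# Alert on the first delayed campaign rather than waiting for a pattern to emerge.
delayed = overdue_campaigns(now - timedelta(minutes=5))
if delayed:
    print(f"ALERT: {len(delayed)} scheduled campaign(s) past due: {delayed}")
```

The same query, pointed at the production database and wired to an alerting pipeline, is what turns a silent scheduling failure into an immediate page.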
Description:

## Summary:

On February 28th, 2024, starting at around 1:11 PM PT (18:11 UTC), we started receiving reports that some users had not received an email from a scheduled campaign, followed by additional reports on February 29th, 2024, that some scheduled campaigns were still showing in the scheduled folder in Studio past their scheduled publish time.

## Impact:

Impact was primarily related to campaigns that were scheduled to publish between 02.28.2024 at 11:16 AM ET and 02.29.2024 at 1:06 PM ET.

## Root Cause:

The root cause was determined to be memory exhaustion in our core database on 02.28.2024 at 11:16 AM ET, which triggered an automatic database failover by the AWS infrastructure. Post-failover, the dependent services that manage scheduled campaigns did not automatically reconnect to the failover database and therefore could not initiate a “publish” event for scheduled campaigns at the scheduled time.

## Mitigation:

The immediate problem was mitigated by querying the database for past-due scheduled campaigns and manually publishing them. Additionally, the services responsible for scheduled campaigns were manually restarted to establish connections to the failover database, allowing them to initiate “publish” events for scheduled campaigns as expected.

## Recurrence Prevention:

An incident response team post-mortem meeting identified the following recurrence prevention measures:

* Remove SQL comments to reduce database memory consumption.
* Increase database instance size by upgrading the Postgres version.
* Improve monitoring and alerting on database connections and memory usage using dedicated dashboards that include links to runbooks and mitigation instructions.
* Fix failover and error handling in the affected services (a generic reconnect pattern is sketched after this entry).
Status: Postmortem
Impact: None | Started At: Feb. 29, 2024, 4:11 p.m.
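Fixing failover handling in a dependent service generally means wrapping database work in retry-with-reconnect logic, so that a dead connection is replaced with one pointing at the new primary. The sketch below is a generic stand-alone pattern, not Firstup's implementation; a real service would catch its driver's connection errors (for example psycopg2's OperationalError) rather than the broad exception used here, and the demo stand-ins are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def run_with_reconnect(
    connect: Callable[[], dict],
    operation: Callable[[dict], T],
    retries: int = 5,
    backoff_s: float = 1.0,
) -> T:
    """Run `operation(conn)`, reconnecting and retrying if the connection fails.

    After a database failover the old connection is dead; calling `connect()`
    again picks up the new primary instead of erroring until a manual restart.
    """
    conn = connect()
    for attempt in range(retries):
        try:
            return operation(conn)
        except Exception:                       # real code: driver-specific connection errors
            if attempt == retries - 1:
                raise
            time.sleep(backoff_s * 2 ** attempt)  # exponential backoff between attempts
            conn = connect()                      # reconnect, now pointing at the new primary
    raise RuntimeError("unreachable")

# Tiny demo with stand-ins for a real driver's connect() and query call.
if __name__ == "__main__":
    attempts = {"n": 0}

    def fake_connect() -> dict:
        return {"healthy": attempts["n"] >= 1}   # the first connection "dies" in the failover

    def fake_query(conn: dict) -> str:
        attempts["n"] += 1
        if not conn["healthy"]:
            raise ConnectionError("server closed the connection unexpectedly")
        return "ok"

    print(run_with_reconnect(fake_connect, fake_query, backoff_s=0.1))
```

The key design point is that the reconnect happens inside the retry loop, so a failover is absorbed transparently instead of requiring a manual service restart.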
Description:

## Summary:

On February 15th, 2024, beginning at approximately 5:50 AM PT (13:50 UTC), we started receiving reports of several platform services being unavailable, including Microapps and Partner APIs. Errors persisted intermittently for just over an hour, primarily for these two services as well as for any new user requests that required IP address resolution through an authoritative DNS (Domain Name System) server.

## Impact:

The impact was primarily related to services with very low TTL (time to live) thresholds for DNS and to new end-user requests that required a fresh DNS lookup. Observed error conditions included request timeouts and HTTP 500 gateway errors. Multiple services were in scope of the platform incident; availability depended on whether the service IP had been cached locally or whether the DNS request could be serviced within the reduced available capacity.

## Root Cause:

The root cause was determined to be an unexpected drop in overall DNS service capacity. An earlier planned maintenance regressed a previous performance improvement, reducing the number of Core DNS services running in production and thus limiting the overall available capacity for inbound DNS traffic.

## Mitigation:

The immediate problem was mitigated by restoring Core DNS capacity as soon as the discrepancy was discovered at 6:30 AM PT (14:30 UTC) by the incident response team. Remaining error rates began to improve markedly by 6:45 AM PT (14:45 UTC), and all services were confirmed to be fully stabilized by 7:15 AM PT (15:15 UTC).

## Recurrence Prevention:

A technical team postmortem reviewed the change management process that allowed an errant default setting for the number of DNS nodes to be pushed to production, how to improve platform alert visibility of this condition in the future, and how to prevent unexpected loss of DNS service capacity. The following changes have since been instituted:

* An alert will now fire any time the core DNS capacity drops below the minimal viable threshold determined by Site Reliability Engineering (a capacity check of this kind is sketched after this entry).
* All core service nodes will now launch with an attached DNS service component automatically.
* Load testing has been performed to ensure scalability and an appropriate buffer for potential spikes and organic growth in DNS request volume.
* Infrastructure change management has been updated to ensure that any future configuration changes persist following service restarts.
Status: Postmortem
Impact: None | Started At: Feb. 15, 2024, 2:28 p.m.
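The capacity alert referenced above can be sketched as a periodic check of how many DNS replicas are actually ready versus a minimum viable threshold. The sketch below assumes CoreDNS running as a Kubernetes Deployment named "coredns" in the "kube-system" namespace and uses the official kubernetes Python client; the deployment name, namespace, and threshold value are assumptions for illustration, not details from the incident report.

```python
from kubernetes import client, config

MIN_READY_REPLICAS = 6  # assumed "minimal viable threshold"; set per SRE capacity planning

def check_coredns_capacity(name: str = "coredns", namespace: str = "kube-system") -> None:
    """Alert if the number of ready CoreDNS replicas drops below the threshold."""
    config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(name=name, namespace=namespace)
    ready = dep.status.ready_replicas or 0
    desired = dep.spec.replicas or 0
    if ready < MIN_READY_REPLICAS:
        # In production this would page via the alerting pipeline rather than print.
        print(f"ALERT: only {ready}/{desired} CoreDNS replicas ready "
              f"(minimum viable is {MIN_READY_REPLICAS})")
    else:
        print(f"OK: {ready}/{desired} CoreDNS replicas ready")

if __name__ == "__main__":
    check_coredns_capacity()
```

Checking the ready count against an explicit floor, rather than against the configured replica count alone, is what catches the failure mode in this incident: a configuration change that silently lowered the default number of DNS nodes.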