Firstup Status: Check if Firstup is down or having an outage.

Platform Service Degradation - Email Delivery Delays

Description:

## **Summary:**

On Tuesday, May 14th, 2024, from approximately 4 PM PT until Wednesday, May 15th, 2024, at 10:06 AM PT, outbound email that had previously been sent from [sendgrid.net](http://sendgrid.net), a backstop sender domain, was instead sent via another, non-allowlisted domain. This resulted in delivery failures or delays for some Customers whose inbound email rules explicitly match the [sendgrid.net](http://sendgrid.net) sender domain.

## **Impact:**

For any Customers without their own authenticated sender domain records (a setup that goes against Firstup and industry best practices), any campaigns set to publish or re-engage users during the affected time window would have appeared to originate from a sender domain other than [sendgrid.net](http://sendgrid.net). Depending on individual email rules, those emails could have been blocked, quarantined, marked as spam, or subjected to any number of other email policy-specific actions.

## **Root Cause:**

The root cause was a default sender domain being set while a new authenticated email domain was being configured for a newly onboarded Firstup customer. Previously, no default domain was configured, which resulted in [sendgrid.net](http://sendgrid.net) being used as a backstop sender domain, despite it not being included in the allowlisting article at [https://support.firstup.io/hc/en-us/articles/4417455533975-Allowlist-Emails-from-Firstup](https://support.firstup.io/hc/en-us/articles/4417455533975-Allowlist-Emails-from-Firstup).

Both user error and a UI design limitation in the 3rd party software used to add new authenticated domains contributed to the errantly created default sender domain. Specifically, the configuration page does not have any explicit save button or confirmation dialogue protecting the checkbox that sets the new record as the default sender domain to use platform-wide. The checkbox was inadvertently selected while creating a screenshot of the configuration settings to be shared with the newly onboarded customer.

## **Mitigation:**

To mitigate, the default sender domain was restored as soon as the root cause became clear.

## **Recurrence Prevention:**

The following actions have been committed to in order to fully resolve the incident and eliminate the reliance on the mitigation measure currently in place.

* A scheduled maintenance window has been posted for June 15th, 2024, outlining a planned update to a new sender domain that should already be allowlisted, [email.socialchorus.net](http://email.socialchorus.net). Specific details of the maintenance can be found at [https://status.firstup.io/incidents/jfv1s06qyv3v](https://status.firstup.io/incidents/jfv1s06qyv3v)
* Firstup will again review any Customers who do not have authenticated sender domains configured, with the aim of setting up DMARC/SPF records and Customer-specific sender domains so that the backstop or default domain is never needed to send program-specific email.
* A feature request has been filed with Sendgrid, the 3rd party email provider, to add protections around the checkbox selection in the user interface so that an unintentional user action cannot change the default sender domain.
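
The fallback behavior described in the root cause can be pictured as a small resolution function. This is only an illustrative sketch: the function name, its parameters, and the order of the checks are assumptions made for this example, and the postmortem does not describe how the email provider actually selects a sender domain. Only the backstop value `sendgrid.net` comes from the text above.

```python
from typing import Optional

BACKSTOP_DOMAIN = "sendgrid.net"  # backstop sender domain named in the postmortem


def resolve_sender_domain(
    customer_domain: Optional[str],
    platform_default: Optional[str],
) -> str:
    """Pick a sender domain for an outbound email (hypothetical logic)."""
    # A Customer with an authenticated sender domain is unaffected by the default.
    if customer_domain:
        return customer_domain
    # If a platform-wide default exists, it silently replaces the backstop.
    if platform_default:
        return platform_default
    # With no default configured, the backstop domain is used.
    return BACKSTOP_DOMAIN
```

Under these assumptions, setting any platform-wide default changes the result for every Customer without an authenticated domain, which matches the scope described in the Impact section.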

Status: Postmortem

Impact: None | Started At: May 14, 2024, 11 p.m.

Updates:

Time: May 17, 2024, 10:39 p.m.

Status: Postmortem

Update: Postmortem published. The full text matches the Description above.
Time: May 15, 2024, 5:17 p.m.

Status: Resolved

Update: We are currently investigating reports of delayed email deliveries.

Platform Service Degradation - Studio Performance Degraded or Inaccessible

Description:

**Summary:**

On Tuesday, May 14th, 2024, starting at around 11:25 AM EDT (15:25 UTC), we received reports that some users saw errors while accessing the Studio platform or the Web Experience. Reported error messages included:

* We’re sorry, but something went wrong.
* 502 Bad Gateway.
* There was an error processing your request. Please try again.

**Scope:**

This incident primarily affected users who attempted to access Studio services and, to a lesser degree, users who tried to access the Web Experience between 11:25 AM EDT and 12:09 PM EDT.

**Root Cause:**

An underlying service (Athena), which is used as part of our machine learning and AI infrastructure, experienced access issues connecting with one of our core database servers due to high network latency. The service had timeouts configured that were too large for its access pattern and the data it uses, causing it to block incoming connections for an inordinate period. Subsequently, services that depend on Athena also timed out, resulting in the Studio service degradation and error messages observed by impacted users.

**Mitigation:**

The immediate impact was mitigated by performing a rolling restart of the affected services, and all Studio functions were restored by 12:09 PM EDT (16:09 UTC).

**Recurrence Prevention:**

To prevent a recurrence of this incident, the Time-To-Live (TTL) of connection requests from Athena to our core database will be reduced from the default 60 seconds to 5 seconds. This will greatly reduce the backup of requests from other services to Athena.
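
A rough back-of-the-envelope calculation shows why the shorter timeout reduces the backup. If every request hangs until its timeout expires, the number of requests waiting at any moment is bounded by the arrival rate multiplied by the timeout (Little's law). The arrival rate of 50 requests per second below is an assumed figure for illustration only; the postmortem gives no traffic numbers.

```python
def max_blocked_requests(arrival_rate_per_s: float, timeout_s: float) -> float:
    """Upper bound on requests waiting at once if each one hangs until timeout."""
    return arrival_rate_per_s * timeout_s


ASSUMED_RATE = 50.0  # requests per second; illustrative, not from the postmortem

print(max_blocked_requests(ASSUMED_RATE, 60.0))  # 3000.0
print(max_blocked_requests(ASSUMED_RATE, 5.0))   # 250.0
```

Whatever the real arrival rate is, moving from 60 seconds to 5 seconds shrinks this worst-case backlog by a factor of 12.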

Status: Postmortem

Impact: None | Started At: May 14, 2024, 3:56 p.m.

Updates:

Time: June 10, 2024, 11:32 p.m.

Status: Postmortem

Update: Postmortem published. The full text matches the Description above.
Time: May 28, 2024, 5:02 p.m.

Status: Resolved

Update: This incident has been resolved.
Time: May 14, 2024, 4:08 p.m.

Status: Monitoring

Update: We have rolled the affected services to restore functionality and will continue to monitor these services for stability.
Time: May 14, 2024, 3:56 p.m.

Status: Investigating

Update: We are currently investigating reports of Studio performing poorly or returning 5xx errors for some users. We will provide you with an update in 1 hour.

Platform Service Degradation - Studio Malfunctions

Description:

**Summary:**

On May 2nd, 2024, at 10:39 AM EDT, a system monitor alerted us to a potential issue: the disk space on a service used to pass messages between backend workers was approaching critical “free disk space limits”. As we started looking into the event condition, customer reports of various Studio functions experiencing issues started coming in, including but not limited to the following conditions:

* Unable to send test campaigns
* Processing error messages
* Test campaign emails are not being delivered
* White screens
* Studio loading issues

A platform incident was declared at 12:41 PM EDT, and the incident response team was engaged to diagnose the reported issues.

**Impact:**

The impact was determined to affect all Studio users who attempted to connect to Studio or initiate new Studio activities.

**Root Cause:**

The incident response team identified that one of the queues in the impacted service was backed up, in effect utilizing too much memory, which led to the out-of-memory condition. As a result, new Studio service requests could not establish connections to this service. The inability to establish connections to the service presented itself as the aforementioned customer-reported issues.

**Mitigation:**

To restore Studio services, the backed-up queue was purged at around 1:00 PM EDT to free up memory, which increased the available disk space for the service. This allowed other queues to continue processing, and allowed new Studio service requests to connect to the service and process successfully. Any affected transactions that were stuck during the purge, such as scheduled campaigns that did not publish, were manually published. No customer data was lost from purging the queue.

**Recurrence Prevention:**

To prevent a recurrence of this incident, we have since deployed a hotfix to the code that checks whether the queue size is over a certain limit before queueing more messages, to prevent this exact out-of-memory failure scenario.
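
The idea behind the hotfix, refusing to enqueue once a size limit is reached, can be sketched with Python's standard in-process `queue.Queue`. This is a minimal illustration, not Firstup's code: the real service is a message-passing system between backend workers, and the limit of 3 and the `try_enqueue` helper are hypothetical.

```python
import queue

MAX_QUEUE_SIZE = 3  # hypothetical limit, chosen small for the demonstration


def try_enqueue(q: queue.Queue, message: str) -> bool:
    """Enqueue a message only if the queue is below its size limit."""
    try:
        q.put_nowait(message)  # raises queue.Full once maxsize is reached
        return True
    except queue.Full:
        return False  # caller can drop, delay, or alert instead of growing memory


q = queue.Queue(maxsize=MAX_QUEUE_SIZE)
results = [try_enqueue(q, f"msg-{i}") for i in range(5)]
print(results)  # [True, True, True, False, False]
```

Rejecting work at the producer keeps memory use bounded, at the cost of requiring the producer to decide what to do with messages that are refused.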

Status: Postmortem

Impact: None | Started At: May 2, 2024, 4:50 p.m.

Updates:

Time: May 17, 2024, 11:54 p.m.

Status: Postmortem

Update: Postmortem published. The full text matches the Description above.
Time: May 17, 2024, 11:54 p.m.

Status: Resolved

Update: This incident has been resolved.
Time: May 2, 2024, 5:05 p.m.

Status: Monitoring

Update: This service degradation has been mitigated, and we are working to identify the root cause. Studio and its functions are now available. Please note that there may be some slight delays in campaign deliveries as tasks catch up within our databases.
Time: May 2, 2024, 4:50 p.m.

Status: Investigating

Update: We are currently investigating reports of various Studio functionalities not performing as expected. We will provide you with another update in 30 minutes.

Platform Service Disruption - User Sync Files Failing To Process

Description:

**Summary:**

On April 30th, 2024, starting at 8:18 AM EDT, we began to receive reports that User Sync files were failing to process, and the following error message was returned:

* Failed to decrypt uploaded file. Please ensure that the correct encryption key and format is used.
* The encryption key expected to be used is [Key Fingerprint].

A platform incident was declared at 10:23 AM EDT and was fully mitigated by 10:59 AM EDT.

**Scope and Impact:**

The scope of this incident was isolated to customers who encrypt their User Sync file before uploading it. The impact was restricted to customers who had uploaded an encrypted User Sync file between 10:03 PM EDT on April 29th, 2024, and 10:59 AM EDT on April 30th, 2024.

**Root Cause:**

The incident response team identified that this incident resulted from a regression in a software release on April 29th, 2024. The OS image used to deploy the upgrade lacked crucial packages for decryption.

**Mitigation:**

At 10:59 AM EDT, the released upgrade was rolled back to its previous version, which contained the decryption packages, to allow normal decryption of encrypted User Sync files. We also identified and reprocessed any encrypted customer User Sync files that had failed to process during the incident.

**Recurrence Prevention:**

A technical team post-mortem review found that the zip-based deployment of the OS had no controls over updating or re-deploying the upgrade. We therefore transitioned to an image-based deployment, which allows for greater control over the OS image and the necessary dependencies. The upgrade was later redeployed on May 6th, 2024, using the OS image that included the necessary decryption packages. We also:

* Added additional monitoring and alerting for the health of external-registration (User Sync files processing).
* Updated regression test packs to include testing user sync with encrypted files.
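
A missing-dependency regression of this kind can often be caught by a check that runs when the image is built or when the service starts. The sketch below is an assumption-laden illustration: the postmortem does not name the decryption packages, so `gpg` in the usage note is only a plausible stand-in for PGP decryption tooling, and `missing_tools` is a hypothetical helper.

```python
import shutil
from typing import Iterable, List


def missing_tools(required: Iterable[str]) -> List[str]:
    """Return the required executables that cannot be found on PATH."""
    return [name for name in required if shutil.which(name) is None]
```

A deployment pipeline could call `missing_tools(["gpg"])` and fail the release if the returned list is non-empty, so that an image without decryption support never reaches production.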

Status: Postmortem

Impact: None | Started At: April 30, 2024, 2:32 p.m.

Updates:

Time: May 28, 2024, 4:28 p.m.

Status: Postmortem

Update: Postmortem published. The full text matches the Description above.
Time: May 2, 2024, 3:53 p.m.

Status: Resolved

Update: We have observed that PGP-encrypted user sync files continue to process successfully. This platform service disruption is now resolved, and an RCA will be provided once a full incident postmortem has been completed.
Time: April 30, 2024, 4:14 p.m.

Status: Monitoring

Update: All impacted user sync files have now been reprocessed successfully. Please note that only PGP-encrypted user sync files were in the scope of this incident. Additional details will be provided once a postmortem of the incident has been completed. This incident is now considered resolved. We will be placing the impacted systems under monitoring for now.
Time: April 30, 2024, 3:10 p.m.

Status: Monitoring

Update: A fix for this issue has been deployed, and user sync files are now processing successfully. We will be re-running any previously failed user sync files, and confirm once completed.
Time: April 30, 2024, 2:59 p.m.

Status: Identified

Update: We have identified the cause of this service disruption, and are working to fix it. Another update in 1 hour.
Time: April 30, 2024, 2:32 p.m.

Status: Investigating

Update: We are currently investigating reports that PGP-encrypted user sync files are failing to process. We will provide you with an update within 1 hour.

Platform Performance Degradation - Intermittent 5xx errors accessing Studio

Description:

## Summary:

On February 8th, 2024, beginning at approximately 1:56 PM EST (18:56 UTC), we started receiving reports of Studio not performing as expected. The symptoms observed by some Studio users included:

* A “failed to fetch” or a “504 Gateway Timeout” error message.
* Unusually slow performance.

A recurrence of this incident was also observed on April 24th, 2024.

## Impact:

Studio users who were actively trying to navigate through and use any Studio functions during these incidents were impacted by the service disruption.

## Root Cause:

Studio services were failing to establish a TCP connection to the Identity and Access Management service (IAM) due to a backup of TCP connection requests. The backup resulted from other “already failed” connection requests that were not dropped because they kept retrying to establish a connection for an extended period.

## Mitigation:

On both days, the immediate problem was mitigated by restarting the backend services that had failed TCP connection attempts, in effect purging the connection request queue of stale connections and allowing new connections to be established with the IAM service.

## Remediation Steps:

Our engineering team is working on reducing the time-to-live duration of all TCP connection requests to the IAM service from the default 60 seconds to 10 seconds. This will allow failed connections to be dropped sooner and reduce the backup of connection requests to IAM.

In addition, we have implemented dashboards to track TCP connection failures and set alerting thresholds on failed TCP connections to help us get ahead of a potential platform service disruption.
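
The remediation amounts to giving each connection attempt a hard deadline so that failed requests are dropped rather than retried for an extended period. A minimal sketch of that pattern follows; the helper name, the retry delay, and the injectable `clock` and `sleep` parameters are assumptions for this example, and only the 10-second figure comes from the postmortem.

```python
import time
from typing import Callable


def connect_with_deadline(
    connect: Callable[[], bool],
    deadline_s: float = 10.0,      # target value named in the postmortem
    retry_delay_s: float = 0.5,    # hypothetical pause between attempts
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry `connect` until it succeeds or the deadline would be exceeded."""
    start = clock()
    while True:
        if connect():
            return True
        # Give up if waiting for another attempt would pass the deadline.
        if clock() - start + retry_delay_s > deadline_s:
            return False
        sleep(retry_delay_s)
```

Here `connect` stands in for whatever call opens the connection to IAM and returns `True` on success. Passing `clock` and `sleep` as parameters keeps the function easy to test without real waiting.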

Status: Postmortem

Impact: None | Started At: April 24, 2024, 10:57 p.m.

Updates:

Time: June 7, 2024, 7:56 p.m.

Status: Postmortem

Update: Postmortem published. The full text matches the Description above.
Time: May 2, 2024, 3:49 p.m.

Status: Resolved

Update: Studio has remained fully accessible following the bouncing of the affected services. This platform service degradation is now considered resolved, and an RCA will be provided once a full incident postmortem has been completed.
Time: April 24, 2024, 11:48 p.m.

Status: Investigating

Update: We have bounced the impacted services to mitigate this performance degradation. Studio is now accessible, as we work to identify the root cause of this incident. We will provide you with another update as soon as more information is made available.
Time: April 24, 2024, 10:57 p.m.

Status: Investigating

Update: We are currently investigating reports of intermittent 5xx errors while accessing Studio. We will provide you with an update within 1 hour.

Is there a Firstup outage?

Firstup status: Systems Active

Firstup outages and incidents

There have been 2 outages or incidents for Firstup in the last 30 days.

