Outage and incident data over the last 30 days for Firstup.
View the latest incidents for Firstup and check for official updates:
Description:
## Summary:
On Wednesday, November 13th, 2024, starting at around 6:14 AM PT, we received reports of some users experiencing general slowness and decreased responsiveness while navigating Firstup's Studio platform. Additional reports of some published campaign emails not being delivered, and of error messages being returned while navigating within the Employee Experience, followed shortly thereafter.
## Severity:
Sev2
## Scope:
The scope of this service degradation included all Firstup customers utilizing the Studio and Employee Experience (EE) endpoints.
## Impact:
For approximately 2.5 hours after the onset of the incident, users were unable to efficiently and consistently navigate Studio and the EE. Symptoms included an extended loading spinner in Studio and missing shortcuts in EE, with some of the errors noted below being returned:
* Oops, an error occurred! Sorry about that.
* TypeError: toMoment(…).format is not a function
Concurrently, campaign emails were not being delivered consistently to their intended recipients, and user sync files were not being actively processed. Most campaign emails were delivered over the course of the next 3 hours, with a few rare exceptions that required manual intervention and customer coordination. User sync was not restored until the following day, when it was discovered that the process was not actively running. Total incident duration, excluding the user sync process, was just over 6 hours.
## Root Cause:
The root cause has been attributed to a code change released during the Platform Software Release maintenance window the night before, which caused a slow-running query from one of our back-end services to run for too long and consume an exceptionally large amount of database resources, including network connections and CPU processing cycles. This, coupled with the normal increase in database requests from our customer base during our platform utilization peak hours starting at around 5:54 AM PT, caused available database connections to be exhausted and CPU overutilization conditions in the database. As a result, new connections to the database could not be established until existing connections were closed and made available for new requests from platform services such as Studio and EE. The backend service responsible for campaign email delivery was also subject to this condition and could not process email deliveries as expected. Similarly, user sync file processing was delayed beyond normal limits.
## Mitigation:
The various symptoms exhibited during the incident were mitigated in phases. The most significant service impact was mitigated after a hotfix was released at 8:29 AM PT to halt the aforementioned slow-running query, relieving resource pressure on the database and allowing customer-facing service requests from Studio and EE to successfully re-establish connections with the database. Additional resources were also spun up to process the email delivery queue backlog that had been growing during the incident, which started draining at 12:02 PM PT. Almost all campaign emails were confirmed delivered by 1:10 PM PT, with the exception of ~150 email messages that had experienced an internal error but were later cleared and delivered.
## Recurrence Prevention:
The following actions have been taken or have been identified as follow-up actions to commit to as part of the formal RCA (Root Cause Assessment) process:
* Moved the slow-running query to run after hours (outside customer peak hours).
* Isolate the campaign email delivery back-end service in its own database cluster to avoid general database impact on email deliveries.
* Enhance our software change management policies to release risky backend changes behind a feature flag for a more controlled release.
* Update post-incident service verification to ensure that user sync processing has been fully restored and remains functional.
Status: Postmortem
Impact: None | Started At: Nov. 13, 2024, 3:07 p.m.
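The prevention items for the incident above mention gating risky backend changes behind a feature flag and keeping the slow-running query from monopolizing the database during peak hours. Below is a minimal, hypothetical sketch of that combination: the flag check, the `run_audit_query` name, and the SQL are illustrative assumptions, not Firstup's actual code; only the `psycopg2` statement-timeout mechanism is a real library feature.

```python
# Hypothetical sketch: gate a risky query behind a feature flag and cap its
# runtime with a server-side statement timeout so a regression cannot hold
# database connections or CPU during peak hours.
import os
import psycopg2


def is_enabled(flag: str) -> bool:
    """Toy feature-flag check; a real system would query a flag service."""
    return os.environ.get(f"FLAG_{flag.upper()}", "off") == "on"


def run_audit_query(dsn: str) -> None:
    if not is_enabled("nightly_audit_query"):
        return  # flag off: the risky change ships but stays inert

    # statement_timeout cancels the query server-side if it runs too long,
    # releasing its connection back to the pool instead of exhausting it.
    conn = psycopg2.connect(dsn, options="-c statement_timeout=30000")  # 30 s
    try:
        with conn, conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM campaigns WHERE archived = false;")
            print(cur.fetchone())
    finally:
        conn.close()
```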
Description:
**Summary:** On Monday, November 4th, 2024, starting at around 7:53 AM PT, we received reports that some published campaign emails were not being delivered to their intended audiences. While some emails were delivered as expected, others were either delayed or appeared to be stuck. All emails were handed off in a timely manner from the Firstup platform to the third-party email provider. The problem worsened over the course of the next several hours: email throughput appeared to be highly restricted, while the backlog of the email delivery queue not only continued to grow but also did not drain in the chronological order in which messages were initially queued. Through a joint troubleshooting call with the third-party email provider, it was determined that a large volume of email delivery errors starting at around 7:11 AM PT had put the entire pool of our email delivery IP addresses in a state of reduced performance. After jointly reviewing the highest-volume errors with the third-party email provider, the sender IPs were restored to a fully functioning state, and the entire email backlog was fully drained by 3:30 PM PT.
**Severity:** Sev2
**Scope:** The scope of this service degradation was restricted to customers who use Firstup campaign email delivery as a channel, as well as any other non-campaign email content sent from the Firstup platform, such as password reset request emails. Push notifications, assistant notifications, and web or mobile experience channels were unaffected and remained fully functional.
**Impact:** Within an individual campaign sent to email as a channel, some emails may not have been delivered as expected, while others wound up stuck in a "processing" state on the third-party email delivery platform. During the incident (7hrs 58mins), some of the emails in the "processing" state were successfully delivered but heavily delayed, while others remained stuck. Observed email throughput was reduced to approximately 46k messages per hour from a theoretical maximum of 30k messages per second. The total outstanding backlog prior to mitigation was over a million email messages.
**Root Cause:** The root cause has been attributed to an elevated level of email delivery errors that triggered a protection mechanism on the third-party email provider platform. This reduced throughput for the entire pool of our sender IP addresses to the point where mostly retries of deferrals from earlier delivery errors were being processed, and very few queued emails were delivered. Essentially the queue-processing equivalent of running in place.
**Mitigation:** Analysis showed that the top contributors to email delivery errors were correlated to a single misconfigured email security endpoint; all addresses associated with that endpoint were force-unsubscribed until it could be correctly configured, to avoid any further email delivery errors contributing to the underlying log jam. Roughly 80k email errors were attributed to that endpoint in just a couple of hours. Through a joint incident bridge with the third-party email provider, Firstup demonstrated that the deferral rates did not account for the overall email backlog queue. A data pipeline engineer was paged and verified that the sender IPs had been relegated to a lower-performing state that was contributing to a circular problem. Backend system changes were made at 3:09 PM PT on the third-party platform to restore the prior state of the sender IPs, and the entire email message backlog subsequently drained fully in less than 25 minutes.
**Recurrence Prevention:** The following actions have been taken or have been identified as follow-up actions to commit to as part of the formal RCA (Root Cause Assessment) process:
* Email addresses contributing to elevated error rates will be bulk-unsubscribed from the platform (or otherwise quarantined) until underlying conditions can be corrected.
* Coordinate with the third-party provider to better understand the characteristics of the platform safety mechanism, including why it triggered, how to avoid it entirely, and how to improve joint monitoring and prevent elevated error rates from affecting overall delivery.
* Implement any reasonable recommendations from the third-party RFO.
Status: Postmortem
Impact: None | Started At: Nov. 4, 2024, 4:40 p.m.
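The first prevention item for the incident above, bulk-unsubscribing addresses that generate elevated error rates, could look roughly like the sketch below. Everything here is an assumption for illustration: the error-log record shape, the error threshold, and the `suppress()` callback stand in for whatever the third-party provider's real suppression mechanism is.

```python
# Hypothetical sketch of the "bulk quarantine" follow-up: group recent
# delivery errors by recipient domain and suppress addresses behind a
# misbehaving endpoint until it is fixed.
from collections import Counter
from typing import Callable, Iterable

ERROR_THRESHOLD = 10_000  # errors per domain before we quarantine (illustrative)


def domains_to_quarantine(errors: Iterable[dict]) -> list[str]:
    """errors: dicts like {"recipient": "user@example.com", "code": "4.7.1"}."""
    counts = Counter(e["recipient"].split("@")[-1].lower() for e in errors)
    return [domain for domain, n in counts.items() if n >= ERROR_THRESHOLD]


def quarantine(errors: list[dict], suppress: Callable[[str], None]) -> None:
    """suppress(address) is a placeholder for an unsubscribe/suppression call."""
    bad_domains = set(domains_to_quarantine(errors))
    for e in errors:
        address = e["recipient"]
        if address.split("@")[-1].lower() in bad_domains:
            suppress(address)  # remove from active sends until the endpoint is fixed
```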
Description:
**Summary:** On Monday, October 14th, 2024, starting at 9:41 AM PDT, we received reports of Studio users receiving the error message "We're sorry, but something went wrong" while attempting to log into Studio via Single Sign-On (SSO). Following a correlation of customer reports and initial troubleshooting, a platform service disruption incident was declared at 11:08 AM PDT and published on our Status Page at 11:11 AM PDT.
**Severity:** Sev2
**Scope:** The scope of this service disruption was restricted to Studio users on the US platform attempting to log into Studio via SSO. Users who were already logged in before the incident, or who used other authentication methods to log into Studio, were unaffected.
**Impact:** Users could not log into Studio via SSO for the duration of this incident (1hr 37mins).
**Root Cause:** The root cause of this incident was attributed to an unexpected hardware failure on the AWS Redis cluster, which triggered a failover event at 9:35 AM PDT. The failover disrupted the authentication flow in the Identity and Access Management (IAM) Redis service, which did not re-establish connections to the failover cluster, leading to the SSO login error.
**Mitigation:** To mitigate this incident, the IAM service was restarted at 11:12 AM PDT to refresh the connections to the failover Redis cluster, which restored the Studio SSO login service.
**Recurrence Prevention:** To prevent this incident from recurring, we will perform the following:
* Introduce self-healing for IAM to automatically reconnect to Redis following failover events. This enhancement will be released in our upcoming Scheduled Software Release maintenance window on November 12th, 2024.
* Perform a gap analysis of the existing IAM monitoring and alerting dashboard.
Status: Postmortem
Impact: None | Started At: Oct. 14, 2024, 6:11 p.m.
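The self-healing behavior described above amounts to noticing a dead Redis connection after a failover and rebuilding the client rather than erroring out. A minimal sketch of that idea, using the open-source redis-py client, is shown below; the wrapper class, endpoint, and retry policy are illustrative assumptions, not the actual IAM implementation.

```python
# Minimal sketch of a self-healing Redis client: on a ConnectionError
# (e.g. after a failover), rebuild the connection and retry the call once.
import redis


class SelfHealingRedis:
    def __init__(self, host: str, port: int = 6379):
        self._host, self._port = host, port
        self._client = self._connect()

    def _connect(self) -> redis.Redis:
        # health_check_interval makes the client ping periodically, so a
        # failed-over primary is noticed without waiting for a user request.
        return redis.Redis(host=self._host, port=self._port,
                           socket_timeout=2, health_check_interval=30)

    def get(self, key: str):
        try:
            return self._client.get(key)
        except redis.exceptions.ConnectionError:
            # The failover dropped the connection: reconnect and retry once.
            self._client = self._connect()
            return self._client.get(key)
```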
Description:
## Summary:
On September 30th, 2024, beginning at approximately 1:24 PM PDT (20:24 UTC), we started receiving reports of Shortcuts intermittently being unavailable and the Assistant returning an error in the Employee Experience. A platform incident was declared at 2:36 PM PDT (21:36 UTC) after initial investigations revealed the issue to be platform-wide.
## Severity:
Sev2
## Scope:
Any user on the US platform accessing the Web or Mobile Experiences intermittently experienced missing Shortcuts and/or received an error message while accessing the Assistant. A refresh of the Employee Experience page occasionally restored these endpoints. All other services in the Employee Experience remained available and functional.
## Impact:
The Shortcuts and Assistant endpoints in the Employee Experience were intermittently unavailable during the incident.
## Root Cause:
The root cause was determined to be an uncharacteristically high number of new user integrations introduced within a short period of time, which exacerbated a newly uncovered, non-optimized content caching behavior. This caused downstream latency and increased error rates in the web service responsible for rendering Shortcuts and the Assistant notification page.
## Mitigation:
The immediate impact was mitigated by restarting the Employee Experience integrations API, and services were restored by 2:42 PM PDT (21:42 UTC). While investigations into the root cause continued, the incident recurred the following day, October 1st, 2024, at 12:54 PM PDT (19:54 UTC). The Employee Experience integrations API and the dependent Employee Experience user-integrations request processing service (Pythia) were restarted, restoring the Shortcuts and Assistant endpoints by 1:46 PM PDT (20:46 UTC). Cache resources for Pythia were increased to mitigate the observed latency.
## Recurrence Prevention:
To prevent this incident from recurring, our engineering incident response team:
* Has developed a fix to optimize how user-integrations requests use the cache, reducing memory consumption and eliminating latency. This fix will be released during our scheduled Software Release maintenance window on October 15th, 2024.
* Will be adding a monitoring and alerting dashboard for the Employee Experience user-integrations request processing service (Pythia).
Status: Postmortem
Impact: None | Started At: Sept. 30, 2024, 9:36 p.m.
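The caching fix described above is about keeping the user-integrations cache bounded so a burst of new integrations cannot balloon memory or add latency to the Shortcuts/Assistant rendering path. A hypothetical sketch of one way to do that is below; the `cachetools` TTL cache is a real library, but the cache sizes, the `get_user_integrations` name, and the `fetch_integrations` upstream call are assumptions for illustration.

```python
# Hypothetical sketch: bound the user-integrations cache by size and TTL so
# stale entries are evicted instead of accumulating.
from cachetools import TTLCache

# At most 50k entries, each kept for 5 minutes (illustrative limits).
_cache: TTLCache = TTLCache(maxsize=50_000, ttl=300)


def get_user_integrations(user_id: str, fetch_integrations) -> list[dict]:
    """fetch_integrations(user_id) is a placeholder for the upstream call."""
    try:
        return _cache[user_id]          # cache hit: no upstream latency
    except KeyError:
        integrations = fetch_integrations(user_id)
        _cache[user_id] = integrations  # oldest/expired entries are evicted
        return integrations
```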
Description:
**Summary:** On September 16th, 2024, starting at around 11:00 AM PDT, we started receiving customer reports that the Web and Mobile Experiences endpoints were unavailable. Following a correlation of these reports and system monitors, a platform incident was declared at 11:14 AM PDT.
**Severity:** Sev1
**Scope:** Any user on the US platform attempting to access the Web and Mobile Experiences intermittently received an error message, and the Employee Experience failed to load.
**Impact:** The core Web and Mobile Experiences platform endpoints were intermittently unavailable for the duration of the incident (1hr 38mins).
**Root Cause:** The root cause was determined to be an exhaustion of the available database connections due to a sudden burst of user engagement activity correlated to a small number of high-visibility campaigns. At 10:50 AM PDT, a dependent back-end service entered a crash-loop back-off state because its database connection requests were being refused, and it returned the error message to end users.
**Mitigation:** The immediate problem was mitigated by fully redeploying the Employee Experience microservice after initial attempts at more surgical, standardized mitigation maneuvers proved ineffective. Those earlier maneuvers focused on reducing database load by temporarily disabling platform features and functionality that make heavy use of database transactions, which reduced overall error rates but did not eliminate customer impact. Web and Mobile Experience availability was restored by 12:28 PM PDT.
**Recurrence Prevention:** To prevent this incident from recurring, our engineering incident response team has:
* Increased the available database connections by 40% to account for any unforeseen spikes in platform traffic.
* Added circuit breakers that intercept abnormal increases in platform traffic, thereby maintaining platform endpoint availability.
* Added an additional incident mitigation maneuver to disable campaign reactions, so that a full-service redeploy would not be required to restore platform availability.
Status: Postmortem
Impact: None | Started At: Sept. 16, 2024, 6:14 p.m.
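The circuit-breaker follow-up above follows a standard pattern: after repeated failures (here, refused database connections), stop sending new requests for a cool-down period so the connection pool can recover instead of crash-looping. The sketch below is a generic, minimal version of that pattern; the thresholds and the `CircuitBreaker` class are illustrative, not Firstup's implementation.

```python
# Minimal circuit-breaker sketch: trip open after repeated failures,
# shed load while open, then allow a trial request after a cool-down.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: shedding load")
            self.opened_at = None  # half-open: allow one trial request
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```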