Last checked: 7 minutes ago
Get notified about any outages, downtime or incidents for Firstup and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Firstup.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Firstup:
| Component | Status |
| --- | --- |
| 3rd-Party Dependencies | Active |
| Identity Access Management | Active |
| Image Transformation API | Active |
| SendGrid API v3 | Active |
| Zoom Virtual Agent | Active |
| Ecosystem | Active |
| Connect | Active |
| Integrations | Active |
| Partner API | Active |
| User Sync | Active |
| Platforms | Active |
| EU Firstup Platform | Active |
| US Firstup Platform | Active |
| Products | Active |
| Classic Studio | Active |
| Creator Studio | Active |
| Insights | Active |
| Microapps | Active |
| Mobile Experience | Active |
| Web Experience | Active |
View the latest incidents for Firstup and check for official updates:
Description: **Summary:** On May 2nd, 2024, at 10:39 AM EDT, a system monitor alerted us that disk space on a service used to pass messages between backend workers was approaching critical free-disk-space limits. As we investigated the alert, customer reports of issues with various Studio functions began coming in, including but not limited to:

* Unable to send test campaigns
* Processing error messages
* Test campaign emails not being delivered
* White screens
* Studio loading issues

A platform incident was declared at 12:41 PM EDT, and the incident response team was engaged to diagnose the reported issues.

**Impact:** All Studio users who attempted to connect to Studio or initiate new Studio activities were affected.

**Root Cause:** The incident response team identified that one of the queues in the affected service was backed up and consuming too much memory, which led to an out-of-memory condition. As a result, new Studio service requests could not establish connections to this service, which surfaced as the customer-reported issues listed above.

**Mitigation:** To restore Studio services, the backed-up queue was purged at around 1:00 PM EDT to free memory, which increased the available disk space for the service. This allowed other queues to continue processing and new Studio service requests to connect and complete successfully. Affected transactions that were stuck during the purge, such as scheduled campaigns that did not publish, were published manually. No customer data was lost from purging the queue.

**Recurrence Prevention:** To prevent a recurrence, we have deployed a hotfix that checks whether the queue size exceeds a set limit before queueing more messages, preventing this exact out-of-memory failure scenario.
Status: Postmortem
Impact: None | Started At: May 2, 2024, 4:50 p.m.
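The queue-size guard described in the recurrence-prevention step above could look roughly like the sketch below. It assumes a RabbitMQ-style broker accessed through the pika client; the queue name, depth limit, and connection details are illustrative placeholders, not Firstup's actual configuration.

```python
import pika

QUEUE_NAME = "studio-backend-jobs"   # hypothetical queue name
MAX_QUEUE_DEPTH = 100_000            # hypothetical safety limit

def publish_with_backpressure(channel, payload: bytes) -> bool:
    """Refuse to enqueue when the queue is already backed up.

    Returns True if the message was published, False if it was rejected
    because the queue depth exceeded the configured limit.
    """
    # A passive declare returns the current depth without modifying the queue.
    depth = channel.queue_declare(queue=QUEUE_NAME, passive=True).method.message_count
    if depth >= MAX_QUEUE_DEPTH:
        # Let the caller retry later or shed load instead of letting the
        # broker exhaust its memory and disk.
        return False
    channel.basic_publish(exchange="", routing_key=QUEUE_NAME, body=payload)
    return True

if __name__ == "__main__":
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE_NAME)  # ensure the queue exists
    ok = publish_with_backpressure(channel, b'{"job": "send-test-campaign"}')
    print("published" if ok else "queue backed up, message rejected")
    connection.close()
```

The key design choice is checking depth on the producer side so pressure is pushed back to callers before the broker hits its memory or disk limits.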
Description: **Summary:** On April 30th, 2024, starting at 8:18 AM EDT, we began to receive reports that User Sync files were failing to process with the following error message:

* Failed to decrypt uploaded file. Please ensure that the correct encryption key and format is used.
* The encryption key expected to be used is [Key Fingerprint].

A platform incident was declared at 10:23 AM EDT and was fully mitigated by 10:59 AM EDT.

**Scope and Impact:** The scope of this incident was isolated to customers who encrypt their User Sync file before uploading it. The impact was restricted to customers who uploaded an encrypted User Sync file between 10:03 PM EDT on April 29th, 2024, and 10:59 AM EDT on April 30th, 2024.

**Root Cause:** The incident response team identified that this incident resulted from a regression in a software release on April 29th, 2024: the OS image used to deploy the upgrade lacked packages required for decryption.

**Mitigation:** At 10:59 AM EDT, the upgrade was rolled back to the previous version, which contained the decryption packages, restoring normal decryption of encrypted User Sync files. We also identified and reprocessed any encrypted customer User Sync files that had failed to process during the incident.

**Recurrence Prevention:** A technical team post-mortem review found that the zip-based OS deployment offered no control over updating or re-deploying the upgrade. We have therefore transitioned to an image-based deployment, which gives greater control over the OS image and its dependencies. The upgrade was redeployed on May 6th, 2024, using an OS image that includes the necessary decryption packages. We also:

* Added monitoring and alerting for the health of external-registration (User Sync file processing).
* Updated regression test packs to include testing User Sync with encrypted files.
Status: Postmortem
Impact: None | Started At: April 30, 2024, 2:32 p.m.
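The regression test added after this incident might be sketched roughly as below. The postmortem does not name the encryption tooling, so this sketch assumes GnuPG-encrypted User Sync files, the python-gnupg wrapper, and a sample fixture file; the paths, key fingerprint, and test name are all hypothetical.

```python
import shutil
import gnupg  # python-gnupg; wraps the system gpg binary

SAMPLE_ENCRYPTED_FILE = "tests/fixtures/user_sync_sample.csv.gpg"  # placeholder fixture
EXPECTED_FINGERPRINT = "REPLACE_WITH_KEY_FINGERPRINT"              # placeholder

def test_image_can_decrypt_user_sync_files():
    # Fail fast if the deployed OS image is missing the gpg package itself,
    # which mirrors the root cause of this incident.
    assert shutil.which("gpg") is not None, "gpg binary missing from deployed image"

    gpg = gnupg.GPG()
    # The expected private key must already be imported into the keyring.
    fingerprints = [key["fingerprint"] for key in gpg.list_keys(secret=True)]
    assert EXPECTED_FINGERPRINT in fingerprints, "decryption key not present"

    # End-to-end check: decrypt a known-good encrypted sample file.
    with open(SAMPLE_ENCRYPTED_FILE, "rb") as fh:
        result = gpg.decrypt_file(fh)
    assert result.ok, f"decryption failed: {result.status}"
```

Running a check like this against each newly built image would catch a missing decryption dependency before the image reaches production.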
Description: ## Summary:

On February 8th, 2024, beginning at approximately 1:56 PM EST (18:56 UTC), we started receiving reports of Studio not performing as expected. The symptoms observed by some Studio users included:

* A "failed to fetch" or a "504 Gateway Timeout" error message.
* Unusually slow performance.

A recurrence of this incident was observed on April 24th, 2024.

## Impact:

Studio users who were actively navigating or using any Studio functions during these incidents were affected by the service disruption.

## Root Cause:

Studio services were failing to establish TCP connections to the Identity and Access Management (IAM) service due to a backup of TCP connection requests. The backup resulted from already-failed connection requests that were not dropped because they kept retrying to establish a connection for an extended period.

## Mitigation:

On both days, the immediate problem was mitigated by restarting the backend services with failed TCP connection attempts, purging the connection request queue of stale connections and allowing new connections to be established with the IAM service.

## Remediation Steps:

Our engineering team is reducing the time-to-live of TCP connection requests to the IAM service from the default 60 seconds to 10 seconds, so failed connections are dropped sooner and the backlog of connection requests to IAM is reduced. We have also implemented dashboards to track TCP connection failures and set alerting thresholds on failed TCP connections to help us get ahead of potential platform service disruptions.
Status: Postmortem
Impact: None | Started At: April 24, 2024, 10:57 p.m.
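The remediation described above, dropping failed connection attempts after 10 seconds instead of roughly 60, can be illustrated with a generic client-side sketch. This is not Firstup's implementation; the host, port, and error handling are placeholders, and only the 10-second figure comes from the postmortem.

```python
import socket

IAM_HOST = "iam.internal.example"  # placeholder for the IAM service endpoint
IAM_PORT = 443
CONNECT_TIMEOUT_SECONDS = 10       # down from a ~60-second default

def connect_to_iam() -> socket.socket:
    """Open a TCP connection to IAM, giving up after a short timeout.

    A short connect timeout means a failed attempt is dropped after 10 seconds
    instead of lingering and piling up behind other stale connection requests.
    """
    try:
        return socket.create_connection(
            (IAM_HOST, IAM_PORT), timeout=CONNECT_TIMEOUT_SECONDS
        )
    except OSError as exc:
        # Surface the failure immediately so callers can fail fast,
        # retry with backoff, or trip a circuit breaker.
        raise ConnectionError(
            f"IAM connect failed within {CONNECT_TIMEOUT_SECONDS}s: {exc}"
        ) from exc
```

Pairing a shorter connect timeout with the dashboards and alert thresholds mentioned above makes a growing backlog of failed connections visible before it becomes a platform-wide disruption.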
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.