Get notified about any outages, downtime, or incidents for Firstup and 1,800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for Firstup.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
View the latest incidents for Firstup and check for official updates:
Description: On February 12th, 2024, starting at around 9:20 AM ET, we received reports that some users were experiencing issues accessing Creator Studio. The reports included general slowness loading Studio functions, as well as system error messages such as “Failed to fetch. Try again” or “504 Bad Gateway”. During our investigation, it was identified that the number of available database connections was exhausted, so new Creator Studio requests could not establish a connection to the database. To mitigate this, a rolling redeployment of the dependent backend services that held connections to the database was performed to free up any “stuck” connections, allowing new Creator Studio requests to connect to the database. As a recurrence prevention measure for this incident, we are working on reducing the demand for database connections in the short term via methods such as:
* Connection pooling (a minimal pooling sketch follows this incident entry)
* Read/write splitting

In the long term, we will be looking at:
* Upgrading our database infrastructure
* Increasing our database connection limits.
Status: Postmortem
Impact: None | Started At: Feb. 12, 2024, 2:47 p.m.
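The remediation steps above mention connection pooling as a way to reduce demand for database connections. As an illustration only, and not Firstup's implementation, the sketch below assumes a Python backend service using SQLAlchemy against a PostgreSQL database; the connection URL, table, and pool limits are hypothetical. It shows how a bounded, pre-pinged pool caps the connections a single service can hold and recycles idle ones so they cannot get “stuck”.

```python
# Minimal sketch of client-side connection pooling (hypothetical names and limits).
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://studio:secret@db.internal:5432/creator_studio",
    pool_size=10,        # steady-state connections this service may hold
    max_overflow=5,      # short-lived extra connections allowed under burst load
    pool_timeout=30,     # seconds to wait for a free connection before failing fast
    pool_recycle=1800,   # periodically replace long-lived connections
    pool_pre_ping=True,  # test a connection before handing it to a request
)

def fetch_recent_campaigns():
    # Connections are borrowed from the pool and returned when the block exits,
    # so concurrent requests share a fixed budget instead of opening new ones.
    with engine.connect() as conn:
        rows = conn.execute(text("SELECT id, title FROM campaigns LIMIT 50"))
        return rows.fetchall()
```

With a cap like this on every backend service, a traffic spike exhausts the pool's wait queue and fails fast rather than exhausting the database's global connection limit.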
Description:
## Summary:
On February 8th, 2024, beginning at approximately 1:56 PM EST (18:56 UTC), we started receiving reports of Studio not performing as expected. The symptoms observed by some Studio users included:
* A “failed to fetch” or a “504 Gateway Timeout” error message.
* Unusually slow performance.

A recurrence of this incident was also observed on April 24th, 2024.
## Impact:
Studio users who were actively trying to navigate through and use any Studio functions during these incidents were impacted by the service disruption.
## Root Cause:
It was identified that Studio services were failing to establish a TCP connection to the Identity and Access Management (IAM) service due to a backup of TCP connection requests. The backup resulted from other, already-failed connection requests that were not dropped because they kept retrying to establish a connection for an extended period.
## Mitigation:
On both days, the immediate problem was mitigated by restarting the backend services that had failed TCP connection attempts, in effect purging the connection request queue of stale connections and allowing new connections to be established with the IAM service.
## Remediation Steps:
Our engineering team is working on reducing the time-to-live of TCP connection requests to the IAM service from the default 60 seconds to 10 seconds, so that failed connections are dropped sooner and the backup of connection requests to IAM is reduced (a minimal timeout sketch follows this incident entry). In addition, we have implemented dashboards to track TCP connection failures, and set alerting thresholds on failed TCP connections to help us get ahead of a potential platform service disruption.
Status: Postmortem
Impact: None | Started At: Feb. 8, 2024, 7:16 p.m.
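The remediation above centers on dropping failed connection attempts to IAM sooner so they do not back up behind healthy ones. As an illustration only, not Firstup's code, the sketch below assumes a Python caller using the requests library against a hypothetical IAM endpoint, and shows an explicit 10-second connect/read timeout in place of a much longer default.

```python
# Minimal sketch: fail fast on calls to a dependency (hypothetical IAM endpoint).
import requests

IAM_BASE_URL = "https://iam.internal.example"  # placeholder, not a real service URL

def introspect_token(token: str) -> dict:
    resp = requests.post(
        f"{IAM_BASE_URL}/v1/introspect",
        data={"token": token},
        # (connect timeout, read timeout): abandon an unresponsive IAM endpoint
        # after 10 seconds instead of letting the attempt linger and pile up
        # behind other stalled requests.
        timeout=(10, 10),
    )
    resp.raise_for_status()
    return resp.json()
```

Combined with the dashboards and alert thresholds mentioned above, a short timeout turns a slow upstream dependency into quick, visible errors rather than a growing queue of stale connection attempts.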
Description: The root cause of this service disruption was identified as the same as the one described in another incident's postmortem. Please follow this [link](https://status.firstup.io/incidents/4kpmpkmvkrz0) for additional details.
Status: Postmortem
Impact: None | Started At: Jan. 25, 2024, 3:13 p.m.
Description: On January 10th, 2024, starting at 11:11 AM ET, we observed that the “User Export” jobs queue started growing, causing such jobs to take longer than usual to process and complete. Our investigation revealed that a large User Export job caused our systems to hit a memory error, resulting in the job failing and restarting in a loop and consequently preventing other User Export jobs from being processed. As a mitigation step, we increased the worker pod memory to accommodate the large User Export job, allowing it to complete successfully and freeing up the queue for backlogged jobs to resume processing. The queue backlog took approximately 2 hours to catch up, and all jobs were completed by 1:23 PM ET. As a long-term solution to the memory limitation, we implemented autoscaling so that memory is allocated automatically as needed, improving the performance of queue processing (a brief sketch of this kind of memory adjustment follows this incident entry).
Status: Postmortem
Impact: None | Started At: Jan. 10, 2024, 4:21 p.m.
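The long-term fix above is autoscaling so that worker memory grows with job size. As a hedged sketch, not Firstup's tooling, the example below assumes the export workers run as a Kubernetes Deployment and uses the official Kubernetes Python client to apply the kind of memory-limit increase that a vertical autoscaler would make automatically; the deployment name, namespace, and sizes are hypothetical.

```python
# Minimal sketch: raising a worker Deployment's memory limit via the Kubernetes API
# (hypothetical names and sizes; a vertical autoscaler automates this adjustment).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "user-export-worker",
                        "resources": {
                            "requests": {"memory": "1Gi"},
                            "limits": {"memory": "4Gi"},  # headroom for oversized export jobs
                        },
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(name="user-export-worker", namespace="jobs", body=patch)
```

Patching the pod template rolls the worker pods with the new limits, which mirrors the manual mitigation described above; an autoscaler watches actual usage and applies such changes without operator intervention.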
Description: This service disruption was caused by the deployment of a hot fix that was intended to fix another issue on the platform. The hot fix was rolled back in order to restore the affected services. Our Engineering and Quality Assurance teams reviewed the problematic hot fix, corrected any negative dependencies, and ensured that its redeployment caused no further regressions on the platform.
Status: Postmortem
Impact: None | Started At: Nov. 15, 2023, 11:31 p.m.