Last checked: 8 minutes ago
Get notified about any outages, downtime or incidents for Flagsmith and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Flagsmith.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Flagsmith:
| Component | Status |
|---|---|
| Admin Dashboard | Active |
| Core API | Active |
| Edge API | Active |
| Public Website | Active |
View the latest incidents for Flagsmith and check for official updates:
Description: The incident was related to an erroneous DNS change. This has now been reverted and service should be back up and running. Failures may still be seen for a period while cached DNS records expire and the corrected records propagate.
Status: Resolved
Impact: Critical | Started At: Dec. 1, 2023, 3:10 p.m.
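Because resolvers cache records for the duration of their TTL, clients can keep seeing stale results for a while after a revert like this. As a small, hedged sketch of how an affected client might watch what its own resolver currently returns while the change propagates (the hostname below is a placeholder, not Flagsmith's actual API host):

```python
import socket
import time

# Placeholder hostname; substitute the host your SDK is configured to call.
HOSTNAME = "edge.api.example.com"


def watch_dns(hostname: str, attempts: int = 10, delay: float = 30.0) -> None:
    """Poll the local resolver and print what it currently returns for hostname."""
    for attempt in range(1, attempts + 1):
        try:
            addresses = sorted({info[4][0] for info in socket.getaddrinfo(hostname, 443)})
            print(f"attempt {attempt}: {hostname} -> {addresses}")
        except socket.gaierror as exc:
            print(f"attempt {attempt}: lookup failed ({exc})")
        if attempt < attempts:
            time.sleep(delay)


if __name__ == "__main__":
    watch_dns(HOSTNAME)
```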
Description:

## Summary

On September 5th at 09:45 UTC, we initiated a release that included a database migration intended to introduce a new constraint on the table containing flag data. According to our pre-live tests, this task should not have taken more than 50 milliseconds. Unfortunately, during the release to production, the migration needed to acquire a temporary lock on a table with high throughput, which caused a backlog of blocked connections waiting for the migration to complete. The knock-on effect exhausted the database's connections and a full restart was necessary. Once the restart was complete, the connections were restored and service resumed at 10:20 UTC.

## Next Steps

We have researched the cause of the issue, and some aspects still require further investigation. In the meantime, our plan is to implement the safeguards described in the following Postgres documentation, which should help reduce the impact of similar issues in the future:

* [https://www.postgresql.org/docs/11/runtime-config-client.html](https://www.postgresql.org/docs/11/runtime-config-client.html)
* [https://www.postgresql.org/docs/11/runtime-config-logging.html](https://www.postgresql.org/docs/11/runtime-config-logging.html) (`log_lock_waits`)
Status: Postmortem
Impact: Minor | Started At: Sept. 12, 2023, 11:14 a.m.
Description:

**Summary**

On September 5th at 09:45 UTC, we initiated a release that included a database migration intended to introduce a new constraint on the table containing flag data. According to our pre-live tests, this task should not have taken more than 50 milliseconds. Unfortunately, during the release to production, the migration needed to acquire a temporary lock on a table with high throughput, which caused a backlog of blocked connections waiting for the migration to complete. The knock-on effect exhausted the database's connections and a full restart was necessary. Once the restart was complete, the connections were restored and service resumed at 10:20 UTC.

**Next Steps**

We have researched the cause of the issue, and some aspects still require further investigation. In the meantime, our plan is to implement the safeguards described in the following Postgres documentation, which should help reduce the impact of similar issues in the future:

* [https://www.postgresql.org/docs/11/runtime-config-client.html](https://www.postgresql.org/docs/11/runtime-config-client.html)
* [https://www.postgresql.org/docs/11/runtime-config-logging.html](https://www.postgresql.org/docs/11/runtime-config-logging.html) (`log_lock_waits`)
Status: Postmortem
Impact: Critical | Started At: Sept. 5, 2023, 9:45 a.m.
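The first linked page covers client connection settings such as `lock_timeout`, and the second covers `log_lock_waits`. As a hedged sketch of how those safeguards could be applied around a migration-style DDL statement (the connection string, table, and constraint names are placeholders, and the 2-second timeout is an arbitrary example rather than a value Flagsmith has stated):

```python
import psycopg2

# Placeholder connection string and DDL; not Flagsmith's actual schema or migration.
DSN = "postgresql://app:secret@localhost:5432/app"
DDL = "ALTER TABLE example_flags ADD CONSTRAINT example_flags_name_uniq UNIQUE (name)"


def run_guarded_ddl(dsn: str, ddl: str) -> None:
    """Run a DDL statement that gives up quickly if it cannot acquire its lock.

    lock_timeout aborts the statement after 2 seconds of waiting, so blocked
    connections do not pile up behind it. log_lock_waits is a server-level
    (superuser) setting, enabled in postgresql.conf or via ALTER SYSTEM, that
    logs any lock wait longer than deadlock_timeout.
    """
    conn = psycopg2.connect(dsn)
    try:
        with conn:  # commit on success, roll back on error
            with conn.cursor() as cur:
                cur.execute("SET lock_timeout = '2s'")
                cur.execute(ddl)
    finally:
        conn.close()


if __name__ == "__main__":
    run_guarded_ddl(DSN, DDL)
```

The effect is that a migration which cannot acquire its lock quickly aborts and can be retried at a quieter time, rather than holding a growing queue of blocked connections until the pool is exhausted.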
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: Aug. 10, 2023, 12:15 p.m.
Description:

## Timeline

We were alerted at 23:39 UTC on 18/07/2023 that the queue for our asynchronous task processor was above the acceptable threshold. Once our team was online in India at 2:59am UTC, the status page was updated. By this time the task processor queue had backed up and the application was not able to write flag change events to the datastore which powers the Edge API. We investigated multiple avenues to determine the cause, but there were multiple ‘symptoms’ that made identifying the root cause very difficult. One specific issue, which turned out to be a red herring, related to the functionality that forwards Core API requests to the Edge API. This process seemed to be taking much longer than expected, and much of the investigation was spent restricting the usage of this functionality. At around 9:30am UTC, the cause was attributed to a particular set of tasks in the queue which were causing the processor units to run out of memory. Once it was determined to be safe to do so, these tasks were removed from the queue. At 10:19 UTC the issue had been resolved and the queue had returned to normal, meaning that flag change events were being written to the Edge API datastore again. Any changes that had not been processed at the time were re-run to ensure that the state was consistent with the changes made in the database.

## Issue Details

The issue was caused by an environment in the Flagsmith platform that included 400 segments and nearly 5000 segment overrides. This meant that the environment document which is generated to power the Edge API was larger than the task processor instances could load into memory and write to the Edge API datastore. To compound the issue, these changes were made via the Flagsmith API, which generated thousands of tasks to update the document in the Edge API datastore in a short space of time. Each of these tasks needed to load the offending environment, causing the task processor instances to fall into a cycle of running out of memory. These tasks were slowly being blocked from being picked up again by the processors, but their quantity meant that there were always new versions of the same (or very similar) tasks to pick up.

## Next Steps

* Implement limits on the size of the environment document. This will primarily consist of limiting the number of segments and features in a given project, as well as the total number of segment overrides in a given project.
* Deprecate the functionality to forward requests from the Core API to the Edge API. All projects using the Edge API will need to ensure that all connected SDKs are using the Edge API only.
Status: Postmortem
Impact: Minor | Started At: July 19, 2023, 2:58 a.m.
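The first next step amounts to rejecting writes that would push a project past a size the task processor can safely load. A minimal sketch of that kind of guard, with invented limit values, type names, and function names (Flagsmith's actual limits and internals may differ):

```python
from dataclasses import dataclass

# Illustrative limits only; the real values chosen by Flagsmith may differ.
MAX_SEGMENTS_PER_PROJECT = 100
MAX_SEGMENT_OVERRIDES_PER_PROJECT = 100


class ProjectLimitExceeded(Exception):
    """Raised when a write would push a project past its configured limits."""


@dataclass
class ProjectUsage:
    segments: int
    segment_overrides: int


def check_project_limits(usage: ProjectUsage) -> None:
    """Reject writes that would make the environment document too large for a
    task-processor worker to load into memory."""
    if usage.segments > MAX_SEGMENTS_PER_PROJECT:
        raise ProjectLimitExceeded(
            f"{usage.segments} segments exceeds the limit of {MAX_SEGMENTS_PER_PROJECT}"
        )
    if usage.segment_overrides > MAX_SEGMENT_OVERRIDES_PER_PROJECT:
        raise ProjectLimitExceeded(
            f"{usage.segment_overrides} segment overrides exceeds the limit of "
            f"{MAX_SEGMENT_OVERRIDES_PER_PROJECT}"
        )


if __name__ == "__main__":
    # The environment from the incident (400 segments, ~5000 overrides) is rejected.
    try:
        check_project_limits(ProjectUsage(segments=400, segment_overrides=5000))
    except ProjectLimitExceeded as exc:
        print(f"rejected: {exc}")
```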