Outage and incident data over the last 30 days for Flagsmith.
OutLogger tracks the status of these components for Flagsmith:
Component | Status |
---|---|
Admin Dashboard | Active |
Core API | Active |
Edge API | Active |
Public Website | Active |
View the latest incidents for Flagsmith and check for official updates:
Description: This incident has been resolved.
Status: Resolved
Impact: None | Started At: Feb. 27, 2023, 2:53 p.m.
Description: The issue with our downstream provider has been resolved. We will continue to monitor.
Status: Resolved
Impact: Minor | Started At: Feb. 24, 2023, 10:16 p.m.
Description: At 12:46 UTC on Thursday 18th August, our monitoring picked up an increased number of HTTP 502s being served by our API. Upon investigation, it became evident that an unexpected increase in load on the PostgreSQL database that serves our Core API was causing our application to struggle to serve some requests, and we saw increased latency on those that were being served.

In an attempt to resolve the issue, we adjusted the settings in our ECS cluster to reduce the number of connections to the database. Unfortunately, making this change via our infrastructure-as-code workflow meant that the ECS service tried to recreate all the tasks but couldn't do so, because the health check was unable to consistently report a healthy status. This meant that our Core API was essentially flapping up and down while it tried to reinstate all the tasks. During this period, our API continued to serve some requests with increased latency, but a large proportion of requests would still have received HTTP 502s.

Following the above, our engineering team looked into the requests that were causing the increased load. From our investigation, it was apparent that the increased load was all directed at our environment document endpoint (which powers local evaluation in our latest server-side clients). This endpoint, although usable in our Core API, is very intensive: it generates the whole environment document from our PostgreSQL database and returns it to the client as JSON, which involves a large number of queries.

The compounding factor was a bug in our Node client regarding request timeouts. The Node client takes a requestTimeoutSeconds argument on instantiation, but it passed this value directly into the fetch function of the node-fetch library, which expects the timeout in milliseconds. As such, if requestTimeoutSeconds was set to e.g. 3, the request would time out after 3ms and retry (3 times by default). So, every time a Node client polled for the environment, it would make 3 requests in ~9ms (or as close to it as Node can manage).

We were able to block the traffic to this endpoint for the customer that was putting an unusual amount of load through it due to their configuration and the above bug in the Node client. Once we had blocked this traffic, at 15:24 UTC, the application began serving traffic as normal again. From this point, traffic to the Core API was back to normal and all requests were served successfully.

To remediate this issue, we are stepping up our efforts to encourage all of our clients to move over to our Edge API, which is immune to issues of this nature. We are also planning to make improvements to the existing Core API platform to help guard against these issues in the future:

1. The addition of caching to our environment document endpoint to improve performance and minimise database impact
2. The implementation of automated rate limiting to better protect the platform from issues of this nature

If you've read this and are unsure how to migrate to our Edge API, you can find out everything you need to know [here](https://docs.flagsmith.com/advanced-use/edge-api).
Status: Postmortem
Impact: Major | Started At: Aug. 18, 2022, 1 p.m.
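The seconds-versus-milliseconds mix-up described in the August 18, 2022 postmortem can be illustrated with a small sketch. This is not the Flagsmith Node client's actual code: `ClientOptions`, `buggyTimeoutMs`, `fixedTimeoutMs` and `worstCaseWindowMs` are hypothetical names chosen for illustration, and only `requestTimeoutSeconds` and the figure of 3 requests per poll come from the postmortem.

```typescript
// Hypothetical option shape; only requestTimeoutSeconds is named in the postmortem.
interface ClientOptions {
  requestTimeoutSeconds: number;
  attemptsPerPoll: number; // assumed: 3, matching "3 requests" in the postmortem
}

// Buggy behaviour: the value in seconds is forwarded unchanged to an API
// that interprets it as milliseconds.
function buggyTimeoutMs(opts: ClientOptions): number {
  return opts.requestTimeoutSeconds;
}

// Intended behaviour: convert units explicitly at the boundary.
function fixedTimeoutMs(opts: ClientOptions): number {
  return opts.requestTimeoutSeconds * 1000;
}

// Time window in which every attempt of a single poll has timed out.
function worstCaseWindowMs(timeoutMs: number, attempts: number): number {
  return timeoutMs * attempts;
}

const opts: ClientOptions = { requestTimeoutSeconds: 3, attemptsPerPoll: 3 };

const buggyWindow = worstCaseWindowMs(buggyTimeoutMs(opts), opts.attemptsPerPoll);
const fixedWindow = worstCaseWindowMs(fixedTimeoutMs(opts), opts.attemptsPerPoll);

console.log(`buggy: ${opts.attemptsPerPoll} requests within ~${buggyWindow}ms`);
console.log(`fixed: ${opts.attemptsPerPoll} requests within ~${fixedWindow}ms`);
```

Under these assumptions, the buggy path squeezes all attempts of one poll into roughly 9ms, while the converted value spreads them over up to 9 seconds, so a slow endpoint has a chance to respond before the client retries.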
Description: This incident has been resolved.
Status: Resolved
Impact: Critical | Started At: July 10, 2022, 5:53 p.m.
Description: As part of a new feature rollout, a large database migration needed to take place. We knew that the migration would take some time, but it should not have affected production traffic. Unfortunately, despite our health check returning unhealthy until all migrations are complete, AWS ECS promoted the new version of the API application before the migrations had finished. This meant that the running code expected certain columns and data to be available in the database that weren't there yet. We are still investigating what caused ECS to promote the new version before the migrations were complete.
Status: Postmortem
Impact: None | Started At: July 1, 2022, 11:31 a.m.
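The July 1, 2022 postmortem mentions a health check that reports unhealthy until all migrations are complete. A minimal sketch of that kind of check follows, assuming migrations are tracked by name. The function `healthStatusCode` and the migration names are hypothetical, not Flagsmith's implementation, and the sketch does not explain why ECS promoted the new version early, which the postmortem says is still under investigation.

```typescript
// Returns the HTTP status a health endpoint might report:
// 200 once every required migration has been applied, 503 otherwise.
// Both lists are hypothetical inputs; a real service would read the
// applied migrations from its database.
function healthStatusCode(required: string[], applied: string[]): number {
  const pending = required.filter((name) => applied.indexOf(name) === -1);
  return pending.length === 0 ? 200 : 503;
}

const required = ["0001_initial", "0002_add_feature_columns"];

// Migration still running: the new task should not receive traffic yet.
console.log(healthStatusCode(required, ["0001_initial"]));

// All migrations applied: the task can be reported as healthy.
console.log(healthStatusCode(required, ["0001_initial", "0002_add_feature_columns"]));
```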