Get notified about any outages, downtime or incidents for Trello and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Trello.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Trello:
| Component | Status |
|---|---|
| API | Active |
| Atlassian Support Knowledge Base | Active |
| Atlassian Support - Support Portal | Active |
| Atlassian Support Ticketing | Active |
| Trello.com | Active |
View the latest incidents for Trello and check for official updates:
Description:

### **SUMMARY**

On Nov 30 2023, between 14:04 and 16:57 UTC, Atlassian customers using Trello experienced errors when accessing and interacting with the application. This incident impacted Trello users on the iOS and Android mobile apps as well as those using the Trello web app. The event was triggered by the release of a code change that eventually overloaded a critical part of the Trello database. The incident was detected immediately by our automated monitoring systems and was mitigated by disabling the relevant code change. The issue was extended by the failure of a secondary service whose recovery increased the load on the same critical part of the Trello database, creating a negative feedback loop. This secondary service recovery involved reestablishing over a million connections, with each connection attempt adding load to the same part of the Trello database. We attempted to aid the service recovery by intentionally blocking some of the inbound Trello traffic to reduce load on the database and by increasing the capacity of the Trello database to better handle the high load. Over time the connections were all successfully reestablished, which returned Trello to a known good state. The total time to resolution was just under 3 hours.

### **IMPACT**

The impact window on the Trello product was Nov 30 2023, 14:04 UTC to 16:57 UTC. The incident caused service disruption to all Trello customers. Our metrics show elevated API response times and increased error rates throughout the entire incident period, which indicates that most users were unable to load Trello at all or to interact with the application in any meaningful way. The overloaded database collection is one the Trello service needs in order to make authorization decisions, which meant that all requests were impacted.

### **ROOT CAUSE**

The issue was caused by a series of changes intended to standardize Trello’s approach to authorizing requests, which had the unintended side effect of turning a database query from a [targeted operation](https://www.mongodb.com/docs/manual/core/sharded-cluster-query-router/#std-label-sharding-mongos-targeted) into a [broadcast operation](https://www.mongodb.com/docs/manual/core/sharded-cluster-query-router/#std-label-sharding-mongos-broadcast). Broadcast operations are more resource-intensive because they must be sent to _all_ database servers to be satisfied. These broadcast operations eventually overloaded some of the Trello database servers as Trello approached its daily peak usage period on Nov 30 2023.

1. The first change of this type was deployed over a period of seven days at the end of August and changed the authorization type used by our websocket service. This meant that newly established websocket connections required the new broadcast query. At any given moment we have a very large number of _established_ websocket connections, but the usual rate of _new_ websocket connections is relatively low. Therefore, our monitoring systems detected only a slight increase in resource usage and flagged this change as a low-priority performance regression. We acknowledged the regression and created a task to identify and reduce the resource demands of these new queries.
2. The second change of this type was deployed over the course of a few days before being fully rolled out on Nov 29, 2023, the day before this incident. This change caused the Trello application server to use the new broadcast query while authorizing standard web browser traffic, which is the vast majority of our traffic. The change was fully deployed at 19:34 UTC on Nov 29, during a low-traffic period.

The next day, as the application approached its daily peak traffic period, our monitoring on the database servers indicated they were overloaded. When these database nodes were overloaded, users' HTTP requests received very slow responses or HTTP 504 errors. As we activated our load-shedding strategies, some users received HTTP 429 errors.

The incident’s length can be attributed to a secondary failure in which our websocket servers experienced a rapid increase in memory usage, leading to processes crashing with OutOfMemoryErrors. As new servers came online and the websockets attempted to reconnect, they once again generated the broadcast queries on the Trello database servers. These broadcast queries continued to put load on the database, which meant the Trello API continued to have high latency, perpetuating the negative feedback loop. We are working to determine the root cause of the OutOfMemoryErrors.

We also determined after the incident that, because the Trello application server made the load-shedding decision AFTER performing the authorization step, the overloaded database servers were still being queried before the request was rejected. We are working to improve our load-shedding strategies post-incident.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

We know that outages impact your productivity, and we are continually working to improve our testing and preventative processes to prevent similar outages in the future. We are prioritizing the following improvement actions to avoid repeating this type of incident:

* Increase the capacity of our database (completed during the incident).
  * This action is the most critical and is aimed at preventing a recurrence of this particular incident and at recovering gracefully if the websocket service were to fail again.
* Refactor the new authorization approach to avoid [broadcast operations](https://www.mongodb.com/docs/manual/core/sharded-cluster-query-router/#std-label-sharding-mongos-broadcast).
* Add pre-deployment tests to avoid releasing unnecessary broadcast operations.
* Determine the root cause of the secondary failure of the websocket service.

Furthermore, we deploy our changes only after thorough review and automated testing, and we deploy them progressively using feature flags to avoid broad impact. To minimize the impact of breaking changes to our environments, we will implement additional preventative measures:

* Ensure that our load-shedding strategies fail fast.
* Add monitoring to observe [broadcast operations](https://www.mongodb.com/docs/manual/core/sharded-cluster-query-router/#std-label-sharding-mongos-broadcast) in all our environments.

We apologize to customers who were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,
Atlassian Customer Support
Status: Postmortem
Impact: Critical | Started At: Nov. 30, 2023, 2:21 p.m.
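The root cause above turns on the difference between a targeted query and a broadcast (scatter-gather) query against a sharded MongoDB cluster. The minimal sketch below, in Python with pymongo, uses a hypothetical collection and shard key (not Trello's actual schema) to show how omitting the shard key from a filter forces the router to fan the query out to every shard.

```python
# Minimal sketch (not Trello's actual schema): "memberships" is assumed to be
# a sharded collection whose shard key is {"boardId": 1}.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
memberships = client["trello_like_db"]["memberships"]

# Targeted operation: the filter includes the shard key, so the mongos router
# can send the query to exactly one shard.
memberships.find_one({"boardId": "abc123", "userId": "u42"})

# Broadcast (scatter-gather) operation: the shard key is missing, so mongos
# must ask every shard and merge the results, multiplying load cluster-wide.
memberships.find_one({"userId": "u42"})
```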
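The postmortem also notes that the load-shedding decision was made only after the authorization step, so requests that were about to be rejected still queried the overloaded database. A minimal sketch of the ordering fix, using hypothetical `limiter` and `authorizer` stand-ins rather than Trello's real components, looks like this:

```python
# Sketch of the ordering fix: make the in-memory load-shedding decision before
# the authorization step, so a request that will be rejected anyway never
# queries the already-overloaded database. `limiter` and `authorizer` are
# hypothetical stand-ins.
def handle_request(request, limiter, authorizer):
    # 1. Shed load first: a cheap, in-memory check that fails fast.
    if limiter.over_capacity():
        return {"status": 429, "body": "Too Many Requests"}

    # 2. Only then authorize, which may query the (possibly stressed) database.
    if not authorizer.is_allowed(request):
        return {"status": 403, "body": "Forbidden"}

    # 3. Normal request handling continues here.
    return {"status": 200, "body": "OK"}
```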
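One of the listed remediations is a pre-deployment test that catches unintended broadcast operations. A rough sketch of such a check, again assuming pymongo and the hypothetical sharded collection above, could inspect a query's explain plan and fail if it touches more than one shard. Explain output fields differ across MongoDB server versions, so the field names here are illustrative rather than definitive.

```python
# Rough sketch of a pre-deployment check for unintended broadcast operations.
# The collection, shard key, and explain-output field names are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")       # placeholder URI
memberships = client["trello_like_db"]["memberships"]   # hypothetical sharded collection

def shards_touched(query_filter: dict) -> int:
    """Best-effort count of shards the winning plan reads from."""
    plan = memberships.find(query_filter).explain()
    winning = plan.get("queryPlanner", {}).get("winningPlan", {})
    # On a sharded cluster the winning plan typically lists the shards it
    # targets; a targeted query lists one, a broadcast query lists several.
    shards = winning.get("shards", [])
    return len(shards) if shards else 1

def assert_targeted(query_filter: dict) -> None:
    """Fail fast in CI if a query would fan out to more than one shard."""
    touched = shards_touched(query_filter)
    if touched > 1:
        raise AssertionError(
            f"filter {query_filter!r} broadcasts to {touched} shards; "
            "include the shard key so mongos can target a single shard"
        )

# Example: includes the hypothetical shard key, so it should stay targeted.
assert_targeted({"boardId": "abc123", "userId": "u42"})
```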
Description: Forge Invocations had an 8-minute outage between 2023-11-29 03:05:13 UTC and 2023-11-29 03:13:27 UTC, resulting in Smart Links failing. The service has since recovered.
Status: Resolved
Impact: Minor | Started At: Nov. 29, 2023, 3 a.m.
Description: This incident has been resolved. If you're still seeing issues, please reach out at https://trello.com/contact/
Status: Resolved
Impact: Major | Started At: Nov. 16, 2023, 10:37 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.