Outage and incident data over the last 30 days for Confluence.
OutLogger tracks the status of these components for Confluence:
| Component | Status |
|---|---|
| Administration | Active |
| Authentication and User Management | Active |
| Cloud to Cloud Migrations - Copy Product Data | Active |
| Comments | Active |
| Confluence Automations | Active |
| Create and Edit | Active |
| Marketplace Apps | Active |
| Notifications | Active |
| Purchasing & Licensing | Active |
| Search | Active |
| Server to Cloud Migrations - Copy Product Data | Active |
| Signup | Active |
| View Content | Active |
| Mobile | Active |
| Android App | Active |
| iOS App | Active |
View the latest incidents for Confluence and check for official updates:
Description:

### Summary

On February 28, 2024, between 12:17 UTC and 15:23 UTC, Jira and Confluence apps built on the Connect platform were unable to perform any actions on behalf of users. Some apps may have retried these actions and later succeeded, while others failed the request. The incident was detected within four minutes by automated monitoring of service reliability and mitigated by manually scaling the service, which returned Atlassian systems to a known good state. The total time to resolution was about three hours and six minutes.

### **Technical Summary**

On February 28, 2024, between 12:17 UTC and 15:23 UTC, Jira and Confluence apps built on the Connect platform were unable to perform token exchanges for user impersonation requests initiated by the app. The event was triggered by the failure of the oauth-2-authorization-server service to scale as load increased. The unavailability of this service, combined with apps retrying failing requests, created a feedback loop that compounded the impact of the service not scaling. The problem affected customers in all regions. The incident was detected within four minutes by automated monitoring of service reliability and mitigated by manually scaling the service, which returned Atlassian systems to a known good state. The total time to resolution was about three hours and six minutes.

### **IMPACT**

The impact window was February 28, 2024, between 12:17 UTC and 15:23 UTC, and affected Connect apps for Jira and Confluence products that relied on the user impersonation feature. The incident caused service disruption to customers in all regions. Apps making requests on behalf of users would have seen some of those requests fail throughout the incident. Where apps had retry mechanisms in place, these requests may have eventually succeeded once the service was back in a good state. Impacted apps received HTTP 502 and 503 errors, as well as request timeouts, when calling the oauth-2-authorization-server service.

Product functionality such as automation rules in Automation for Jira is partially built on the Connect platform, and some of these rules were impacted. During the impact window, automation rules executing on behalf of a user (instead of Automation for Jira) failed to authenticate. Rules that failed were recorded in the Automation Audit Log. Additionally, manually triggered rules would have failed to trigger; these do not appear in the Automation Audit Log. Overall, this impacted approximately 2% of all rules run in the impact window. Automation for Confluence was not impacted.

### **ROOT CAUSE**

The issue was caused by an increase in traffic to the oauth-2-authorization-server service in the US-East region and the service not autoscaling in response to the increased load. As the service began to fail requests, apps retried them, which further increased the load. Adding processing resources (scaling the nodes) allowed the service to handle the increased load and restored availability.

While we have a number of testing and preventative processes in place, this specific issue wasn't identified because these load conditions had not been encountered previously. The service has operated for many years in its current configuration and had never experienced this particular failure mode, where traffic ramped faster than our ability to scale. As a result, the scaling controls were not exercised, and when required they did not proactively scale the oauth-2-authorization-server service because the CPU scaling threshold was never reached.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

We know that outages impact your productivity. We are prioritizing the following improvement actions to avoid repeating this type of incident:

* The CPU threshold for scaling the service has been lowered significantly so that scaling begins much earlier as service load increases in each region.
* We are updating our scaling policy to step scaling in order to add capacity more rapidly when there are significant load increases.
* We have increased the minimum number of nodes for the service and will monitor service behaviour to determine the optimal minimum scaling value.
* We will further analyse when rate limiting is triggered to determine whether apps respond to rate limiting appropriately. The service's rate limiting is described in [https://developer.atlassian.com/cloud/confluence/user-impersonation-for-connect-apps/#rate-limiting](https://developer.atlassian.com/cloud/confluence/user-impersonation-for-connect-apps/#rate-limiting).
* Longer term, we will explore network-based rate limiting to prevent a misbehaving app from overloading the service.

We apologize to customers, partners, and developers whose services were impacted during this incident; we are taking immediate steps to improve the platform's performance and availability.

Thanks,
Atlassian Customer Support
Status: Postmortem
Impact: Critical | Started At: Feb. 28, 2024, 1:41 p.m.
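The root cause above describes a retry feedback loop, and the remediation plan mentions checking whether apps respond to rate limiting appropriately. As an illustration only, here is a minimal sketch of a client-side retry policy that backs off exponentially and honours `Retry-After` instead of hammering a failing token endpoint. The endpoint URL and payload are hypothetical placeholders, not Atlassian's Connect user-impersonation API, which is documented at the developer.atlassian.com link above.

```python
import random
import time

import requests

# Hypothetical token endpoint; the real Connect user-impersonation exchange
# differs in detail and is described in Atlassian's developer documentation.
TOKEN_URL = "https://oauth-server.example.com/oauth2/token"


def fetch_token_with_backoff(payload, max_attempts=5, base_delay=1.0):
    """Request a token, backing off on 429/5xx instead of retrying immediately.

    Honouring Retry-After and capping attempts avoids the kind of retry
    feedback loop described in the postmortem above.
    """
    for attempt in range(1, max_attempts + 1):
        resp = requests.post(TOKEN_URL, data=payload, timeout=10)
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code in (429, 502, 503) and attempt < max_attempts:
            # Prefer the server's Retry-After hint (assumed to be seconds);
            # otherwise use jittered exponential backoff.
            retry_after = resp.headers.get("Retry-After")
            delay = float(retry_after) if retry_after else base_delay * (2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, 0.5))
            continue
        resp.raise_for_status()
    raise RuntimeError("token exchange failed after retries")
```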
Description:

### Summary

On February 21, 2024, between 2:30am and 4:15am UTC, Atlassian customers using Jira Software, Jira Service Management, Jira Work Management, and Confluence Cloud products were unable to view issues or pages. The event was triggered by a change to Atlassian's network (Edge) infrastructure in which an incorrect security credential was deployed. This impacted requests to Atlassian's Cloud originating from the Europe and South Asia regions. The incident was detected within 21 minutes by monitoring and mitigated by a failover to other Edge regions and a rollback of the failed deployment, which returned Atlassian systems to a known good state. The total time to resolution was about 1 hour and 45 minutes.

### **IMPACT**

The failed change impacted 3 out of the 14 Atlassian Cloud regions (Europe/Frankfurt, Europe/Dublin, and India/Mumbai). Between 2:30am and 4:15am UTC on February 21, 2024, end users may have experienced intermittent errors or complete service disruption across multiple Cloud products. Because traffic is directed to Atlassian Cloud using DNS latency-based records, only traffic originating from locations close to Europe and India was impacted.

### **ROOT CAUSE**

A change to our network infrastructure used faulty credentials. As a result, customer authentication requests could not be validated, and requests were returned with 500 or 503 errors. After investigation, it was found that the health checks and tests that should have prevented the faulty credentials from reaching the production environment contained a bug and never indicated a fault.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn't identified in our dev and staging environments because the new credentials were only valid for production. We are prioritizing the following improvement actions to avoid repeating this type of incident:

* Improving end-to-end health checks
* Faster rollback of our infrastructure deployments
* Improved monitoring

Furthermore, we deploy our changes progressively (by cloud region) to avoid broad impact, but in this case our detection and health checks did not work as expected. To minimise the impact of breaking changes to our environments, we will implement additional preventative measures such as:

* Canary and shakedown deployments with automated rollback

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform's performance and availability.

Thanks,
Atlassian Customer Support
Status: Postmortem
Impact: Major | Started At: Feb. 21, 2024, 3:41 a.m.
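The root cause above notes that the existing health checks "never indicated a fault", and the remediation calls for end-to-end health checks. The following is a minimal, hypothetical sketch (not Atlassian's tooling) of a check that exercises an authenticated request along the same path end users take and fails on server errors, rather than probing only a shallow health endpoint. The URL and token handling are placeholder assumptions.

```python
import sys

import requests

# Hypothetical endpoint; a real end-to-end check would exercise the same
# authentication path that end users hit, not just a /healthcheck route.
AUTHENTICATED_URL = "https://example.atlassian.net/wiki/rest/api/space"


def end_to_end_check(session_token: str) -> bool:
    """Return True only if an authenticated request completes without a 5xx.

    A check like this would surface a faulty credential deployment: requests
    failing with 500/503 make the check fail instead of silently passing.
    """
    try:
        resp = requests.get(
            AUTHENTICATED_URL,
            headers={"Authorization": f"Bearer {session_token}"},
            timeout=5,
        )
    except requests.RequestException:
        return False
    # Treat any server-side error as a failed check, so a bad deployment is
    # detected and can be rolled back automatically.
    return resp.status_code < 500


if __name__ == "__main__":
    token = sys.argv[1] if len(sys.argv) > 1 else "placeholder-token"
    sys.exit(0 if end_to_end_check(token) else 1)
```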
Description:

### **Summary**

On February 14, 2024, between 20:05 UTC and 23:03 UTC, Atlassian customers on the following cloud products encountered a service disruption: Access, Atlas, Atlassian Analytics, Bitbucket, Compass, Confluence, Ecosystem apps, Jira Service Management, Jira Software, Jira Work Management, Jira Product Discovery, Opsgenie, StatusPage, and Trello. As part of a security and compliance uplift, we had scheduled the deletion of unused and legacy domain names used for internal service-to-service connections. Active domain names were incorrectly deleted during this event, which impacted all cloud customers across all regions. The issue was identified and resolved by rolling back the faulty deployment, restoring the domain names and returning Atlassian systems to a stable state. The time to resolution was two hours and 58 minutes.

### **IMPACT**

External customers started reporting issues with Atlassian cloud products at 20:52 UTC. The failed change led to performance degradation or, in some cases, complete service disruption. Symptoms experienced by end users were unsuccessful page loads and/or failed interactions with our cloud products.

### **ROOT CAUSE**

As part of a security and compliance uplift, we had scheduled the deletion of unused and legacy domain names that were being used for internal service-to-service connections. Active domain names were incorrectly deleted during this operation.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

We know that outages impact your productivity. Detection was delayed because existing testing and monitoring focused on individual service health rather than the availability of the system as a whole. To prevent a recurrence of this type of incident, we are implementing the following improvement measures:

* Canary checks that monitor the availability of the entire system.
* Faster rollback procedures for this type of service impact.
* Stricter change control procedures for infrastructure modifications.
* Migration of all DNS records to centralised management, with stricter access controls on modifications to DNS records.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform's performance and availability.

Thanks,
Atlassian Customer Support
Status: Postmortem
Impact: None | Started At: Feb. 14, 2024, 9:57 p.m.
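The root cause above was the deletion of domain names that were still in active use during a cleanup of legacy records. As an illustration only, and not a description of Atlassian's actual tooling, the sketch below shows a pre-deletion guard that refuses to remove any candidate name that still resolves in DNS; a production version would also need to consult traffic metrics, and the candidate names here are hypothetical.

```python
import socket

# Hypothetical cleanup candidates; in the incident above, active names were
# deleted along with genuinely unused legacy ones.
CANDIDATES = ["legacy-internal.example.com", "old-service.example.com"]


def still_resolves(name: str) -> bool:
    """Return True if the name still resolves in DNS."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def partition_candidates(names):
    """Split candidates into deletable names and names that are still live.

    A guard like this (ideally combined with traffic data) would block a bulk
    deletion of records that internal services still depend on.
    """
    deletable, still_live = [], []
    for name in names:
        (still_live if still_resolves(name) else deletable).append(name)
    return deletable, still_live


if __name__ == "__main__":
    deletable, still_live = partition_candidates(CANDIDATES)
    print("safe to delete:", deletable)
    print("blocked (still resolving):", still_live)
```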