Last checked: 6 minutes ago
Get notified about any outages, downtime or incidents for Jira and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Jira.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for Jira:
Component | Status |
---|---|
Administration | Active |
Authentication and User Management | Active |
Automation for Jira | Active |
Create and edit | Active |
Marketplace | Active |
Mobile | Active |
Notifications | Active |
Purchasing & Licensing | Active |
Search | Active |
Signup | Active |
Viewing content | Active |
View the latest incidents for Jira and check for official updates:
Description:

### Summary

On February 29, 2024, between 05:11 and 08:46 UTC, customers of Jira Software, Jira Service Management, Jira Work Management, Jira Product Discovery, and Confluence experienced an incorrect data update of Automation Rules. This was due to an incomplete feature flag rollout and bugs in the data upgrade code. 99.99% of affected Automation Rules were remediated by 17:00 UTC on February 29, 2024. The remaining 0.01% of Rule Components were edited by users during the impacted time window and required a confirmation to ensure changes weren’t overridden. These customers were contacted proactively.

### **IMPACT**

Some Rules containing Issue Edit, Issue Create, or Issue Clone actions in Jira products had the Advanced Fields section removed. These Rules continued to run as if the Advanced Fields section were empty. In addition, for affected Issue Edit actions, Jira notifications were not sent. Some Rules using the Send Email action in Jira and Confluence had the “From name” field removed. Emails continued to be sent but fell back to the Automation default for the “From name” field.

### **ROOT CAUSE**

We were rolling out a new feature for Automation that required data upgrades. These upgrades were risky because the current system would run them automatically upon Rule activity, so we used feature flags to better control when the upgrade occurred. However, an error in the feature flag configuration resulted in the upgrades kicking in immediately. Additionally, bugs in the upgrade code overrode customers' saved values with default values. This incorrect configuration was used for Rule runs until recovery.

### **REMEDIAL ACTION PLAN & NEXT STEPS**

We are prioritizing the following improvement actions to avoid repeating this type of incident:

* Changing the approach for upgrading Rule configurations to allow better testing and prevent accidental upgrades.
* Improving feature flag rollout and verification processes to avoid such incorrect configurations.
* Increasing the frequency of backups of the Automation Rule data.

We apologize to customers who were impacted during this incident; we are taking immediate steps to improve the reliability of Automation.

Thanks,
Atlassian Customer Support
Status: Postmortem
Impact: Major | Started At: Feb. 29, 2024, 8:22 a.m.
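The root cause in the postmortem above (a feature flag that unintentionally enabled an automatic data upgrade, combined with upgrade code that replaced saved values with defaults) is a common failure pattern. As a rough illustration only, and not Atlassian's actual implementation, the Python sketch below shows a hypothetical rule-upgrade path where the flag fails closed and the upgrade never overwrites values a user has already saved. All names (`FeatureFlags`, `upgrade_rule`, the field names) are assumptions made for the example:

```python
# Hypothetical sketch of a feature-flag-gated data upgrade.
# Names and fields are illustrative, not Atlassian's actual code.

DEFAULTS = {"advanced_fields": "", "from_name": "Automation"}


class FeatureFlags:
    """Minimal in-memory flag store; unknown flags default to 'off'."""

    def __init__(self, flags=None):
        self._flags = flags or {}

    def is_enabled(self, name: str) -> bool:
        # An unknown or misconfigured flag fails closed (no upgrade runs),
        # the safeguard that was missing in the incident described above.
        return self._flags.get(name, False)


def upgrade_rule(rule: dict, flags: FeatureFlags) -> dict:
    """Upgrade a rule's config to a new schema without discarding saved values."""
    if not flags.is_enabled("automation-rule-upgrade"):
        return rule  # Flag off: leave the stored configuration untouched.

    upgraded = dict(rule)
    for field, default in DEFAULTS.items():
        # Only fill fields the user never set; never overwrite a saved value.
        upgraded.setdefault(field, default)
    return upgraded


if __name__ == "__main__":
    saved = {"advanced_fields": '{"labels": ["ops"]}', "from_name": "IT Helpdesk"}
    print(upgrade_rule(saved, FeatureFlags({"automation-rule-upgrade": True})))
    # The user's saved values are preserved; only missing fields get defaults.
```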
Description: Between 23:15 UTC on 28 Feb 2024 and 00:05 UTC on 29 Feb 2024, we experienced an issue with new product purchasing for all products. All new sign-up products have now been successfully provisioned, and we have confirmed that the issue is resolved and the service is operating normally.
Status: Resolved
Impact: Minor | Started At: Feb. 29, 2024, 1:27 a.m.
Description:

### Summary

On February 28, 2024, between 12:17 UTC and 15:23 UTC, Jira and Confluence apps built on the Connect platform were unable to perform any actions on behalf of users. Some apps may have retried and later succeeded with these actions, whereas others may have failed the request. The incident was detected within four minutes by automated monitoring of service reliability and mitigated by manually scaling the service, which put Atlassian systems into a known good state. The total time to resolution was about three hours and six minutes.

### **Technical Summary**

On February 28, 2024, between 12:17 UTC and 15:23 UTC, Jira and Confluence apps built on the Connect platform were unable to perform token exchanges, specifically for user impersonation requests initiated by the app. The event was triggered by the failure of the oauth-2-authorization-server service to scale as the load increased. The unavailability of this service and apps retrying failing requests created a feedback loop, compounding the impact of the service not scaling. The problem impacted customers in all regions. The incident was detected within four minutes by automated monitoring of service reliability and mitigated by manually scaling the service, which put Atlassian systems into a known good state. The total time to resolution was about three hours and six minutes.

### **IMPACT**

The overall impact was on February 28, 2024, between 12:17 UTC and 15:23 UTC, and affected Connect apps for Jira and Confluence products that relied on the user impersonation feature. The incident caused service disruption to customers in all regions. Apps that made requests to act on behalf of users would have seen some of their requests failing throughout the incident. Where apps had retry mechanisms in place, these requests may have eventually succeeded once the service was in a good state. Impacted apps received HTTP 502 and 503 errors as well as request timeouts when making requests to the oauth-2-authorization-server service.

Product functionality such as automation rules in Automation for Jira is partially built on the Connect platform, and some of this functionality was impacted. During the impact window, Automation rules performing rule executions on behalf of a user instead of Automation for Jira failed to authenticate. Rules that failed were recorded in the Automation Audit Log. Additionally, manually triggered rules would have failed to trigger; these will not appear in the Automation Audit Log. Overall, this impacted approximately 2% of all rules run in the impact window. Automation for Confluence was not impacted.

### **ROOT CAUSE**

The issue was caused by an increase in traffic to the oauth-2-authorization-server service in the US-East region and the service not autoscaling in response to the increased load. As the service began to fail requests, apps retried the requests, which further increased the service load. By adding additional processing resources (scaling the nodes), the service was able to handle the increased load and restore availability. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because the load conditions had not been encountered previously. The service has operated for many years in its current configuration and has never experienced this particular failure mode, where traffic ramped faster than our ability to scale. As such, the scaling controls were not exercised, and when required they did not proactively scale the oauth-2-authorization-server service because the CPU scaling threshold was never reached.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

We know that outages impact your productivity. We are prioritizing the following improvement actions to avoid repeating this type of incident:

* The CPU threshold for scaling the service has been lowered significantly so that scaling begins much earlier as the service load increases in each region.
* We are updating our scaling policy to switch to step scaling in order to scale capacity more rapidly if there are significant load increases.
* We have increased the minimum number of nodes for the service and will monitor service behaviour to determine the optimal minimum scaling value.
* Further analysis of rate-limiting triggers will be undertaken to determine whether apps are responding to rate limiting appropriately. The service rate limiting is described in [https://developer.atlassian.com/cloud/confluence/user-impersonation-for-connect-apps/#rate-limiting](https://developer.atlassian.com/cloud/confluence/user-impersonation-for-connect-apps/#rate-limiting).
* Longer term, network-based rate limiting will be explored to prevent a misbehaving app from overloading the service.

We apologize to customers, partners, and developers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,
Atlassian Customer Support
Status: Postmortem
Impact: Critical | Started At: Feb. 28, 2024, 1:40 p.m.
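The feedback loop described above (failing requests being retried immediately, adding load to an already saturated service) and the rate-limiting follow-up are both mitigated on the app side by retrying with exponential backoff and jitter and honouring Retry-After. Below is a hedged, stand-alone Python sketch of that pattern; the endpoint URL, retry budget, and payload handling are assumptions for illustration and do not reflect the documented Connect client behaviour:

```python
import random
import time
import urllib.error
import urllib.request

# Hypothetical token-exchange URL; the real Connect user-impersonation
# endpoint and payload are documented by Atlassian and not reproduced here.
TOKEN_URL = "https://example.atlassian.net/oauth2/token"

RETRYABLE = {429, 502, 503, 504}


def exchange_token_with_backoff(request_body: bytes, max_attempts: int = 5) -> bytes:
    """POST a token-exchange request, backing off instead of hammering the server."""
    for attempt in range(1, max_attempts + 1):
        try:
            req = urllib.request.Request(TOKEN_URL, data=request_body, method="POST")
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code not in RETRYABLE or attempt == max_attempts:
                raise
            # Honour Retry-After when the server sends one (e.g. on HTTP 429);
            # otherwise use exponential backoff with full jitter.
            retry_after = err.headers.get("Retry-After")
            if retry_after and retry_after.isdigit():
                delay = int(retry_after)
            else:
                delay = random.uniform(0, 2 ** attempt)
            time.sleep(delay)
        except urllib.error.URLError:
            # Network-level failure or timeout: back off before retrying.
            if attempt == max_attempts:
                raise
            time.sleep(random.uniform(0, 2 ** attempt))
    raise RuntimeError("retry loop exited unexpectedly")
```

The jitter spreads retries from many clients over time, which is what prevents the synchronized retry storm that compounded the load in this incident.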
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.