Get notified about any outages, downtime or incidents for Guard and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Guard.
OutLogger tracks the status of these components for Guard:
Component | Status |
---|---|
Account Management | Active |
API tokens | Active |
Audit Logs | Active |
Domain Claims | Active |
SAML-based SSO | Active |
Signup | Active |
User Provisioning | Active |
Guard Premium | Active |
Data Classification | Active |
Data Security Policies | Active |
Guard Detect | Active |
View the latest incidents for Guard and check for official updates:
Description: This issue has been resolved and services are operating normally.
Status: Resolved
Impact: None | Started At: Dec. 18, 2020, 9:48 p.m.
Description:

### **SUMMARY**

From 13:57 UTC on November 25th, 2020, to 19:50 UTC on November 27th, 2020, a portion of data synchronization within Atlassian systems was delayed by up to 54 hours, with a subset of real-time customer functionality being down for the first 15 hours of that window. The incident was caused by an outage of an AWS service that Atlassian cloud infrastructure depends on.

Customers of Atlassian's Cloud Platform observed the following impact: across multiple products, users experienced delays in completion of new sign-ups, user deletions, authentication and authorization policy changes, updating of search results, propagation of product-emitted triggers to Forge apps, and activity and in-app notifications; features behind a personalized rollout flag were served incorrectly; and users were unable to at-mention newly signed-up users.

In addition, service was degraded for the following product capabilities:

* Jira - Automation rules were delayed in being enacted, and activity details did not propagate accurately to Jira's Your Work page and [start.atlassian.com](http://start.atlassian.com) for the duration of the outage
* Confluence - Search results and analytics functionality such as page views were not updated; user permission changes were also lagging for the duration of the outage
* Trello - Search results were not updated and user permission changes were lagging for the duration of the outage
* Opsgenie - Logging out, user invites, and user access post-onboarding were delayed for the duration of the outage
* Bitbucket - Push and merge operations were delayed for the duration of the outage
* Statuspage - User invites, new sign-up completion, and user permission changes were delayed for the duration of the outage

The incident was detected within 8 minutes via our automated monitoring systems. We mitigated the impact by redirecting our internal asynchronous communication traffic from the US East region to the US West region, which put our systems into a known good state. We restored all product functionality for customers within 15 hours; the total time to resolution, including clearing the backlog of data synchronization, was about 54 hours and 19 minutes.

### **ROOT CAUSE**

The event was triggered by a significant 14-hour AWS outage ([https://aws.amazon.com/fr/message/11201/](https://aws.amazon.com/fr/message/11201/)) in the US East region. Atlassian's Enterprise Service Bus (ESB) is the backbone for asynchronous communication between services and systems. The ESB has a hard dependency on AWS Kinesis, which was part of the AWS outage. As a result, a significant portion of the data flow within Atlassian systems was either delayed or could not succeed, because the data pipe that carries communications following a user's activity was down. This outage impacted customers across the globe.

### **TECHNICAL REASONS**

Atlassian has many internal systems that perform follow-up actions after a user's interaction with our products. Examples of such follow-up actions include propagating correct authentication and authorization policy updates, updating our search indexes, provisioning access for new users after sign-up, and triggering automation after a data update. All of these systems rely on being informed asynchronously, via our Enterprise Event Bus, about the prior action a user has taken or a data change that has occurred.

Our Enterprise Event Bus is in turn dependent on AWS Kinesis, a data processing platform that broadcasts messages between message-producing systems and client systems, each interested in consuming a subset of those messages depending on the client's designated follow-up functionality. A total outage of AWS Kinesis in one of our major geographic regions, US East, led to a significant outage for Atlassian due to the inability to propagate any information within our systems via our Enterprise Event Bus.

### **REMEDIAL ACTIONS PLAN & NEXT STEPS**

During the post-incident review, we identified enhancements to our technical architecture and resilience measures to counter failures of our Enterprise Service Bus and AWS Kinesis. Moving forward, to minimize the hard dependency on AWS Kinesis, we will implement automated migration of customer traffic to a Kinesis instance in another geographic region during an outage, as well as better retention of data at key stages of data flow within our systems, to improve data synchronization recovery in case of an outage.
Status: Postmortem
Impact: None | Started At: Nov. 25, 2020, 6:31 p.m.
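For readers unfamiliar with the pattern the postmortem describes, the Enterprise Event Bus is a publish/subscribe fan-out over Kinesis: producers append events to a shared stream, and each downstream system reads the stream and acts only on the event types it cares about. The sketch below is a minimal, hypothetical illustration using boto3; the stream name, event shape, and handler are assumptions made for illustration, not Atlassian's actual implementation.

```python
# Minimal publish/subscribe fan-out over Kinesis (illustrative sketch only;
# the stream name, event schema, and handler below are hypothetical).
import json
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "enterprise-event-bus"  # hypothetical stream name


def publish(event_type: str, payload: dict) -> None:
    """Producer side: append one event to the shared stream."""
    kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps({"type": event_type, "payload": payload}).encode(),
        PartitionKey=event_type,  # keeps events of one type on the same shard
    )


def handle(event: dict) -> None:
    """Placeholder for a subscriber's follow-up action (e.g. reindex a page)."""
    print("processing", event["type"])


def consume(interesting_types: set) -> None:
    """Consumer side: read every record, act only on the subscribed subset."""
    shard_id = kinesis.list_shards(StreamName=STREAM)["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
    )["ShardIterator"]
    while iterator:
        batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in batch["Records"]:
            event = json.loads(record["Data"])
            if event["type"] in interesting_types:
                handle(event)
        iterator = batch.get("NextShardIterator")
        time.sleep(1)  # stay under the per-shard read rate
```

A production consumer would track every shard and checkpoint its position (for example via the Kinesis Client Library) rather than reading a single shard from LATEST; the point here is only that when the stream itself is down, none of the follow-up actions can run.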
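The remedial plan above calls for automated migration of traffic to a Kinesis instance in another region during an outage. A rough producer-side sketch of that idea, assuming a standby stream with the same name already exists in US West (the stream name, regions, and health check are illustrative assumptions, not Atlassian's published design):

```python
# Producer-side regional failover for Kinesis (illustrative sketch only;
# assumes a standby stream with the same name already exists in us-west-2).
import json

import boto3
from botocore.exceptions import BotoCoreError, ClientError

STREAM = "enterprise-event-bus"  # hypothetical stream name
PRIMARY, SECONDARY = "us-east-1", "us-west-2"


def healthy(region: str) -> bool:
    """True if the stream in `region` is reachable and reports ACTIVE."""
    try:
        client = boto3.client("kinesis", region_name=region)
        summary = client.describe_stream_summary(StreamName=STREAM)
        return summary["StreamDescriptionSummary"]["StreamStatus"] == "ACTIVE"
    except (BotoCoreError, ClientError):
        return False


def publish(event_type: str, payload: dict) -> None:
    """Write to the primary region, falling back to the secondary when it is down."""
    region = PRIMARY if healthy(PRIMARY) else SECONDARY
    boto3.client("kinesis", region_name=region).put_record(
        StreamName=STREAM,
        Data=json.dumps({"type": event_type, "payload": payload}).encode(),
        PartitionKey=event_type,
    )
```

In practice the health decision would be cached or driven by a circuit breaker rather than probed on every write, and consumers would need to drain both regions after a failover to recover any backlog.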
Description: Between 23:45 UTC and 02:19 UTC, some customers experienced intermittent failures connecting to Atlassian Cloud. The root cause was an increased DNS error rate from our infrastructure supplier. The supplier fixed the upstream issue and we have verified that the services have recovered. The conditions that caused the issue have been addressed, and we are actively working on a permanent fix. The issue has been resolved and the service is operating normally.
Status: Resolved
Impact: Major | Started At: Oct. 22, 2019, 10:32 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.