Outage and incident data over the last 30 days for getstream.io.
OutLogger tracks the status of these components for getstream.io:
Region / Component | Status |
---|---|
Brazil (São Paulo) | Active |
↳ Chat - Edge | Active |
Canada | Active |
↳ Chat - Edge | Active |
Dublin | Active |
↳ Chat - API | Active |
↳ Chat - Edge | Active |
↳ Feed - API | Active |
Frankfurt | Active |
↳ Chat - Edge | Active |
Global services | Active |
↳ CDN | Active |
↳ Dashboard | Active |
↳ Edge | Active |
Mumbai | Active |
↳ Chat - API | Active |
↳ Chat - Edge | Active |
↳ Feed - API | Active |
Ohio | Active |
↳ Chat - API | Active |
↳ Chat - Edge | Active |
Singapore | Active |
↳ Chat - API | Active |
↳ Chat - Edge | Active |
↳ Feed - API | Active |
South Africa (Cape Town) | Active |
↳ Chat - Edge | Active |
Sydney | Active |
↳ Chat - API | Active |
↳ Chat - Edge | Active |
Tokyo | Active |
↳ Chat - Edge | Active |
↳ Feed - API | Active |
US-East | Active |
↳ Chat - API | Active |
↳ Chat - Edge | Active |
↳ Feed - API | Active |
↳ Feed - Personalization | Active |
↳ Feed - Realtime notifications | Active |
View the latest incidents for getstream.io and check for official updates:
Description: We experienced higher than normal error rates during database maintenance on the Chat API. The elevated error rate started at 5:24 AM and was resolved at 5:42 AM UTC.
Status: Resolved
Impact: Major | Started At: April 14, 2021, 5:30 a.m.
Description: High error rate on Chat HTTP APIs
Status: Resolved
Impact: Major | Started At: March 15, 2021, 6:30 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: Jan. 6, 2021, 5 a.m.
Description: Millions of requests to the handshake endpoint of our feed realtime system broke the API. This issue has been resolved and a full post mortem will follow.
Status: Resolved
Impact: Critical | Started At: Jan. 4, 2021, 4:57 p.m.
Description: We have completed the post mortem for the December 9th incident.

As the founder and CEO of Stream, I'd like to apologize to all of our customers impacted by this issue. Stream powers activity feeds and chat for a billion end users, and we recognize that our customers operating in important sectors, such as healthcare, education, finance, and social apps, rely on our technology. As such, we have a responsibility to ensure that these systems are always available. Stability and performance are the cornerstone of what makes a hosted API like Stream work. Over the last 5 years it has been extremely rare for us to have stability issues, and our team spends a significant amount of time and resources to keep up that track record. On December 9th, however, we made some significant mistakes, and we need to learn from them, as a team, and do better in the future.

**The Outage**

A rolling deployment between 11:28 GMT and 14:38 GMT was made to chat shards in the US-East and Singapore regions. The code contained an issue with our Raft-based replication system, causing 66% of message events to not be delivered. Messages were still stored and retrievable via the API, and the event replay endpoint also still returned messages. At 17:00 GMT the issue was identified and the code was rolled back, resolving the issue for all shards by 17:38 GMT. While the end-user impact on the chat experience depends on the SDK, the offline storage integration, and the API region, for most apps this meant a very significant disruption to chat functionality.

**What Went Wrong**

As with any significant downtime event, a combination of problems caused this outage:

1. The issue with the broken code should have been caught during our review process.
2. The QA process should have identified this issue. Unfortunately, tests were run on a single-node setup and did not capture the bug.
3. The drop in message events should have been visible during the rolling deploy.
4. Monitoring and alerting should have picked up the issue before our customers reported it.

**Resolution 1 - Monitoring**

The biggest and most glaring issue here is the monitoring. While we do have extensive monitoring and alerting in place, we did not have a check that captured message propagation. The team is introducing monitoring to track message delivery and adding alerting rules.

**Resolution 2 - QA**

The second issue is that our extensive QA test suite didn't catch this bug, since it only occurred when running Stream in a multi-cluster environment. We are updating our QA process to run in a cluster environment, so that it more closely resembles production systems.

**Resolution 3 - Heartbeat Monitoring**

The previous two resolutions would have been enough to avoid this incident or reduce it to a very minor one. That said, the Chat API is a complex system, and we think that more end-to-end testing will make issues easier to notice. For this reason we are also going to introduce canary-like testing so that we can detect failures at the client-side level as well.

**Non-Technical Factors**

Stream has been growing extremely rapidly over the last year. Our team grew from 31 to 93 people in the last 12 months, and Chat API usage has grown even faster than that. Keeping up with this level of growth requires constant changes to processes and operations such as monitoring and deployment. This is something we have to reflect on as a team and do better.

**Conclusion**

Performance and stability are key focus areas for us and something we spend a significant part of our engineering efforts on. Yesterday we let our customers down. For that, Tommaso and I would like to apologize. The entire team at Stream will strive to do better in the future.
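The canary-style testing described in the resolutions above can be sketched generically: publish a uniquely tagged test message through the normal delivery path and raise an alert if it does not arrive within a deadline. The sketch below is illustrative only and is not Stream's actual implementation; the `send` and `receive` hooks are hypothetical stand-ins for whatever publish/subscribe calls the real system exposes.

```python
import time

def run_canary(send, receive, timeout=5.0, poll=0.5):
    """End-to-end delivery check: publish a uniquely tagged test message
    and poll the receiving side until the tag is observed.

    send(tag)  -- hypothetical hook that publishes a message carrying `tag`
    receive()  -- hypothetical hook returning the tags delivered so far
    Returns True if the tag arrives within `timeout` seconds, else False
    (the condition that should fire an alert).
    """
    tag = f"canary-{time.time_ns()}"  # unique per run, so stale deliveries don't match
    send(tag)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if tag in receive():
            return True
        time.sleep(poll)
    return False

# Demo with an in-memory "transport" standing in for the real chat pipeline:
delivered = set()
assert run_canary(delivered.add, lambda: delivered, timeout=1.0, poll=0.1)

# A broken pipeline that silently drops messages is detected as a failure:
assert not run_canary(lambda tag: None, lambda: set(), timeout=0.3, poll=0.1)
```

A check like this run on a schedule from each region would have surfaced the December 9th delivery drop directly from the client's point of view, independently of server-side metrics.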
Status: Postmortem
Impact: Major | Started At: Dec. 9, 2020, 4:59 p.m.