Get notified about any outages, downtime, or incidents for Mezmo and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for Mezmo.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Mezmo:
| Component | Status |
|---|---|
| Log Analysis | Active |
| Alerting | Active |
| Archiving | Active |
| Livetail | Active |
| Log Ingestion (Agent/REST API/Code Libraries) | Active |
| Log Ingestion (Heroku) | Active |
| Log Ingestion (Syslog) | Active |
| Search | Active |
| Web App | Active |
| Pipeline | Active |
| Destinations | Active |
| Ingestion / Sources | Active |
| Processors | Active |
| Web App | Active |
View the latest incidents for Mezmo and check for official updates:
Description: The incident has been resolved and the logs are accessible in the web app.
Status: Resolved
Impact: Major | Started At: April 20, 2021, 9:04 a.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: March 24, 2021, 10:38 p.m.
Description:
**Dates:**
Start Time: Thursday, March 4, 2021, at ~03:45 UTC
End Time: Thursday, March 4, 2021, at ~08:20 UTC
Duration: ~4 hours 36 minutes

**What happened:**
Our Web UI returned the error message "Request returned an error. Try again?" when users tried to perform a search query or use Live Tail in the Web UI.

**Why it happened:**
The pods that run our searching and Live Tail services were automatically terminated by our Kubernetes orchestration system. Upon investigation, we discovered we had inadvertently classed these services as low priority. The incident occurred when a large number of other services that were classed as higher priority needed to run to meet usage demands. The orchestration system automatically terminated the lower-priority services to make resources available for the higher-priority services. More specifically, these pods were put into a “terminating” state. Normally this state is a temporary transition between “running” and “terminated”; during this incident, the pods remained in the “terminating” state permanently. Our monitoring detects services that have been “terminated”, but not ones that are in the temporary “terminating” state. Consequently, our infrastructure team was not notified.

**How we fixed it:**
We increased the priority of the pods that run our searching and Live Tail services to match the priority of other services, and we updated the configuration of our orchestration system to make the change permanent.

**What we are doing to prevent it from happening again:**
We’ve already updated the configuration of our orchestration system to give services the correct priority. These changes are permanent and should prevent similar problems in the future. A sketch of this kind of priority change follows this entry.
Status: Postmortem
Impact: Major | Started At: March 4, 2021, 8 a.m.
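The fix described above amounts to giving the search and Live Tail pods a scheduling priority on par with other workloads so they are no longer preempted first. Below is a minimal, hypothetical sketch of that kind of change using the official `kubernetes` Python client; the PriorityClass name, priority value, deployment names, and namespace are made-up placeholders, not Mezmo's actual configuration.

```python
# Hypothetical sketch: raising pod priority with the official `kubernetes`
# Python client. All names and values below are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

# 1. Create a PriorityClass comparable to the one used by other services.
priority_class = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="search-livetail-critical"),
    value=1000000,          # higher value = evicted/preempted last
    global_default=False,
    description="Keeps search and Live Tail pods from being preempted "
                "in favor of other workloads.",
)
client.SchedulingV1Api().create_priority_class(priority_class)

# 2. Point each deployment's pod template at the new PriorityClass.
apps = client.AppsV1Api()
patch = {"spec": {"template": {"spec": {
    "priorityClassName": "search-livetail-critical"}}}}
for deployment in ("search", "livetail"):          # placeholder names
    apps.patch_namespaced_deployment(
        name=deployment, namespace="logging", body=patch)
```

Applying the patch triggers a rolling restart of those deployments, after which the scheduler treats the pods as equal in priority to the other services rather than as the first candidates for eviction.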
Description:
## Dates:
Start Time: Friday, February 26, 2021, at 06:43 UTC
End Time: Friday, February 26, 2021, at 20:42 UTC

## What happened:
The insertion of newly submitted logs stopped entirely for all accounts for about 3 hours. Logs were still available in Live Tail but not for searching, graphing, and timelines. The ingestion of logs from clients was not interrupted and no data was lost. For more than 95% of newly submitted logs, log processing returned to normal speeds within 3 hours, and all logs submitted during the 3-hour pause were available again about 30 minutes later. For less than 5% of newly submitted logs, log processing returned to normal speeds gradually, and logs submitted during the 3-hour pause also gradually became available; this impact was limited to about 12% of accounts. The incident was closed when logs from all time periods for all accounts were entirely available.

## Why it happened:
Our service ran out of a set of resources that manage pre-sharding on the clusters that store logs, an operation that ensures new logs are promptly inserted into the clusters. This happened because of several simultaneous changes to our infrastructure that didn’t account for the need for more resources, particularly on clusters with a relatively large number of shards relative to their overall storage capacity. The insertion of new logs slowed down and the backlog of unprocessed logs grew. Eventually, the portion of our service that processes new logs was unable to keep up with demand.

## How we fixed it:
We restarted the portion of our service that processes newly submitted logs. During the recovery, we prioritized restoring logs submitted in the last day. 95% of accounts were fully recovered after 3.5 hours.

## What we are doing to prevent it from happening again:
We’ve increased the scale of the set of resources that ensure logs are processed promptly by adding more servers for these resources to run on. We’ve also added alerting for when these resources are reaching their limit; a minimal sketch of such a capacity alert follows this entry.
Status: Postmortem
Impact: Minor | Started At: Feb. 26, 2021, 6:43 a.m.
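The postmortem does not name the datastore or the exact metric the new alert watches. Purely as an illustration, here is what a shard-capacity check could look like if the log clusters were Elasticsearch-style, using the `elasticsearch` Python client; the endpoint, per-node shard limit, and threshold are all assumptions.

```python
# Hypothetical capacity alert, assuming an Elasticsearch-style log cluster.
# All names, limits, and endpoints below are placeholders.
from elasticsearch import Elasticsearch

MAX_SHARDS_PER_NODE = 1000   # assumed value of cluster.max_shards_per_node
ALERT_THRESHOLD = 0.80       # warn once 80% of shard capacity is in use

es = Elasticsearch("http://localhost:9200")  # placeholder cluster endpoint
health = es.cluster.health()

capacity = health["number_of_data_nodes"] * MAX_SHARDS_PER_NODE
usage = health["active_shards"] / capacity

if usage >= ALERT_THRESHOLD:
    # A real alert would page the on-call team; printing stands in for that here.
    print(f"Shard usage at {usage:.0%} of capacity "
          f"({health['active_shards']}/{capacity} shards)")
```

A check like this, run on a schedule, would surface the "resources are reaching their limit" condition before insertion stalls rather than after.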
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.