Last checked: 9 minutes ago
Get notified about any outages, downtime, or incidents for Mezmo and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for Mezmo.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for Mezmo:
Component | Status |
---|---|
Log Analysis | Active |
Alerting | Active |
Archiving | Active |
Livetail | Active |
Log Ingestion (Agent/REST API/Code Libraries) | Active |
Log Ingestion (Heroku) | Active |
Log Ingestion (Syslog) | Active |
Search | Active |
Web App | Active |
Pipeline | Active |
Destinations | Active |
Ingestion / Sources | Active |
Processors | Active |
Web App | Active |
View the latest incidents for Mezmo and check for official updates:
Description: **Dates:** Start Time: Tuesday, January 18, 2022, at 21:00:00 UTC. End Time: Wednesday, January 19, 2022, at 05:30:00 UTC. Duration: 8:30:00.

**What happened:** Our Web UI returned an error when customers tried to log in or load pages. The errors persisted for short intervals – about 1-2 minutes each – then usage returned to normal. There were about 20 such intervals over the course of 4+ hours. The ingestion of logs was also halted during these 1-2 minute intervals; all LogDNA agents running in customer environments quickly resent the logs. Alerting was halted for the duration of the incident, and new sessions of Live Tail could not be started.

**Why it happened:** We updated our parser service, which required scaling down all pods and restarting them. A new feature of the parser is to flush memory to our Redis database upon restart. The flushing worked as intended, but it also overwhelmed the database and made it unavailable to other services. This caused the pods running our Web UI and ingestion service to go into a "Not Ready" state; our API gateway then stopped sending traffic to these pods. When customers tried to load pages in the Web UI, the API gateway returned an error. When the Redis database became unresponsive, our alerting service stopped working and new sessions of Live Tail could not be started. Our monitoring of these services was inadequate, and we were not alerted.

**How we fixed it:** Restarting the parser service was unavoidable. We split the restart process into small segments to keep the intervals of unavailability as short as possible. In practice, there were 20 small restarts over 4+ hours, each causing 1-2 minutes of unavailability. The Web UI and the ingestion service were fully operational by January 19 at 01:21 UTC. On January 19 at 05:30 UTC we manually restarted the Alerting and Live Tail services, which then returned to normal operation.

**What we are doing to prevent it from happening again:** We've added code to slow down the shutdown process for the parser service and stagger its impact on our Redis database over time. Restarting the parser is uncommon; before any future parser updates in production, we intend to run load tests of restarts to confirm Redis is no longer affected by the new flushing behavior. We will also improve our monitoring to alert us when services like Live Tail and Alerting are not functioning.
Status: Postmortem
Impact: Minor | Started At: Jan. 18, 2022, 10:59 p.m.
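The postmortem above says the fix was to slow the parser's shutdown so its memory flush is staggered over time rather than hitting Redis in one burst. The actual Mezmo/LogDNA code is not public; the snippet below is a minimal sketch of that general technique, assuming a hypothetical in-memory buffer and the standard redis-py client.

```python
import time

import redis  # redis-py client; an assumption, the real stack is not documented here

# Hypothetical in-memory state the parser would need to flush on shutdown.
BUFFER = [(f"parser:line:{i}", f"payload-{i}") for i in range(10_000)]

BATCH_SIZE = 500       # keys written per pipeline round trip (illustrative)
PAUSE_SECONDS = 0.25   # idle time between batches so Redis can serve other callers


def staggered_flush(client: redis.Redis) -> None:
    """Flush the buffer in small, paced batches instead of one large burst."""
    for start in range(0, len(BUFFER), BATCH_SIZE):
        pipe = client.pipeline(transaction=False)
        for key, value in BUFFER[start:start + BATCH_SIZE]:
            pipe.set(key, value)
        pipe.execute()             # one round trip for the whole batch
        time.sleep(PAUSE_SECONDS)  # spread the write load over time


if __name__ == "__main__":
    staggered_flush(redis.Redis(host="localhost", port=6379))
```

The batch size and pause are illustrative knobs; the point is that each round trip writes a bounded number of keys and Redis gets breathing room in between, instead of absorbing the entire flush at once.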
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: Jan. 3, 2022, 7:12 p.m.
Description: **Dates:** Start Time: Tuesday, November 23, 2021, at 16:42 UTC. End Time: Wednesday, November 24, 2021, at 17:00 UTC. Duration: 24:18:00.

**What happened:** Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines. Some accounts (about 25%) were affected more than others. For all accounts, the ingestion of logs was not interrupted and no data was lost.

**Why it happened:** Upon investigation, we discovered that the service which parses all incoming log lines was working very slowly. This service is upstream of all our other services, such as alerting, live tail, archiving, and searching; consequently, all those services were also delayed. We isolated the slow parsing to the specific content of certain log lines. These lines exposed an inefficiency in our line parsing service that resulted in exponential growth in the time needed to parse them; this in turn created a bottleneck that delayed the parsing of other log lines. The inefficiency had been present for some time but went undetected until one account started sending a large volume of these problematic lines.

**How we fixed it:** The line parsing service was updated to use a new algorithm that avoids the worst-case behavior of the original and improves line-parsing performance in general. From then on, the parsing service just needed time to process the backlog of logs sent to us by customers. Likewise, the downstream services – alerting, live tail, archiving, searching – needed time to process the logs now being sent to them by the parsing service. Recovery was quicker for about 75% of our customers and slower for the other 25%.

**What we are doing to prevent it from happening again:** The new parsing methodology has improved our overall performance significantly. We are also actively pursuing further optimizations.
Status: Postmortem
Impact: Minor | Started At: Nov. 23, 2021, 4:42 p.m.
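The postmortem above attributes the slowdown to specific log-line content that triggered exponential parse times, but it does not say what the underlying inefficiency was. Catastrophic regex backtracking is one common way a parser shows exactly that symptom, so the sketch below uses it purely as an illustration; the patterns and input are hypothetical and not taken from Mezmo's parser.

```python
import re
import time

# Nested quantifiers make this pattern ambiguous: a run of letters can be split
# between the inner '+' and the outer '+' in roughly 2**n ways, and on a line
# that ultimately cannot match, the engine backtracks through all of them.
AMBIGUOUS = re.compile(r"^([A-Za-z0-9]+\s*)+$")

# A looser but unambiguous check: a single character class, one linear pass.
LINEAR = re.compile(r"^[A-Za-z0-9\s]*$")


def time_match(pattern: re.Pattern, line: str) -> float:
    start = time.perf_counter()
    pattern.match(line)
    return time.perf_counter() - start


if __name__ == "__main__":
    # A "problematic" line: letters followed by one character that breaks the match.
    bad_line = "a" * 22 + "!"
    print(f"ambiguous pattern: {time_match(AMBIGUOUS, bad_line):.3f}s")  # seconds; doubles per extra letter
    print(f"linear pattern:    {time_match(LINEAR, bad_line):.6f}s")     # effectively instant
```

Each extra letter in the failing line roughly doubles the ambiguous pattern's runtime, which is the "exponential growth" failure mode the postmortem describes; the rewritten pattern avoids nested quantifiers and completes in a single linear scan.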
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.