Get notified about any outages, downtime or incidents for Mezmo and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Mezmo.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Outlogger tracks the status of these components for Mezmo:
Component | Status |
---|---|
Log Analysis | Active |
Alerting | Active |
Archiving | Active |
Livetail | Active |
Log Ingestion (Agent/REST API/Code Libraries) | Active |
Log Ingestion (Heroku) | Active |
Log Ingestion (Syslog) | Active |
Search | Active |
Web App | Active |
Pipeline | Active |
Destinations | Active |
Ingestion / Sources | Active |
Processors | Active |
Web App | Active |
View the latest incidents for Mezmo and check for official updates:
Description:
**Dates:** Start Time: Wednesday, October 5, 2022, at 14:27 UTC. End Time: Wednesday, October 5, 2022, at 14:45 UTC. Duration: 00:18.
**What happened:** The ingestion of logs was partially halted. The WebUI was mostly unresponsive and most API calls failed. Because many newly submitted logs were not being ingested, new logs were not immediately available for Alerting, Searching, Live Tail, Graphing, Timelines, and Archiving.
**Why it happened:** We recently added a new API gateway, Kong, to our service; it acts as a proxy for all other services. We had gradually increased the amount of traffic directed through the API gateway over several weeks and seen no ill effects. Prior to the incident, only some of the ingestion traffic went through the gateway. Kong was restarted after a routine configuration change. After the restart, all traffic for our ingestion service began to go through Kong. Our monitoring quickly revealed that the Kong service did not have enough pods to keep up with the increased workload, causing many requests to fail.
**How we fixed it:** We manually added more pods to the Kong service. Ingestion, the WebUI, and API calls began to work normally again. Once ingestion had resumed, LogDNA agents running in customer environments resent all locally cached logs to our service for ingestion. No data was lost.
**What we are doing to prevent it from happening again:** We updated Kubernetes to always assign enough pods for the Kong API gateway service to handle all traffic. We will update the Kong gateway to distribute ingestion traffic more evenly across available pods. We will adjust our deployment processes so pods are restarted more slowly, which will reduce the impact in a similar scenario. We will explore autoscaling policies so that more pods can be added automatically in a similar situation.
Status: Postmortem
Impact: Minor | Started At: Oct. 5, 2022, 2:58 p.m.
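The root cause above is a capacity question: the gateway was sized for the fraction of ingestion traffic it had been carrying, not for all of it. Below is a minimal sketch of that arithmetic in Python, using entirely hypothetical throughput numbers (Mezmo has not published per-pod figures), showing why shifting the remaining traffic onto the same pod count caused requests to fail and what a fixed minimum replica count or autoscaling floor would have to cover.

```python
import math

def required_replicas(total_rps: float, per_pod_rps: float, headroom: float = 1.5) -> int:
    """Pods needed to absorb `total_rps`, with extra headroom for spikes.

    All numbers here are hypothetical; real capacity comes from load tests.
    """
    return math.ceil((total_rps * headroom) / per_pod_rps)

# Before the incident only part of the ingestion traffic went through the gateway;
# after the restart it received all of it. Sizing for the partial load
# under-provisions the full load:
partial_load = required_replicas(total_rps=20_000, per_pod_rps=4_000)    # 8 pods
full_load = required_replicas(total_rps=120_000, per_pod_rps=4_000)      # 45 pods

print(partial_load, full_load)
```

Read this way, the two remediations in the postmortem correspond roughly to pinning the gateway deployment's minimum replica count at the full-load figure and letting an autoscaler add pods above that floor as traffic grows.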
Description:
**Dates:** Start Time: Thursday, June 30, 21:40 UTC. End Time: Thursday, June 30, 23:32 UTC. Duration: 1 hour and 52 minutes.
**What happened:** Some log lines for some customers were discarded by our service. The log lines were successfully accepted by our ingestion service, but a downstream service – the parser – removed some of them. All further downstream services, such as Alerting, Live Tail, Searching, and Archiving, never received these logs. In some cases, lines were received by Live Tail and were appended with the phrase “(not retained)”. The great majority of customers – 94.2% – were unaffected and had no log lines discarded. Approximately 3.5% had a relatively small number of log lines discarded. Approximately 2.3% had most or all of the log lines submitted during the incident discarded.
**Why it happened:** We inadvertently released code into production that contained a bug in the parser service. This bug was known to us and in the process of being fixed in our development environment, but was not yet ready for release to production. The parser service is where exclusion rules are applied to recently submitted log lines that have been ingested but not yet passed to downstream services (e.g. Alerting, Live Tail, Searching, and Archiving). The bug made the parser exclude log lines that matched inactive exclusion rules. This included exclusion rules made by customers in the past and then disabled. Customers with such rules had some log lines excluded: whichever lines matched the inactive rules. If those rules had the “Preserve these lines for live-tail and alerting” option enabled, then the excluded lines would still be processed for alerts and appear in Live Tail with the phrase “(not retained)” appended. This affected 3.5% of our customer accounts. The usage quota feature is implemented as a particular type of exclusion rule even though it is not presented in the UI as an exclusion rule. The bug made the parser exclude all log lines if the usage quota feature was enabled for an account. This affected 2.3% of our customer accounts. Our monitoring did not detect the decrease in lines being passed from the parser to downstream services because the change was within the range of normal fluctuation rates. These rates vary significantly as traffic changes and as customers choose to enable or disable exclusion rules.
**How we fixed it:** We reverted the last release of parser code to the previous version. Once the previous version was deployed to all pods running the parser service, log lines stopped being discarded.
**What we are doing to prevent it from happening again:** We added a code-level test to ensure inactive exclusion rules are never applied by the parser (such tests are part of our standard operating procedure). We will review our release process to understand how the code containing the bug was moved into production and improve our processes to prevent a similar event in the future.
Status: Postmortem
Impact: Major | Started At: June 30, 2022, 9:36 p.m.
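This postmortem turns on one invariant: disabled exclusion rules must never drop lines, while rules with the “Preserve these lines for live-tail and alerting” option may exclude lines from retention but still feed Live Tail and alerting with “(not retained)” appended. Here is a minimal Python sketch of that routing logic with hypothetical names (Mezmo's actual parser code is not public), plus the kind of code-level test the postmortem says was added.

```python
import re
from dataclasses import dataclass

@dataclass
class ExclusionRule:
    pattern: str                 # lines matching this regex are excluded from retention
    active: bool                 # disabled rules must never be applied; the bug ignored this flag
    preserve_for_livetail_and_alerting: bool = False

def route_line(line: str, rules: list[ExclusionRule]) -> tuple[bool, str]:
    """Return (retain, live_tail_view) for one ingested line.

    Hypothetical model of the parser stage: only *active* exclusion rules may
    drop a line; rules with the preserve option keep the line visible to
    Live Tail and alerting, tagged "(not retained)".
    """
    for rule in rules:
        if not rule.active:      # the June 30 bug effectively skipped this check
            continue
        if re.search(rule.pattern, line):
            if rule.preserve_for_livetail_and_alerting:
                return False, f"{line} (not retained)"
            return False, ""     # fully excluded
    return True, line            # retained and passed downstream unchanged

# The kind of code-level test described: an inactive rule must never drop lines.
def test_inactive_rules_are_never_applied():
    rules = [ExclusionRule(pattern=r"healthcheck", active=False)]
    retain, _ = route_line("GET /healthcheck 200", rules)
    assert retain

test_inactive_rules_are_never_applied()
```

In this model the usage quota feature would be just another `ExclusionRule` generated by the system rather than by the customer, which is why the bug silently discarded all lines for accounts that had quotas enabled.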
Description:
**Dates:** Start Time: Thursday, June 2, 2022, at 20:25 UTC. End Time: Thursday, June 2, 2022, at 20:50 UTC. Duration: 00:25.
**What happened:** The ingestion of logs was halted for about 25 minutes. During that time, newly submitted logs were never ingested and therefore not available for Alerting, Searching, Live Tail, Graphing, Timelines, and Archiving.
**Why it happened:** We manually reverted our ingester service to an older version (to solve a minor problem unrelated to this incident). During the procedure, the version of the container was reverted, but not the container’s configuration. Because of this version mismatch, logs from the ingester stopped being accepted by a downstream service (the “buzzsaw broker”). The ingester is currently not designed to confirm that logs are accepted by downstream services; therefore it returned HTTP 200 responses to our customers’ agents, indicating the logs had been successfully received. At that point the agents discarded any locally cached log files. Consequently, all log lines sent during the incident (25 minutes) were never ingested.
**How we fixed it:** We reverted the container’s configuration correctly, so it matched the version of the container itself. Ingestion began working normally again.
**What we are doing to prevent it from happening again:** We will review and update our runbooks for reverting services to earlier versions to prevent similar mistakes. We also plan to automate the reversion process. We will add internal confirmations to the ingester so it is always certain log lines were received by downstream services. This will prevent the ingester from sending erroneous 200 responses back to the agent, should the ingester be unable to pass log lines downstream.
Status: Postmortem
Impact: Critical | Started At: June 2, 2022, 9:44 p.m.
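The failure mode here is a premature acknowledgement: the ingester returned HTTP 200 before the downstream broker had accepted the batch, so agents discarded their local caches even though the data went nowhere. The fix the postmortem describes amounts to acknowledging only after downstream confirmation. A minimal sketch of that contract, with hypothetical class and function names:

```python
class BrokerUnavailable(Exception):
    """Raised when the downstream broker does not acknowledge a batch."""

class DownstreamBroker:
    """Stand-in for the downstream service (the "buzzsaw broker" in the postmortem)."""
    def __init__(self, healthy: bool = True):
        self.healthy = healthy

    def publish(self, batch: list[str]) -> None:
        if not self.healthy:
            raise BrokerUnavailable("broker rejected the batch")
        # ... hand the batch to the real broker and wait for its ack ...

def handle_ingest(batch: list[str], broker: DownstreamBroker) -> int:
    """Return the HTTP status sent back to the agent.

    Acknowledge (200) only once the downstream broker has accepted the batch.
    On failure, return 503 so the agent keeps its locally cached lines and
    retries later instead of discarding them.
    """
    try:
        broker.publish(batch)
    except BrokerUnavailable:
        return 503   # agent retains its cache and retries
    return 200       # safe to ack: the data made it downstream

assert handle_ingest(["line 1"], DownstreamBroker(healthy=True)) == 200
assert handle_ingest(["line 1"], DownstreamBroker(healthy=False)) == 503
```

Returning a retryable error instead of 200 is exactly what lets agents resend cached lines once service recovers, as they did automatically in the October 5 incident above.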
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.