Get notified about any outages, downtime or incidents for Mezmo and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Mezmo.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Outlogger tracks the status of these components for Mezmo:
Component | Status |
---|---|
Log Analysis | Active |
Alerting | Active |
Archiving | Active |
Livetail | Active |
Log Ingestion (Agent/REST API/Code Libraries) | Active |
Log Ingestion (Heroku) | Active |
Log Ingestion (Syslog) | Active |
Search | Active |
Web App | Active |
Pipeline | Active |
Destinations | Active |
Ingestion / Sources | Active |
Processors | Active |
Web App | Active |
View the latest incidents for Mezmo and check for official updates:
Description:
**Dates:** Start Time: Wednesday, October 5, 2022, at 14:27 UTC. End Time: Wednesday, October 5, 2022, at 14:45 UTC. Duration: 00:18.
**What happened:** The ingestion of logs was partially halted. The WebUI was mostly unresponsive and most API calls failed. Because many newly submitted logs were not being ingested, new logs were not immediately available for Alerting, Searching, Live Tail, Graphing, Timelines, and Archiving.
**Why it happened:** We recently added a new API gateway, Kong, to our service; it acts as a proxy for all other services. We had gradually increased the amount of traffic directed through the API gateway over several weeks and seen no ill effects. Prior to the incident, only some of the ingestion traffic went through the gateway. Kong was restarted after a routine configuration change. After the restart, all traffic for our ingestion service began to go through Kong. Our monitoring quickly revealed that the Kong service did not have enough pods to keep up with the increased workload, causing many requests to fail.
**How we fixed it:** We manually added more pods to the Kong service. Ingestion, the WebUI, and API calls began to work normally again. Once ingestion had resumed, LogDNA agents running in customer environments resent all locally cached logs to our service for ingestion. No data was lost.
**What we are doing to prevent it from happening again:** We updated Kubernetes to always assign enough pods for the Kong API gateway service to handle all traffic. We will update the Kong gateway to distribute ingestion traffic more evenly across available pods. We will adjust our deployment processes so pods are restarted more slowly, which will reduce the impact in a similar scenario. We will explore autoscaling policies so that more pods can be added automatically in a similar situation.
Status: Postmortem
Impact: Minor | Started At: Oct. 5, 2022, 2:58 p.m.
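The root cause above is a capacity question: the gateway was sized for the fraction of ingestion traffic it had been carrying, not for all of it. Below is a minimal sketch of that arithmetic in Python, using entirely hypothetical throughput numbers (Mezmo has not published per-pod figures), showing why shifting the remaining traffic onto the same pod count caused requests to fail and what a fixed minimum replica count or autoscaling floor would have to cover.

```python
import math

def required_replicas(total_rps: float, per_pod_rps: float, headroom: float = 1.5) -> int:
    """Pods needed to absorb `total_rps`, with extra headroom for spikes.

    All numbers here are hypothetical; real capacity comes from load tests.
    """
    return math.ceil((total_rps * headroom) / per_pod_rps)

# Before the incident only part of the ingestion traffic went through the gateway;
# after the restart it received all of it. Sizing for the partial load
# under-provisions the full load:
partial_load = required_replicas(total_rps=20_000, per_pod_rps=4_000)    # 8 pods
full_load = required_replicas(total_rps=120_000, per_pod_rps=4_000)      # 45 pods

print(partial_load, full_load)
```

Read this way, the two remediations in the postmortem correspond roughly to pinning the gateway deployment's minimum replica count at the full-load figure and letting an autoscaler add pods above that floor as traffic grows.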
Description:
**Dates:** Start Time: Thursday, June 30, 21:40 UTC. End Time: Thursday, June 30, 23:32 UTC. Duration: 1 hour and 52 minutes.
**What happened:** Some log lines for some customers were discarded by our service. The log lines were successfully accepted by our ingestion service, but a downstream service – the parser – removed some of them. All further downstream services, such as Alerting, Live Tail, Searching, and Archiving, never received these logs. In some cases, lines were received by Live Tail and were appended with the phrase “(not retained)”. The great majority of customers – 94.2% – were unaffected and had no log lines discarded. Approximately 3.5% had a relatively small number of log lines discarded. Approximately 2.3% had most or all of the log lines submitted during the incident discarded.
**Why it happened:** We inadvertently released code into production that contained a bug in the parser service. This bug was known to us and in the process of being fixed in our development environment, but was not yet ready for release to production. The parser service is where exclusion rules are applied to recently submitted log lines that have been ingested but not yet passed to downstream services (e.g. Alerting, Live Tail, Searching, and Archiving). The bug made the parser exclude log lines that matched inactive exclusion rules. This included exclusion rules made by customers in the past and then disabled. Customers with such rules had some log lines excluded: whichever lines matched the inactive rules. If those rules had the “Preserve these lines for live-tail and alerting” option enabled, then the excluded lines would still be processed for alerts and appear in Live Tail with the phrase “(not retained)” appended. This affected 3.5% of our customer accounts. The usage quota feature is implemented as a particular type of exclusion rule even though it is not presented in the UI as an exclusion rule. The bug made the parser exclude all log lines if the usage quota feature was enabled for an account. This affected 2.3% of our customer accounts. Our monitoring did not detect the decrease in lines being passed from the parser to downstream services because the change was within the range of normal fluctuation rates. These rates vary significantly as traffic changes and as customers choose to enable or disable exclusion rules.
**How we fixed it:** We reverted the last release of parser code to the previous version. Once the previous version was deployed to all pods running the parser service, log lines stopped being discarded.
**What we are doing to prevent it from happening again:** We added a code-level test to ensure inactive exclusion rules are never applied by the parser (such tests are part of our standard operating procedure). We will review our release process to understand how the code containing the bug was moved into production and improve our processes to prevent a similar event in the future.
Status: Postmortem
Impact: Major | Started At: June 30, 2022, 9:36 p.m.
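This postmortem turns on one invariant: disabled exclusion rules must never drop lines, while rules with the “Preserve these lines for live-tail and alerting” option may exclude lines from retention but still feed Live Tail and alerting with “(not retained)” appended. Here is a minimal Python sketch of that routing logic with hypothetical names (Mezmo's actual parser code is not public), plus the kind of code-level test the postmortem says was added.

```python
import re
from dataclasses import dataclass

@dataclass
class ExclusionRule:
    pattern: str                 # lines matching this regex are excluded from retention
    active: bool                 # disabled rules must never be applied; the bug ignored this flag
    preserve_for_livetail_and_alerting: bool = False

def route_line(line: str, rules: list[ExclusionRule]) -> tuple[bool, str]:
    """Return (retain, live_tail_view) for one ingested line.

    Hypothetical model of the parser stage: only *active* exclusion rules may
    drop a line; rules with the preserve option keep the line visible to
    Live Tail and alerting, tagged "(not retained)".
    """
    for rule in rules:
        if not rule.active:      # the June 30 bug effectively skipped this check
            continue
        if re.search(rule.pattern, line):
            if rule.preserve_for_livetail_and_alerting:
                return False, f"{line} (not retained)"
            return False, ""     # fully excluded
    return True, line            # retained and passed downstream unchanged

# The kind of code-level test described: an inactive rule must never drop lines.
def test_inactive_rules_are_never_applied():
    rules = [ExclusionRule(pattern=r"healthcheck", active=False)]
    retain, _ = route_line("GET /healthcheck 200", rules)
    assert retain

test_inactive_rules_are_never_applied()
```

In this model the usage quota feature would be just another `ExclusionRule` generated by the system rather than by the customer, which is why the bug silently discarded all lines for accounts that had quotas enabled.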
Description:
**Dates:** Start Time: Thursday, June 2, 2022, at 20:25 UTC. End Time: Thursday, June 2, 2022, at 20:50 UTC. Duration: 00:25.
**What happened:** The ingestion of logs was halted for about 25 minutes. During that time, newly submitted logs were never ingested and therefore not available for Alerting, Searching, Live Tail, Graphing, Timelines, and Archiving.
**Why it happened:** We manually reverted our ingester service to an older version (to solve a minor problem unrelated to this incident). During the procedure, the version of the container was reverted, but not the container’s configuration. Because of this version mismatch, logs from the ingester stopped being accepted by a downstream service (the “buzzsaw broker”). The ingester is currently not designed to confirm that logs are accepted by downstream services; therefore it returned HTTP 200 responses to our customers’ agents, indicating the logs had been successfully received. At that point the agents discarded any locally cached log files. Consequently, all log lines sent during the incident (25 minutes) were never ingested.
**How we fixed it:** We reverted the container’s configuration correctly, so it matched the version of the container itself. Ingestion began working normally again.
**What we are doing to prevent it from happening again:** We will review and update our runbooks for reverting services to earlier versions to prevent similar mistakes. We also plan to automate the reversion process. We will add internal confirmations to the ingester so it is always certain log lines were received by downstream services. This will prevent the ingester from sending erroneous 200 responses back to the agent, should the ingester be unable to pass log lines downstream.
Status: Postmortem
Impact: Critical | Started At: June 2, 2022, 9:44 p.m.
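The failure mode here is a premature acknowledgement: the ingester returned HTTP 200 before the downstream broker had accepted the batch, so agents discarded their local caches even though the data went nowhere. The fix the postmortem describes amounts to acknowledging only after downstream confirmation. A minimal sketch of that contract, with hypothetical class and function names:

```python
class BrokerUnavailable(Exception):
    """Raised when the downstream broker does not acknowledge a batch."""

class DownstreamBroker:
    """Stand-in for the downstream service (the "buzzsaw broker" in the postmortem)."""
    def __init__(self, healthy: bool = True):
        self.healthy = healthy

    def publish(self, batch: list[str]) -> None:
        if not self.healthy:
            raise BrokerUnavailable("broker rejected the batch")
        # ... hand the batch to the real broker and wait for its ack ...

def handle_ingest(batch: list[str], broker: DownstreamBroker) -> int:
    """Return the HTTP status sent back to the agent.

    Acknowledge (200) only once the downstream broker has accepted the batch.
    On failure, return 503 so the agent keeps its locally cached lines and
    retries later instead of discarding them.
    """
    try:
        broker.publish(batch)
    except BrokerUnavailable:
        return 503   # agent retains its cache and retries
    return 200       # safe to ack: the data made it downstream

assert handle_ingest(["line 1"], DownstreamBroker(healthy=True)) == 200
assert handle_ingest(["line 1"], DownstreamBroker(healthy=False)) == 503
```

Returning a retryable error instead of 200 is exactly what lets agents resend cached lines once service recovers, as they did automatically in the October 5 incident above.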
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.