Is there a Mezmo outage?

Mezmo status: Systems Active

Last checked: 7 minutes ago

Get notified about any outages, downtime, or incidents for Mezmo and 1800+ other cloud vendors. Monitor 10 companies for free.

Subscribe for updates

Mezmo outages and incidents

Outage and incident data over the last 30 days for Mezmo.

There has been 1 outage or incident for Mezmo in the last 30 days.

Severity Breakdown:

Tired of searching for status updates?

Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!

Sign Up Now

Components and Services Monitored for Mezmo

OutLogger tracks the status of these components for Mezmo:

Alerting Active
Archiving Active
Livetail Active
Log Ingestion (Agent/REST API/Code Libraries) Active
Log Ingestion (Heroku) Active
Log Ingestion (Syslog) Active
Search Active
Web App Active
Destinations Active
Ingestion / Sources Active
Processors Active
Web App Active

Latest Mezmo outages and incidents.

View the latest incidents for Mezmo and check for official updates:

Updates:

  • Time: May 8, 2023, 7:09 p.m.
    Status: Postmortem
    Update: **Dates:** Start Time: Monday, May 1, 2023, at 19:55 UTC. End Time: Monday, May 1, 2023, at 20:11 UTC. Duration: 16 minutes.
    **What happened:** The WebUI was unresponsive, returning an error of “failure to get a peer from the ring-balancer.”
    **Why it happened:** All Mezmo services run within a service mesh. The portion of the mesh dedicated to the pods running our Mongo database began receiving many connection requests, more than its allocated CPU and memory could handle at once. This portion of the mesh (which itself runs on pods) quickly ran out of memory. This made the Mongo database unavailable to other services. The WebUI relies entirely on Mongo for account information and therefore became unresponsive, returning an error of “failure to get a peer from the ring-balancer.” While the immediate reason for the incident is clear, the root cause is still unknown. We suspect there was a change in user usage patterns (e.g. increased traffic, login attempts, etc.) which triggered the incident.
    **How we fixed it:** We removed the WebUI from the service mesh. The Mongo service has more CPU and memory resources allocated to it and was able to accept the high level of connection requests successfully. WebUI usage immediately returned to normal.
    **What we are doing to prevent it from happening again:** We will change the default settings for the service mesh to allocate more CPU and memory resources, permanently. Afterwards, we will add the Mongo service back to the service mesh.
  • Time: May 1, 2023, 8:27 p.m.
    Status: Resolved
    Update: This incident has been resolved.
  • Time: May 1, 2023, 8:18 p.m.
    Status: Identified
    Update: The Web UI is not accessible.
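
The postmortem above attributes the outage to the service-mesh pods in front of Mongo exhausting their allocated CPU and memory, with the permanent fix being larger default resource allocations for the mesh. As a rough illustration only (Mezmo has not published its manifests; the namespace, Deployment, container name, and resource numbers below are hypothetical), raising a proxy container's requests and limits with the official Kubernetes Python client can be sketched like this:

```python
from kubernetes import client, config

# Illustrative sketch only. The postmortem says the fix is to give the service
# mesh more CPU and memory by default; it does not name the mesh or its
# objects, so "example", "mongo-mesh-proxy", "mesh-proxy", and the resource
# values below are placeholders.
config.load_kube_config()
apps = client.AppsV1Api()

# Strategic-merge patch that raises the proxy container's requests and limits.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "mesh-proxy",  # hypothetical container name
                        "resources": {
                            "requests": {"cpu": "500m", "memory": "512Mi"},
                            "limits": {"cpu": "2", "memory": "2Gi"},
                        },
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="mongo-mesh-proxy",  # hypothetical Deployment fronting the Mongo pods
    namespace="example",
    body=patch,
)
```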

Updates:

  • Time: Feb. 21, 2023, 11:27 p.m.
    Status: Postmortem
    Update: **Dates:** Start Time: Friday, February 10, 2023, at 16:45 UTC. End Time: Friday, February 10, 2023, at 18:14 UTC. Duration: 89 minutes.
    **What happened:** Searches returned results slowly or not at all. No data was lost and ingestion was not halted.
    **Why it happened:** In a previous incident on February 6, 2023 (more details at https://status.mezmo.com/incidents/3yl9x1t7qcw5), two pods storing logs were temporarily removed from the pool of pods available for receiving and inserting new batches of logs into our data store. The pods continued to return results for previously processed logs. We took this step because the pods had fallen behind on their tasks, which we believe was a consequence of an ungraceful shutdown during the incident. We gave the pods several days to catch up on tasks and then made them available for insertion of new logs into the data store again, a change we expected to have no impact. One of the pods immediately began integrity checks to confirm the same data existed on its local disk and on our S3 storage. As a side effect of the previous incident, the pod incorrectly determined that data was missing from the local disk and began sending HTTP requests to our S3 storage to locate the missing data. In fact, the data in question is designed to only reside on local disk and was not supposed to be stored on S3. The requests failed with 404 errors when the data was not found on S3 (as expected). Every new attempt to retrieve search results generated another request. The rate of requests was high enough to slow down all requests related to search results within the pod’s zone (one out of three total). This led to search results being returned slowly or not at all.
    **How we fixed it:** We removed the pod from the pool available for receiving and inserting new batches of logs into our data store. The pod continued to return results for previously processed logs.
    **What we are doing to prevent it from happening again:** We marked this pod to remain unavailable for new logs until all previously processed logs on the pod have passed their retention period, whose maximum is 30 days. At that time, the pod will be rebuilt and begin accepting newly submitted logs again. We’ll fix the logic of our search engine so it doesn’t request data from S3 that is intentionally not stored there. This will prevent the widespread 404 errors that slowed down all searching, should a pod again incorrectly determine it is missing data from its local disk. We have added alerting and monitoring to detect high latency in search speeds and the average time to compact newly inserted logs.
  • Time: Feb. 10, 2023, 6:14 p.m.
    Status: Resolved
    Update: Searches are running at normal speeds again. All services are fully operational.
  • Time: Feb. 10, 2023, 6:11 p.m.
    Status: Identified
    Update: Searches are running slowly. We have identified the cause and are implementing a fix. (Reference # 3018)
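
The postmortem above says the follow-up fix is to stop the search path from requesting data from S3 that is intentionally kept only on local disk, so a mistaken integrity check can no longer trigger a flood of 404s. A minimal sketch of that kind of guard, assuming a hypothetical segment-fetch helper (the function, segment "kinds", and bucket are illustrative; Mezmo's actual storage code is not public):

```python
import boto3
from botocore.exceptions import ClientError

# Hypothetical sketch: segments whose kind is local-only must never fall back
# to S3, even if a local integrity check wrongly reports them missing.
LOCAL_ONLY_KINDS = {"hot-index", "wal"}  # illustrative names

s3 = boto3.client("s3")

def fetch_segment(bucket: str, key: str, kind: str, local_path: str) -> bytes:
    try:
        with open(local_path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        if kind in LOCAL_ONLY_KINDS:
            # Refusing the remote lookup avoids the 404-per-search storm that
            # slowed the whole zone down during this incident.
            raise RuntimeError(f"{key} is local-only; not falling back to S3")
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except ClientError as err:
            raise RuntimeError(f"segment {key} not found in S3") from err
```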

Updates:

  • Time: Feb. 7, 2023, 11:44 p.m.
    Status: Postmortem
    Update: **Dates:** Start Time: Monday, February 6, 2023, at 20:05 UTC. End Time: Tuesday, February 7, 2023, at 00:30 UTC. Duration: 4 hours and 25 minutes.
    **What happened:** Searches returned results slowly or not at all. Our Web UI was intermittently unresponsive, particularly for pages like Live Tail, Graphing, and Timelines. No data was lost and ingestion was not halted.
    **Why it happened:** We initiated an upgrade of all nodes in our service, including the nodes that store logs. Pods were gradually moved to other nodes and restarted, so as to prevent any interruption in service. A single pod that stores logs did not restart normally. Upon investigation, we found that it had not shut down cleanly and some files essential to a normal startup had not been written to disk. More significantly, we discovered that all nodes that store logs were using a podManagementPolicy of “orderedReady” (the default setting). This forced pods to restart in an ordered sequence. The single pod that would not restart was in the middle of the sequence; all the pods later in the sequence followed the policy and did not start either. In effect, about 25% of the pods within one zone (out of the three zones devoted to storing logs) were unable to start. The remaining pods in the zone were forced to take on extra work, such as accepting new logs, compacting data, and answering queries from our internal APIs. This led to slow searches and slow load times for any part of the Web UI that displays data about logs.
    **How we fixed it:** We temporarily added more pods to run API calls to increase the odds of them succeeding. We changed the podManagementPolicy to “Parallel” to allow all pods to restart, regardless of their position in the ordered sequence for starting up. We made manual edits to the pod that had not restarted cleanly so it could start again. These steps brought search latency back to normal speeds and made API calls work again. We cordoned off two pods that had fallen far behind in processing to allow them to recover without taking on new tasks. This temporarily removed ~2% of logs from all search results. When these pods were caught up with all pending tasks, we made them available again for search queries.
    **What we are doing to prevent it from happening again:** We have changed the podManagementPolicy to “Parallel” for all nodes that store logs. We will review the podManagementPolicy of all other areas of our service and make changes where appropriate. We will add alerting and monitoring to detect high latency in search speeds and the average time to compact newly inserted logs. We’ll explore options for adding more resources to each zone of pods, so they are less likely to fall behind on processing tasks when some pods are unavailable. We’ll explore ways to prevent unclean shutdowns of pods when nodes are upgraded.
  • Time: Feb. 7, 2023, 2:07 a.m.
    Status: Resolved
    Update: All data is being returned in search results. All services are fully operational.
  • Time: Feb. 7, 2023, 12:16 a.m.
    Status: Identified
    Update: Web UI pages are loading at normal speeds and searches are returning quickly again. A small amount of data (<2%) is not being returned in search results. We are working to restore access to the results.
  • Time: Feb. 6, 2023, 10:26 p.m.
    Status: Identified
    Update: We are continuing to work on a fix for this incident.
  • Time: Feb. 6, 2023, 8:48 p.m.
    Status: Identified
    Update: We are seeing intermittent delays loading the Web UI and running searches. We are taking remedial action.
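
The postmortem above hinges on the StatefulSet podManagementPolicy: with the default "OrderedReady", pods start strictly in ordinal order, so one stuck pod blocks every later pod, while "Parallel" lets each pod start independently. Below is a minimal sketch of declaring the "Parallel" policy with the Kubernetes Python client; the names, replica count, and image are placeholders, not Mezmo's actual configuration:

```python
from kubernetes import client, config

# Illustrative sketch only: a StatefulSet declared with
# podManagementPolicy "Parallel" instead of the default "OrderedReady".
# All names, counts, and the image are hypothetical.
log_store = client.V1StatefulSet(
    metadata=client.V1ObjectMeta(name="log-store", namespace="example"),
    spec=client.V1StatefulSetSpec(
        service_name="log-store",
        replicas=12,
        pod_management_policy="Parallel",  # pods start independently of ordinal order
        selector=client.V1LabelSelector(match_labels={"app": "log-store"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "log-store"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(name="log-store", image="example/log-store:latest")
                ]
            ),
        ),
    ),
)

config.load_kube_config()
client.AppsV1Api().create_namespaced_stateful_set(namespace="example", body=log_store)
```

Note that on an existing StatefulSet this field cannot simply be edited in place; it is typically applied by recreating the object, which is one reason a stuck ordered rollout is painful to recover from.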

Updates:

  • Time: Oct. 12, 2022, 6:52 p.m.
    Status: Postmortem
    Update: **Dates:** Start Time: Wednesday, October 5, 2022, at 14:27 UTC. End Time: Wednesday, October 5, 2022, at 14:45 UTC. Duration: 18 minutes.
    **What happened:** The ingestion of logs was partially halted. The WebUI was mostly unresponsive and most API calls failed. Because many newly submitted logs were not being ingested, new logs were not immediately available for Alerting, Searching, Live Tail, Graphing, Timelines, and Archiving.
    **Why it happened:** We recently added a new API gateway, Kong, to our service, which acts as a proxy for all other services. We had gradually increased the amount of traffic directed through the API gateway over several weeks and seen no ill effects. Prior to the incident, only some of the traffic for ingestion went through the gateway. Kong was restarted after a routine configuration change. After the restart, all traffic for our ingestion service began to go through Kong. Our monitoring quickly revealed the Kong service did not have enough pods to keep up with the increased workload, causing many requests to fail.
    **How we fixed it:** We manually added more pods to the Kong service. Ingestion, the WebUI, and API calls began to work normally again. Once ingestion had resumed, LogDNA agents running on customer environments resent all locally cached logs to our service for ingestion. No data was lost.
    **What we are doing to prevent it from happening again:** We updated Kubernetes to always assign enough pods for the Kong API gateway service to be able to handle all traffic. We’ll update the Kong gateway to more evenly distribute ingestion traffic across available pods. We will adjust our deployment processes so pods are restarted more slowly, which will reduce the impact in a similar scenario. We’ll explore autoscaling policies so more pods could be added automatically in a similar situation.
  • Time: Oct. 5, 2022, 4:05 p.m.
    Status: Resolved
    Update: This incident has been resolved. All services are fully operational.
  • Time: Oct. 5, 2022, 2:58 p.m.
    Status: Monitoring
    Update: Service is restored but we are still monitoring.
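
The remediation described above was to run more Kong gateway pods so ingestion traffic had enough capacity, with autoscaling listed as a follow-up. As an illustration only (the Deployment name, namespace, and replica count below are made up), bumping a Deployment's replica count through the Kubernetes API can be sketched as:

```python
from kubernetes import client, config

# Illustrative sketch only: manually scaling out a gateway Deployment, the
# kind of step the postmortem describes. Object names and the replica count
# are placeholders, not Mezmo's actual configuration.
config.load_kube_config()
apps = client.AppsV1Api()

apps.patch_namespaced_deployment_scale(
    name="kong-gateway",   # hypothetical Deployment name
    namespace="example",
    body={"spec": {"replicas": 12}},  # raise replicas to absorb the extra ingestion traffic
)
```

A HorizontalPodAutoscaler targeting the same Deployment would be the automated version of this step, which matches the autoscaling follow-up mentioned in the postmortem.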

Check the status of similar companies and alternatives to Mezmo

Hudl: Systems Active
OutSystems: Systems Active
Postman: Systems Active
Mendix: Systems Active
DigitalOcean: Systems Active
Bandwidth: Systems Active
DataRobot: Systems Active
Grafana Cloud: Systems Active
SmartBear Software: Systems Active
Test IO: Systems Active
Copado Solutions: Systems Active
CircleCI: Systems Active

Frequently Asked Questions - Mezmo

Is there a Mezmo outage?
The current status of Mezmo is: Systems Active
Where can I find the official status page of Mezmo?
The official status page for Mezmo is here: https://status.mezmo.com
How can I get notified if Mezmo is down or experiencing an outage?
To get notified of any status changes to Mezmo, simply sign up for OutLogger's free monitoring service. OutLogger checks the official status of Mezmo every few minutes and will notify you of any changes. You can view the status of all your cloud vendors in one dashboard. Sign up here
What does Mezmo do?
Mezmo (formerly LogDNA) is a cloud-based log management and observability platform that helps application owners centralize, analyze, and route log data across their services.