Get notified about any outages, downtime or incidents for Mezmo and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Mezmo.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now

OutLogger tracks the status of these components for Mezmo:
Component | Status |
---|---|
Log Analysis | Active |
Alerting | Active |
Archiving | Active |
Livetail | Active |
Log Ingestion (Agent/REST API/Code Libraries) | Active |
Log Ingestion (Heroku) | Active |
Log Ingestion (Syslog) | Active |
Search | Active |
Web App | Active |
Pipeline | Active |
Destinations | Active |
Ingestion / Sources | Active |
Processors | Active |
Web App | Active |
View the latest incidents for Mezmo and check for official updates:
Description: **Dates:** Start Time: Monday, May 1, 2023, at 19:55 UTC. End Time: Monday, May 1, 2023, at 20:11 UTC. Duration: 16 minutes.
**What happened:** The WebUI was unresponsive, returning an error of “failure to get a peer from the ring-balancer.”
**Why it happened:** All Mezmo services run within a service mesh. The portion of the mesh dedicated to the pods running our Mongo database began receiving more connection requests than its allocated CPU and memory could handle at once. This portion of the mesh (which itself runs on pods) quickly ran out of memory, making the Mongo database unavailable to other services. The WebUI relies entirely on Mongo for account information and therefore became unresponsive, returning an error of “failure to get a peer from the ring-balancer.” While the immediate reason for the incident is clear, the root cause is still unknown. We suspect there was a change in user usage patterns (e.g. increased traffic, login attempts, etc.) which triggered the incident.
**How we fixed it:** We removed the WebUI from the service mesh. The Mongo service has more CPU and memory resources allocated to it and was able to accept the high level of connection requests successfully. WebUI usage immediately returned to normal.
**What we are doing to prevent it from happening again:** We will permanently change the default settings for the service mesh to allocate more CPU and memory resources. Afterwards, we will add the Mongo service back to the service mesh.
Status: Postmortem
Impact: Critical | Started At: May 1, 2023, 8:18 p.m.
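The remediation described above (giving the service-mesh layer more CPU and memory) usually comes down to raising the resource requests and limits on the mesh's proxy workload. Mezmo has not published which mesh or what values it uses, so the deployment name, namespace, and figures below are purely illustrative; this is a minimal sketch using the official Kubernetes Python client.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (inside a cluster you would
# call config.load_incluster_config() instead).
config.load_kube_config()
apps = client.AppsV1Api()

# Hypothetical names and numbers: "mesh-proxy", the "service-mesh" namespace,
# and the CPU/memory values are assumptions, not Mezmo's real configuration.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "mesh-proxy",
                        "resources": {
                            "requests": {"cpu": "500m", "memory": "512Mi"},
                            "limits": {"cpu": "1", "memory": "1Gi"},
                        },
                    }
                ]
            }
        }
    }
}

# Strategic-merge patch: only the named container's resources are changed.
apps.patch_namespaced_deployment(
    name="mesh-proxy", namespace="service-mesh", body=patch
)
```

Raising requests and limits this way triggers a rolling restart of the proxy pods, which is why such a change is typically made as a deliberate, permanent default rather than an emergency tweak.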
Description: **Dates:** Start Time: Friday, February 10, 2023, at 16:45 UTC. End Time: Friday, February 10, 2023, at 18:14 UTC. Duration: 89 minutes.
**What happened:** Searches returned results slowly or not at all. No data was lost and ingestion was not halted.
**Why it happened:** In a previous incident on February 6, 2023 (more details at https://status.mezmo.com/incidents/3yl9x1t7qcw5), two pods storing logs were temporarily removed from the pool of pods available for receiving and inserting new batches of logs into our data store. The pods continued to return results for previously processed logs. We took this step because the pods had fallen behind on their tasks, which we believe was a consequence of an ungraceful shutdown during the incident. We gave the pods several days to catch up on tasks and then made them available for insertion of new logs into the data store again, a change we expected to have no impact. One of the pods immediately began integrity checks to confirm the same data existed on its local disk and on our S3 storage. As a side effect of the previous incident, the pod incorrectly determined that data was missing from the local disk and began sending HTTP requests to our S3 storage to locate the missing data. In fact, the data in question is designed to reside only on local disk and was not supposed to be stored on S3. The requests failed with 404 errors when the data was not found on S3 (as expected). Every new attempt to retrieve search results generated another request. The rate of requests was high enough to slow down all requests related to search results within the pod’s zone (one of three total). This led to search results being returned slowly or not at all.
**How we fixed it:** We removed the pod from the pool available for receiving and inserting new batches of logs into our data store. The pod continued to return results for previously processed logs.
**What we are doing to prevent it from happening again:** We marked this pod to remain unavailable for new logs until all previously processed logs on the pod have passed their retention period, whose maximum is 30 days. At that time, the pod will be rebuilt and begin accepting newly submitted logs again. We’ll fix the logic of our search engine so it doesn’t request data from S3 that is intentionally not stored there. This will prevent the widespread 404 errors that slowed down all searching, should a pod again incorrectly determine it is missing data from its local disk. We have added alerting and monitoring to detect high latency in search speeds and the average time to compact newly inserted logs.
Status: Postmortem
Impact: Major | Started At: Feb. 10, 2023, 6:11 p.m.
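One of the follow-ups above is to stop the search engine from asking S3 for data that is only ever kept on local disk. Conceptually that is a guard in the lookup path based on where a piece of data is supposed to live. The sketch below is a hypothetical illustration of that guard; the class names, tiers, and store objects are assumptions for the example, not Mezmo's actual code.

```python
from enum import Enum


class StorageTier(Enum):
    LOCAL_ONLY = "local_only"        # lives only on the pod's local disk
    S3_REPLICATED = "s3_replicated"  # also archived to object storage


def locate_segment(segment_id: str, tier: StorageTier, local_store, s3_store):
    """Return a log segment, consulting S3 only when the tier allows it."""
    data = local_store.get(segment_id)
    if data is not None:
        return data

    if tier is StorageTier.LOCAL_ONLY:
        # Without this guard, a pod that wrongly believes local data is
        # missing falls through to S3 and generates a flood of 404s.
        # Instead, flag the segment for a local rebuild and stop.
        raise LookupError(f"segment {segment_id} missing locally; schedule rebuild")

    # Replicated data may legitimately be fetched from object storage.
    return s3_store.get(segment_id)
```

With a tier check like this, an incorrect local integrity verdict degrades only the affected segment instead of slowing every search request in the zone.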
Description: **Dates:** Start Time: Monday, February 6, 2023, at 20:05 UTC. End Time: Tuesday, February 7, 2023, at 00:30 UTC. Duration: 4 hours and 25 minutes.
**What happened:** Searches returned results slowly or not at all. Our Web UI was intermittently unresponsive, particularly for pages like Live Tail, Graphing, and Timelines. No data was lost and ingestion was not halted.
**Why it happened:** We initiated an upgrade of all nodes in our service, including the nodes that store logs. Pods were gradually moved to other nodes and restarted so as to prevent any interruption in service. A single pod that stores logs did not restart normally. Upon investigation, we found that it had not shut down cleanly and that some files essential to a normal startup had not been written to disk. More significantly, we discovered that all nodes that store logs were using a podManagementPolicy of “OrderedReady” (the default setting). This forced pods to restart in an ordered sequence. The single pod that would not restart was in the middle of the sequence; all the pods later in the sequence followed the policy and did not start either. In effect, about 25% of the pods within one zone (out of the three zones devoted to storing logs) were unable to start. The remaining pods in the zone were forced to take on extra work, such as accepting new logs, compacting data, and answering queries from our internal APIs. This led to slow searches and slow load times for any part of the Web UI that displays data about logs.
**How we fixed it:** We temporarily added more pods to run API calls to increase the odds of them succeeding. We changed the podManagementPolicy to “Parallel” to allow all pods to restart, regardless of their position in the ordered startup sequence. We made manual edits to the pod that had not restarted cleanly so it could start again. These steps brought search latency back to normal speeds and made API calls work again. We cordoned off two pods that had fallen far behind in processing to allow them to recover without taking on new tasks. This temporarily removed ~2% of logs from all search results. When these pods had caught up with all pending tasks, we made them available again for search queries.
**What we are doing to prevent it from happening again:** We have changed the podManagementPolicy to “Parallel” for all nodes that store logs. We will review the podManagementPolicy of all other areas of our service and make changes where appropriate. We will add alerting and monitoring to detect high latency in search speeds and the average time to compact newly inserted logs. We’ll explore options for adding more resources to each zone of pods, so they are less likely to fall behind on processing tasks when some pods are unavailable. We’ll explore ways to prevent unclean shutdowns of pods when nodes are upgraded.
Status: Postmortem
Impact: Major | Started At: Feb. 6, 2023, 8:48 p.m.
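`podManagementPolicy` is a field on a Kubernetes StatefulSet: the default `OrderedReady` starts and stops pods strictly in sequence, while `Parallel` lets them come up independently, so one stuck pod cannot block the rest. Below is a minimal sketch of such a spec using the Kubernetes Python client; the names, image, and replica count are assumptions, not Mezmo's configuration. Note that the field is immutable on an existing StatefulSet, so changing it means recreating the object (typically deleting it with `--cascade=orphan` so the pods keep running, then reapplying).

```python
from kubernetes import client

# Hypothetical log-storage StatefulSet; "log-store" and the image are assumed.
log_store_spec = client.V1StatefulSetSpec(
    service_name="log-store",
    replicas=12,
    # "Parallel" lets every pod start or terminate at once; with the default
    # "OrderedReady", a single pod that fails to start blocks all later pods.
    pod_management_policy="Parallel",
    selector=client.V1LabelSelector(match_labels={"app": "log-store"}),
    template=client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "log-store"}),
        spec=client.V1PodSpec(
            containers=[
                client.V1Container(
                    name="log-store",
                    image="registry.example.com/log-store:stable",
                )
            ]
        ),
    ),
)

statefulset = client.V1StatefulSet(
    api_version="apps/v1",
    kind="StatefulSet",
    metadata=client.V1ObjectMeta(name="log-store", namespace="logs"),
    spec=log_store_spec,
)
```

The trade-off Mezmo accepted here is that `Parallel` gives up the strict startup ordering guarantee in exchange for resilience to a single unhealthy pod, which fits workloads whose pods do not depend on one another at boot.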
Description: **Dates:** Start Time: Wednesday, October 5, 2022, at 14:27 UTC. End Time: Wednesday, October 5, 2022, at 14:45 UTC. Duration: 18 minutes.
**What happened:** The ingestion of logs was partially halted. The WebUI was mostly unresponsive and most API calls failed. Because many newly submitted logs were not being ingested, new logs were not immediately available for Alerting, Searching, Live Tail, Graphing, Timelines, and Archiving.
**Why it happened:** We recently added a new API gateway, Kong, to our service; it acts as a proxy for all other services. We had gradually increased the amount of traffic directed through the API gateway over several weeks and seen no ill effects. Prior to the incident, only some of the traffic for ingestion went through the gateway. Kong was restarted after a routine configuration change. After the restart, all traffic for our ingestion service began to go through Kong. Our monitoring quickly revealed the Kong service did not have enough pods to keep up with the increased workload, causing many requests to fail.
**How we fixed it:** We manually added more pods to the Kong service. Ingestion, the WebUI, and API calls began to work normally again. Once ingestion had resumed, LogDNA agents running in customer environments resent all locally cached logs to our service for ingestion. No data was lost.
**What we are doing to prevent it from happening again:** We updated Kubernetes to always assign enough pods for the Kong API gateway service to handle all traffic. We’ll update the Kong gateway to more evenly distribute ingestion traffic across available pods. We will adjust our deployment processes so pods are restarted more slowly, which will reduce the impact in a similar scenario. We’ll explore autoscaling policies so more pods could be added automatically in a similar situation.
Status: Postmortem
Impact: Minor | Started At: Oct. 5, 2022, 2:58 p.m.
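The last two follow-ups above (guaranteeing enough gateway pods and exploring autoscaling) map naturally onto a Kubernetes HorizontalPodAutoscaler targeting the Kong proxy deployment. The sketch below, again with the Kubernetes Python client, shows the idea; the deployment name, namespace, and thresholds are assumptions rather than Mezmo's actual settings.

```python
from kubernetes import client, config

config.load_kube_config()
autoscaling = client.AutoscalingV1Api()

# Hypothetical HPA: keep CPU around 60% by scaling the Kong proxy deployment
# between a guaranteed floor of 6 pods and a ceiling of 30.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="kong-proxy", namespace="gateway"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1",
            kind="Deployment",
            name="kong-proxy",
        ),
        min_replicas=6,
        max_replicas=30,
        target_cpu_utilization_percentage=60,
    ),
)

autoscaling.create_namespaced_horizontal_pod_autoscaler(
    namespace="gateway", body=hpa
)
```

The `min_replicas` floor covers the "always assign enough pods" commitment, while the scaling range addresses sudden traffic shifts like the one that followed the Kong restart in this incident.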