Last checked: 7 minutes ago
Get notified about any outages, downtime or incidents for Mezmo and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Mezmo.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Mezmo:
Component | Status |
---|---|
Log Analysis | Active |
Alerting | Active |
Archiving | Active |
Livetail | Active |
Log Ingestion (Agent/REST API/Code Libraries) | Active |
Log Ingestion (Heroku) | Active |
Log Ingestion (Syslog) | Active |
Search | Active |
Web App | Active |
Pipeline | Active |
Destinations | Active |
Ingestion / Sources | Active |
Processors | Active |
Web App | Active |
View the latest incidents for Mezmo and check for official updates:
Description: The Pipeline UI is now fully functional.
Status: Resolved
Impact: Critical | Started At: Oct. 26, 2024, 1 a.m.
Description: **Dates:** Start Time: Monday, December 4, 2023, at 10:29 UTC. End Time: Monday, December 4, 2023, at 12:01 UTC. Duration: 92 minutes.

**What happened:** Web UI users were logged out frequently, usually within 1-2 minutes of logging in. Users could log in again without any issues, but the new session would expire shortly afterwards.

**Why it happened:** Both the Web UI pods and the Redis database pods, which are responsible for storing user sessions, experienced a critical memory shortage, leading to uncontrolled data purging. When the same issue happened in July 2023, our engineering team deployed a fix that improved how Redis stores user session keys. That fix prevented any recurrence of the problem until today. The team is still determining what caused the memory limit to be exceeded this time.

**How we fixed it:** Initially, the Web UI pods were restarted, but that did not resolve the problem permanently. The engineering team then restarted the Redis database pods, and sessions stopped expiring prematurely.

**What we are doing to prevent it from happening again:** The team will revise the previous fix, including implementing a mechanism for the pod to automatically restart upon reaching its memory limit and setting up alerts to notify an engineer when usage approaches that threshold. (An illustrative monitoring sketch follows this entry.)
Status: Postmortem
Impact: Minor | Started At: Dec. 4, 2023, 12:06 p.m.
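The mitigation described in the postmortem above (automatic restarts at the memory limit, alerts when usage approaches it) can be illustrated with a minimal sketch. This is an assumption-laden example using the redis-py client, not Mezmo's actual tooling; the threshold, host, and polling interval are invented for illustration.

```python
# Hypothetical sketch: poll Redis memory usage and raise an alert when the
# session store approaches its configured limit. Values are illustrative.
import time

import redis  # redis-py client

ALERT_THRESHOLD = 0.85  # alert at 85% of maxmemory (assumed value)


def check_session_store(client: redis.Redis) -> None:
    info = client.info("memory")
    used = info["used_memory"]
    limit = info.get("maxmemory", 0)
    if limit and used / limit >= ALERT_THRESHOLD:
        # In a real setup this would page an engineer or trigger a pod restart
        # rather than print to stdout.
        print(f"WARNING: session store at {used / limit:.0%} of maxmemory")


if __name__ == "__main__":
    r = redis.Redis(host="localhost", port=6379)
    while True:
        check_session_store(r)
        time.sleep(60)  # poll once per minute
```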
Description: **Dates:** Start Time: 8:32 pm UTC, Tuesday, August 29th, 2023. End Time: 10:04 pm UTC, Tuesday, August 29th, 2023. Duration: 92 minutes.

**What happened:** Our Kong Gateway service stopped functioning and all connection requests to our ingestion service and web service failed. The Web UI did not load, and log lines could not be sent by either our agent or API. Log lines sent using syslog were unaffected. Kong was unavailable for two periods: one lasting 27 minutes (8:32 pm UTC to 8:59 pm UTC) and another lasting 9 minutes (9:43 pm UTC to 9:52 pm UTC). Once Kong became available, the Web UI was immediately accessible again. Agents resent locally cached log lines, as did any API clients implemented with retry strategies (an illustrative retry sketch follows this entry). Our service then processed the backlog of log lines, passing them to downstream services such as alerting, live tail, archiving, and indexing (which makes lines visible in the Web UI for searching, graphing, and timelines). The extra processing was completed ~20 minutes after Kong returned to normal usage the first time, and ~10 minutes after the second time.

**Why it happened:** The pods running our Kong Gateway were overwhelmed with connection requests. CPU usage increased to the point that health checks started to fail and the pods were shut down. We determined through research and experimentation that the cause was a sudden, brief increase in the volume of traffic directed to our service. Our service is designed to handle increases in traffic, but these were approximately 100 times above normal usage. The source(s) of the traffic are unknown. The increase came in two spikes, which correspond to the two periods when Kong became unavailable.

**How we fixed it:** We manually scaled up the number of pods devoted to running our Kong Gateway. During the first spike of traffic, we doubled the number of pods; during the second, we quadrupled it. This sped up the processing of the backlog of log lines sent by agents once Kong was again available. It's unclear whether the higher number of pods would have been able to process the spikes of traffic as they were happening.

**What we are doing to prevent it from happening again:** We are running our Kong service with more pods so there are more resources to handle similar spikes in traffic. We will add auto-scaling to the Kong service so more pods are made available automatically as needed. We'll also add metrics to identify the origin of any similar spikes in traffic.
Status: Postmortem
Impact: Major | Started At: Aug. 29, 2023, 9:01 p.m.
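The postmortem above notes that agents resent locally cached log lines and that API clients with retry strategies recovered automatically once Kong returned. A minimal sketch of that client-side pattern, assuming a generic HTTP ingestion endpoint (the URL and payload shape are placeholders, not Mezmo's API):

```python
# Hypothetical sketch: buffer log lines locally and retry with exponential
# backoff while the ingestion endpoint is temporarily unavailable.
import time
from collections import deque

import requests

INGEST_URL = "https://logs.example.com/ingest"  # placeholder endpoint
buffer: deque[str] = deque()


def send_with_retry(line: str, max_attempts: int = 5) -> bool:
    buffer.append(line)  # cache locally first, as an agent would
    delay = 1.0
    for _ in range(max_attempts):
        try:
            resp = requests.post(INGEST_URL, json={"lines": list(buffer)}, timeout=5)
            if resp.status_code == 200:
                buffer.clear()  # flush the backlog once the gateway is back
                return True
        except requests.RequestException:
            pass  # gateway unreachable; keep the lines buffered
        time.sleep(delay)
        delay *= 2  # exponential backoff between attempts
    return False  # give up for now; lines stay in the local buffer
```

Exponential backoff also keeps retrying clients from adding to the connection spike while the gateway recovers.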
Description: **Dates:** Start Time: Monday, June 19, 2023, at 10:31 UTC. End Time: Monday, June 19, 2023, at 12:35 UTC. Duration: 124 minutes.

**What happened:** Users were being logged out of our Web UI frequently, within 1-2 minutes of logging in. Users could log in again successfully, but the new session would also expire quickly.

**Why it happened:** The cache of logged-in users held in our Redis database was being cleared every 1-2 minutes. This caused all user sessions to expire and new logins to be required. We have yet to ascertain why the cache was being cleared at such frequent intervals.

**How we fixed it:** We restarted the pods running the Redis database and the cache behavior returned to normal.

**What we are doing to prevent it from happening again:** We will investigate further to learn why the Redis cache was being frequently cleared. (An illustrative detection sketch follows this entry.)
Status: Postmortem
Impact: Minor | Started At: June 19, 2023, 11:09 a.m.
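Since the root cause of the cache clearing is still under investigation, one hedged way to detect a recurrence is to watch the session key count and flag sudden drops. A minimal Python sketch, assuming sessions live in a dedicated Redis database; the poll interval and drop ratio are invented for the example:

```python
# Hypothetical sketch: watch the number of keys in the Redis session database
# and flag a sudden drop, which would suggest the session cache was cleared.
import time

import redis

FLUSH_DROP_RATIO = 0.5  # flag if more than half the keys disappear between polls


def watch_session_keys(client: redis.Redis, interval: int = 30) -> None:
    previous = client.dbsize()
    while True:
        time.sleep(interval)
        current = client.dbsize()
        if previous and current < previous * FLUSH_DROP_RATIO:
            # In practice this would alert on-call rather than print.
            print(f"Possible cache flush: keys dropped from {previous} to {current}")
        previous = current


if __name__ == "__main__":
    watch_session_keys(redis.Redis(host="localhost", port=6379, db=0))
```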
Description: **Dates:** Start Time: Monday, May 1, 2023, at 19:55 UTC. End Time: Monday, May 1, 2023, at 20:11 UTC. Duration: 16 minutes.

**What happened:** The Web UI was unresponsive, returning the error "failure to get a peer from the ring-balancer."

**Why it happened:** All Mezmo services run within a service mesh. The portion of the mesh dedicated to the pods running our Mongo database began receiving many connection requests, more than its allocated CPU and memory could handle at once. This portion of the mesh (which itself runs on pods) quickly ran out of memory, which made the Mongo database unavailable to other services. The Web UI relies entirely on Mongo for account information and therefore became unresponsive, returning the error "failure to get a peer from the ring-balancer." While the immediate reason for the incident is clear, the root cause is still unknown. We suspect there was a change in user usage patterns (e.g. increased traffic, login attempts, etc.) that triggered the incident.

**How we fixed it:** We removed the Web UI from the service mesh. The Mongo service itself has more CPU and memory resources allocated to it and was able to accept the high level of connection requests successfully. Web UI usage immediately returned to normal.

**What we are doing to prevent it from happening again:** We will permanently change the default settings for the service mesh to allocate more CPU and memory resources. Afterwards, we will add the Mongo service back to the service mesh. (An illustrative connection-pool sketch follows this entry.)
Status: Postmortem
Impact: Critical | Started At: May 1, 2023, 8:18 p.m.
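The incident above stemmed from more concurrent Mongo connection requests than the mesh sidecars could absorb. One common way to bound that pressure on the client side is a capped connection pool with a short server-selection timeout. A minimal sketch with pymongo; the URI, pool size, and database/collection names are assumptions, not Mezmo's configuration:

```python
# Hypothetical sketch: cap concurrent MongoDB connections from a web tier and
# fail fast when the database is unreachable, instead of piling up requests.
from typing import Optional

from pymongo import MongoClient
from pymongo.errors import ServerSelectionTimeoutError

client = MongoClient(
    "mongodb://localhost:27017",
    maxPoolSize=50,                 # cap concurrent connections from this process
    serverSelectionTimeoutMS=2000,  # fail fast rather than hang on a dead backend
)


def load_account(account_id: str) -> Optional[dict]:
    try:
        return client.app.accounts.find_one({"_id": account_id})
    except ServerSelectionTimeoutError:
        # Surface a clear error to the Web UI instead of blocking the request.
        return None
```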