Last checked: 7 minutes ago
Get notified about any outages, downtime, or incidents for Mezmo and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for Mezmo.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for Mezmo:
Component | Status |
---|---|
Log Analysis | Active |
Alerting | Active |
Archiving | Active |
Livetail | Active |
Log Ingestion (Agent/REST API/Code Libraries) | Active |
Log Ingestion (Heroku) | Active |
Log Ingestion (Syslog) | Active |
Search | Active |
Web App | Active |
Pipeline | Active |
Destinations | Active |
Ingestion / Sources | Active |
Processors | Active |
Web App | Active |
View the latest incidents for Mezmo and check for official updates:
Description:
**Dates:**
Start Time: Tuesday, February 8, 2022, at 13:17 UTC
End Time: Tuesday, February 8, 2022, at 14:21 UTC
Duration: 1:04:00

**What happened:** Our Web UI was unresponsive for about 10 minutes. Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines. No data was lost and ingestion was not halted.

**Why it happened:** Our Redis database had a failover, and the services that depend on it, including the Parser, were unable to reconnect after it recovered. The Parser is upstream of many other services, so newly submitted logs were not passed on to downstream services such as Alerting, Live Tail, Searching, Graphing, and Timelines. The Web UI was also intermittently unavailable because it requires a connection to Redis.

**How we fixed it:** We manually restarted the Redis service, which allowed a new master to be elected. After Redis recovered, we restarted the Parser, Web UI, and other services, which were then able to reestablish a connection to Redis. This restored the Web UI and allowed newly submitted logs to flow from our Parser service to all downstream services. Over a short period of time, these services processed the backlog of logs and newly submitted logs were again available without delays.

**What we are doing to prevent it from happening again:** We recently added functionality to track the flow rate of newly submitted logs. This new feature requires more memory than expected in the event of a Redis failover, which is why services could not reconnect to Redis. We have increased the limits of the memory buffer for the relevant portions of our service. We will also add additional Redis monitoring to more quickly detect unhealthy sentinels, and we continue to work on an ongoing project to make all services more tolerant of Redis failovers.
Status: Postmortem
Impact: None | Started At: Feb. 8, 2022, 1:38 p.m.
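The mitigation above mentions adding Redis monitoring to detect unhealthy sentinels more quickly. As a rough illustration only, not Mezmo's actual tooling, a check along those lines could query each sentinel individually and flag any that can no longer resolve the master. The sketch below uses the redis-py client; the sentinel addresses and the master name "mymaster" are placeholder assumptions.

```python
# Minimal sketch of a Redis Sentinel health check (illustrative only).
# Assumptions: redis-py is installed, sentinel hosts and the master name
# "mymaster" are placeholders for a real deployment.
from redis.sentinel import Sentinel, MasterNotFoundError

SENTINEL_HOSTS = [("sentinel-0", 26379), ("sentinel-1", 26379), ("sentinel-2", 26379)]

def check_sentinels(master_name: str = "mymaster") -> list[str]:
    """Return human-readable problems; an empty list means all sentinels look healthy."""
    problems = []
    for host, port in SENTINEL_HOSTS:
        try:
            # Query one sentinel at a time so an unhealthy node can't hide behind its peers.
            sentinel = Sentinel([(host, port)], socket_timeout=0.5)
            addr = sentinel.discover_master(master_name)
            print(f"{host}:{port} reports master at {addr}")
        except MasterNotFoundError:
            problems.append(f"{host}:{port} cannot resolve master '{master_name}'")
        except Exception as exc:  # connection refused, timeout, etc.
            problems.append(f"{host}:{port} unreachable: {exc}")
    return problems

if __name__ == "__main__":
    for problem in check_sentinels():
        print("ALERT:", problem)
```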
Description:
**Dates:**
Start Time: Wednesday, January 26, 2022, at 15:45:00 UTC
End Time: Wednesday, January 26, 2022, at 16:30:00 UTC
Duration: 00:45:00

**What happened:** Ingestion was halted and newly submitted logs were not immediately available for Alerting, Live Tail, Searching, Graphing, and Timelines. Some alerts were never triggered. Once ingestion had resumed, LogDNA agents running in customer environments resent all locally cached logs to our service for ingestion. No data was lost.

**Why it happened:** Our Redis database had a failover and the services that depend on it were unable to recover automatically. Normally, the pods running our ingestion service deliberately crash until they are able to access Redis again. However, these pods were in a bad state and unable to reconnect when Redis returned. Since ingestion was halted, newly submitted logs were not passed on to downstream services such as Alerting, Live Tail, Searching, Graphing, and Timelines.

**How we fixed it:** We manually restarted all the pods of our ingestion service, then restarted all the sentinel pods of Redis. The ingestion service became operational again and logs were passed on to all downstream services. Over a short period of time, these services processed the backlog of logs and newly submitted logs were again available without delays.

**What we are doing to prevent it from happening again:** The ingestion pods were in a bad state because they had not been restarted after a configuration change made several days earlier, for reasons unrelated to this incident. The runbook for making such configuration changes has been updated to prevent this procedural failure in the future. We are also in the middle of a project to make all services more tolerant of Redis failovers.
Status: Postmortem
Impact: Critical | Started At: Jan. 26, 2022, 4:10 p.m.
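The postmortem above notes that the ingestion pods normally crash on purpose until Redis is reachable again, handing recovery to the orchestrator's restart loop. The sketch below illustrates that general fail-fast pattern under stated assumptions (redis-py and a hypothetical REDIS_URL environment variable); it is not Mezmo's implementation.

```python
# Minimal sketch of a fail-fast startup check (illustrative only).
# Assumptions: redis-py is installed; REDIS_URL is a hypothetical env var.
import os
import sys

import redis

def require_redis() -> redis.Redis:
    url = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
    client = redis.Redis.from_url(url, socket_connect_timeout=2)
    try:
        client.ping()
    except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError) as exc:
        # Crashing here, rather than limping along, lets the orchestrator
        # (e.g. Kubernetes) restart the pod until Redis is reachable again,
        # which is the behavior the postmortem describes.
        print(f"Redis unreachable ({exc}); exiting so the pod is restarted", file=sys.stderr)
        sys.exit(1)
    return client

if __name__ == "__main__":
    r = require_redis()
    print("connected:", r.ping())
```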
Description:
**Dates:**
Start Time: Thursday, January 20, 2022, at 19:13:00 UTC
End Time: Thursday, January 20, 2022, at 21:24:00 UTC
Duration: 02:11:00

**What happened:** Ingestion was halted and our Web UI was unresponsive for about 5-10 minutes. Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines.

**Why it happened:** Our service hosting provider, Equinix Metal, had an outage caused by the failure of one of their main switches (more details at [https://status.equinixmetal.com/incidents/gjmh37y6rkjp](https://status.equinixmetal.com/incidents/gjmh37y6rkjp)). The outage impacted traffic and global network connectivity to the LogDNA service. During the Equinix Metal incident, Ingestion, Alerting, and Live Tail were halted and our Web UI was unresponsive for 5-10 minutes. Multiple Elasticsearch (ES) clusters went into an unhealthy state, which delayed newly submitted logs from being immediately available for Searching, Graphing, and Timelines for about one hour.

**How we fixed it:** No remedial action by LogDNA was possible; we waited until the incident was resolved by Equinix Metal, our service hosting provider. The ES clusters were repaired and the backlog of newly submitted logs was processed in about one hour.

**What we are doing to prevent it from happening again:** For this type of incident, LogDNA cannot take proactive preventive measures.
Status: Postmortem
Impact: None | Started At: Jan. 20, 2022, 7:57 p.m.
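Recovery in the incident above hinged on the Elasticsearch clusters returning to a healthy state and working through the backlog. As an illustration of how that kind of recovery can be watched from the outside, the sketch below polls Elasticsearch's standard _cluster/health API until the cluster reports green; the endpoint URL is a placeholder assumption, not a Mezmo endpoint.

```python
# Minimal sketch of watching an Elasticsearch cluster recover (illustrative only).
# Assumptions: the requests library is installed; ES_URL is a placeholder endpoint.
import time

import requests

ES_URL = "http://localhost:9200"  # hypothetical cluster endpoint

def wait_for_green(timeout_s: int = 3600, poll_s: int = 30) -> bool:
    """Poll _cluster/health until the status is 'green' or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            health = requests.get(f"{ES_URL}/_cluster/health", timeout=5).json()
            status = health.get("status")
            print(f"status={status} relocating={health.get('relocating_shards')} "
                  f"unassigned={health.get('unassigned_shards')}")
            if status == "green":
                return True
        except requests.RequestException as exc:
            print(f"cluster unreachable: {exc}")
        time.sleep(poll_s)
    return False

if __name__ == "__main__":
    print("cluster recovered" if wait_for_green() else "cluster still unhealthy")
```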
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.