Outage and incident data over the last 30 days for Mezmo.
OutLogger tracks the status of these components for Mezmo:
Component | Status |
---|---|
Log Analysis | Active |
Alerting | Active |
Archiving | Active |
Livetail | Active |
Log Ingestion (Agent/REST API/Code Libraries) | Active |
Log Ingestion (Heroku) | Active |
Log Ingestion (Syslog) | Active |
Search | Active |
Web App | Active |
Pipeline | Active |
Destinations | Active |
Ingestion / Sources | Active |
Processors | Active |
Web App | Active |
View the latest incidents for Mezmo and check for official updates:
Description:
**Dates:** Start Time: Friday, February 18, 2022, at 00:10 UTC. End Time: Thursday, February 24, 2022, at 23:43 UTC. Duration: 167:33:00.
**What happened:** The ingestion of new logs to our Syslog endpoint was intermittently failing.
**Why it happened:** We recently introduced a new service (Syslog Forwarder) to handle the ingestion of logs sent over Syslog. As the name implies, it forwards logs to downstream services. It was designed to send all logs submitted for each account to a single port opened on the downstream services. No load balancing was implemented in our original design, which performed well in our advance testing. Once put into production, however, it became apparent that some customer accounts submit logs at a volume higher than the downstream services could process. When this happened, log lines were buffered in memory by the Syslog Forwarder. Memory increased until the pods crashed. Any log lines held on those pods were lost and never ingested.
**How we fixed it:** We improved the design of the Syslog Forwarder by adding a pool of connections to the downstream services. In effect, we added traffic shaping to the Syslog Forwarder.
**What we are doing to prevent it from happening again:** The new architecture has been incorporated and proven resilient in production. No further work is needed to prevent this kind of incident from happening again.
Status: Postmortem
Impact: Major | Started At: Feb. 24, 2022, 10:45 p.m.
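Mezmo has not published the Syslog Forwarder code, but the fix described above (a pool of downstream connections plus traffic shaping) follows a common pattern. The Go sketch below is purely illustrative and assumes a hypothetical `pool` type, round-robin distribution across downstream addresses, and a bounded buffer so that a slow downstream applies backpressure instead of letting memory grow until the pod crashes.

```go
// Illustrative sketch only: a forwarder that spreads log lines across a
// small pool of downstream TCP connections instead of a single one, and
// uses a bounded channel so a slow downstream applies backpressure
// rather than buffering lines in memory without limit.
package forwarder

import (
	"fmt"
	"net"
)

type pool struct {
	conns []net.Conn
	next  int
	lines chan string // bounded: senders block when downstreams lag
}

func newPool(addrs []string, buffer int) (*pool, error) {
	p := &pool{lines: make(chan string, buffer)}
	for _, a := range addrs {
		c, err := net.Dial("tcp", a)
		if err != nil {
			return nil, err
		}
		p.conns = append(p.conns, c)
	}
	go p.forward()
	return p, nil
}

// forward drains the bounded buffer, rotating round-robin over the pool.
func (p *pool) forward() {
	for line := range p.lines {
		c := p.conns[p.next%len(p.conns)]
		p.next++
		fmt.Fprintln(c, line)
	}
}

// Submit blocks when the buffer is full, shaping inbound traffic to the
// rate the downstream services can actually absorb.
func (p *pool) Submit(line string) { p.lines <- line }
```

The key design choice is the bounded channel: when downstream services fall behind, `Submit` blocks instead of accumulating lines in memory, which is roughly what "traffic shaping" means in the postmortem above.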
Description:
**Dates:** Start Time: Thursday, February 17, 2022, at 20:56 UTC. End Time: Friday, February 18, 2022, at 02:15 UTC. Duration: 5:19:00.
**What happened:** The ingestion of new logs to our Syslog endpoint was intermittently failing.
**Why it happened:** We made a code change to the area of our service (Syslog Forwarder) that handles the ingestion of logs sent by Syslog and inadvertently changed how memory is managed. Routine memory garbage collection stopped, and memory usage increased on the pods that accept newly submitted log lines over Syslog. Eventually, the increase in memory caused the pods to crash. Any log lines held on those pods were lost and never ingested.
**How we fixed it:** We reverted to the previous version of the Syslog Forwarder service. This stopped the pods from crashing. We then resolved the memory management issue in our code. The new, fixed version was released to production shortly thereafter and performed as expected.
**What we are doing to prevent it from happening again:** We have added regression tests to the Syslog Forwarder service to prevent a similar mistake in the future.
Status: Postmortem
Impact: Major | Started At: Feb. 18, 2022, 1:41 a.m.
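The regression tests Mezmo added are not public. As a loose illustration only, a memory-growth guard of the following shape could catch the class of bug described above, where a change to buffering or garbage-collection behaviour makes heap usage climb while lines are ingested. `ingestLine` is a hypothetical stand-in for the real ingestion path, and the 64 MiB bound is arbitrary.

```go
// Illustrative sketch only: a regression test that processes many log
// lines and fails if heap memory grows far beyond a fixed bound, the
// kind of guard that would catch an accidental change to how buffered
// lines are retained.
package forwarder

import (
	"runtime"
	"strings"
	"testing"
)

var sink int // prevents the compiler from optimising the work away

func ingestLine(line string) { sink += len(line) } // hypothetical ingestion path

func TestIngestionMemoryStaysBounded(t *testing.T) {
	runtime.GC()
	var before runtime.MemStats
	runtime.ReadMemStats(&before)

	line := strings.Repeat("x", 512)
	for i := 0; i < 1_000_000; i++ {
		ingestLine(line)
	}

	runtime.GC()
	var after runtime.MemStats
	runtime.ReadMemStats(&after)

	const limit = 64 << 20 // 64 MiB of headroom, arbitrary for this sketch
	if grew := int64(after.HeapAlloc) - int64(before.HeapAlloc); grew > limit {
		t.Fatalf("heap grew by %d bytes, expected under %d", grew, limit)
	}
}
```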
Description:
**Dates:** Start Time: Wednesday, February 16, 2022, at 19:58 UTC. End Time: Wednesday, February 16, 2022, at 21:10 UTC. Duration: 1:12:00.
**What happened:** Some logs sent to our service over Syslog using custom ports were not being correctly parsed and were not available for Alerting, Searching, Timelines, Graphs, and Live Tail. Unparsable log lines showed the errors “Unidentifiable Syslog Source” and “Unsupported syslog format.” Logs sent over Syslog without custom ports were working normally.
**Why it happened:** We introduced a bug into our production environment, specifically in a new service called Syslog Forwarder. The bug prevented Syslog lines from being parsed. As a result, any newly submitted Syslog lines sent using custom ports were not parsed and displayed the errors “Unidentifiable Syslog Source” and “Unsupported syslog format.”
**How we fixed it:** We created a hot fix that corrected the bug.
**What we are doing to prevent it from happening again:** We added to our test suite to guard against regressions in the Syslog Forwarder.
Status: Postmortem
Impact: Major | Started At: Feb. 16, 2022, 7:58 p.m.
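The parser itself is not public either. The error strings quoted above (“Unidentifiable Syslog Source”, “Unsupported syslog format”) suggest a classification step that rejects lines matching neither RFC 5424 nor the legacy RFC 3164 layout. The sketch below is an assumption-laden minimal version of such a check, not Mezmo's implementation.

```go
// Illustrative sketch only: classify an incoming syslog line as
// RFC 5424 (starts with "<PRI>1 ") or legacy RFC 3164 (starts with
// "<PRI>" followed by a month abbreviation), and surface an explicit
// error for anything else instead of silently dropping it.
package forwarder

import (
	"errors"
	"regexp"
)

var (
	rfc5424 = regexp.MustCompile(`^<\d{1,3}>1 `)
	rfc3164 = regexp.MustCompile(`^<\d{1,3}>[A-Z][a-z]{2} `)

	ErrUnsupportedFormat = errors.New("unsupported syslog format")
)

// Format reports which syslog flavour a raw line appears to use.
func Format(line string) (string, error) {
	switch {
	case rfc5424.MatchString(line):
		return "rfc5424", nil
	case rfc3164.MatchString(line):
		return "rfc3164", nil
	default:
		return "", ErrUnsupportedFormat
	}
}
```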
Description:
**Dates:** Start Time: Tuesday, February 8, 2022, at 13:17 UTC. End Time: Tuesday, February 8, 2022, at 14:21 UTC. Duration: 1:04:00.
**What happened:** Our Web UI was unresponsive for about 10 minutes. Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines. No data was lost and ingestion was not halted.
**Why it happened:** Our Redis database had a failover, and the services that depend on it, including the Parser, were unable to reconnect after it recovered. The Parser is upstream of many other services, so newly submitted logs were not passed on to downstream services such as Alerting, Live Tail, Searching, Graphing, and Timelines. The Web UI was also intermittently unavailable because it requires a connection to Redis.
**How we fixed it:** We manually restarted the Redis service, which allowed a new master to be elected. After Redis recovered, the Parser, Web UI, and other services were restarted and were able to reestablish a connection to Redis. This restored the Web UI and allowed newly submitted logs to pass from our Parser service to all downstream services. Over a short period of time, these services processed the backlog of logs, and newly submitted logs were again available without delays.
**What we are doing to prevent it from happening again:** We recently added functionality to track the flow rate of newly submitted logs. This new feature requires more memory than expected in the event of a Redis failover, which is why services could not reconnect to Redis. We have increased the memory buffer limits for the relevant portions of our service. We will also add additional Redis monitoring to more quickly detect unhealthy sentinels, and we continue to work on an ongoing project to make all services more tolerant of Redis failovers.
Status: Postmortem
Impact: None | Started At: Feb. 8, 2022, 1:38 p.m.
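The remediation above hinges on services surviving a Redis failover without a manual restart. As a generic illustration (not Mezmo's code), a reconnect loop with capped exponential backoff keeps retrying until a new master is reachable; `dialRedis` below is a hypothetical placeholder for whatever Redis client the service actually uses.

```go
// Illustrative sketch only: reconnect to a dependency (e.g. Redis after
// a failover) with capped exponential backoff instead of giving up and
// requiring a manual restart. dialRedis is a hypothetical placeholder
// for the real client constructor.
package forwarder

import (
	"context"
	"net"
	"time"
)

func dialRedis(ctx context.Context, addr string) (net.Conn, error) { // hypothetical
	var d net.Dialer
	return d.DialContext(ctx, "tcp", addr)
}

// connectWithBackoff retries until it succeeds or the context is cancelled.
func connectWithBackoff(ctx context.Context, addr string) (net.Conn, error) {
	backoff := 100 * time.Millisecond
	for {
		conn, err := dialRedis(ctx, addr)
		if err == nil {
			return conn, nil
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < 10*time.Second {
			backoff *= 2 // stop doubling once the delay passes ~10s
		}
	}
}
```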