Get notified about any outages, downtime or incidents for Mezmo and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Mezmo.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Mezmo:
Component | Status |
---|---|
Log Analysis | Active |
Alerting | Active |
Archiving | Active |
Livetail | Active |
Log Ingestion (Agent/REST API/Code Libraries) | Active |
Log Ingestion (Heroku) | Active |
Log Ingestion (Syslog) | Active |
Search | Active |
Web App | Active |
Pipeline | Active |
Destinations | Active |
Ingestion / Sources | Active |
Processors | Active |
Web App | Active |
View the latest incidents for Mezmo and check for official updates:
Description:
**Dates:** Start Time: Tuesday, April 5, 2022, at 13:20:00 UTC. End Time: Tuesday, April 5, 2022, at 18:20:00 UTC. Duration: 5:00:00.
**What happened:** Alerting was halted for all accounts for the entire duration of the incident. Most alerts – any whose trigger time was more than 15 minutes in the past – were discarded.
**Why it happened:** We restarted our parser service for reasons unrelated to this incident. Any restart of the parser service should be followed by a restart of the alerting service; this second step was overlooked, so all alerts stopped triggering. The need to restart alerting after a parser restart is documented and well known to our infrastructure team, but the parser restart was performed by a team less familiar with the correct procedure.
**How we fixed it:** We manually restarted the alerting service, which then returned to normal operation.
**What we are doing to prevent it from happening again:** The documented restart procedure has been reviewed with all teams that are allowed to restart services. We will also add monitoring of the alerting service and automated notifications so that we learn of similar incidents more quickly in the future.
Status: Postmortem
Impact: Critical | Started At: April 5, 2022, 6:23 p.m.
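To make the discard rule above concrete: the postmortem states that alerts whose trigger time was more than 15 minutes in the past were dropped rather than delivered late. The sketch below is not Mezmo's code; the function name and the cutoff constant are assumptions used only to illustrate that behavior.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical illustration of the 15-minute discard window described in the
# postmortem above: once alerting recovers, alerts that triggered more than
# 15 minutes ago are dropped instead of being delivered late.
MAX_ALERT_AGE = timedelta(minutes=15)  # assumed name/value, not Mezmo's code

def should_deliver(trigger_time: datetime, now=None) -> bool:
    """Return True if the alert is still fresh enough to deliver."""
    now = now or datetime.now(timezone.utc)
    return now - trigger_time <= MAX_ALERT_AGE

# An alert that triggered 20 minutes ago falls outside the window and is discarded.
stale = datetime.now(timezone.utc) - timedelta(minutes=20)
print(should_deliver(stale))  # False
```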
Description:
**Dates:** Start Time: Saturday, February 26, 2022, at 19:51 UTC. End Time: Sunday, February 27, 2022, at 22:13 UTC. Duration: 26:22:00.
**What happened:** Ingestion of new logs to our Syslog endpoint – only for logs sent using a custom port – was intermittently delayed.
**Why it happened:** We recently introduced a new service (Syslog Forwarder) to handle the ingestion of logs sent over Syslog. As the name implies, it forwards logs to downstream services. Logs are sent from a range of ports on the Syslog Forwarder to a range of ports used by clients running on downstream services. This design worked well in our advance testing, which used a limited number of custom ports. Once running in production, however, the Syslog Forwarder needed to connect to a much larger number of custom ports, and the ephemeral port ranges of the clients running on downstream services overlapped with the port ranges used by the Syslog Forwarder. This led to occasional port conflicts when services or clients tried to start; they would retry until they found an open port without conflicts, which delayed ingestion.
**How we fixed it:** We changed the ephemeral port ranges of the clients running on downstream services so that they no longer overlap with the port ranges used by the Syslog Forwarder.
**What we are doing to prevent it from happening again:** The new ephemeral port range has been rolled out and has proven resilient in production. No further work is needed to prevent this kind of incident from happening again.
Status: Postmortem
Impact: None | Started At: Feb. 26, 2022, 7:51 p.m.
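The root cause above is an overlap between a service's source-port range and the operating system's ephemeral (client-side) port range. The snippet below is a hypothetical, Linux-specific sanity check, not part of Mezmo's stack: the forwarder port range is an assumed example, and the ephemeral range is read from the kernel's `ip_local_port_range` setting.

```python
from pathlib import Path

# Hypothetical check illustrating the fix described above: verify that the
# port range a forwarder binds to does not overlap the kernel's ephemeral
# port range (Linux exposes it via procfs).
FORWARDER_PORTS = range(20000, 25001)  # assumed range, for illustration only

def ephemeral_port_range() -> range:
    low, high = Path("/proc/sys/net/ipv4/ip_local_port_range").read_text().split()
    return range(int(low), int(high) + 1)

def overlaps(a: range, b: range) -> bool:
    return a.start < b.stop and b.start < a.stop

if __name__ == "__main__":
    eph = ephemeral_port_range()
    if overlaps(FORWARDER_PORTS, eph):
        print(f"Conflict: forwarder ports {FORWARDER_PORTS} overlap ephemeral range {eph}")
    else:
        print("No overlap between forwarder ports and the ephemeral port range")
```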
Description:
**Dates:** Start Time: Friday, February 18, 2022, at 00:10 UTC. End Time: Thursday, February 24, 2022, at 23:43 UTC. Duration: 167:33:00.
**What happened:** Ingestion of new logs to our Syslog endpoint was intermittently failing.
**Why it happened:** We recently introduced a new service (Syslog Forwarder) to handle the ingestion of logs sent over Syslog. As the name implies, it forwards logs to downstream services. It was designed to send all logs submitted for each account to a single port opened on the downstream services; no load balancing was implemented in the original design, which performed well in our advance testing. Once in production, however, it became apparent that some customer accounts submit logs at a higher volume than the downstream services could process. When this happened, log lines were buffered in memory by the Syslog Forwarder, memory usage grew until the pods crashed, and any log lines held on those pods were lost and never ingested.
**How we fixed it:** We improved the design of the Syslog Forwarder by adding a pool of connections to the downstream services – in effect, adding traffic shaping to the Syslog Forwarder.
**What we are doing to prevent it from happening again:** The new architecture has been rolled out and has proven resilient in production. No further work is needed to prevent this kind of incident from happening again.
Status: Postmortem
Impact: Major | Started At: Feb. 24, 2022, 10:45 p.m.
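The fix described above, a pool of connections with traffic shaping, can be sketched as a bounded queue feeding a fixed set of workers, so that a high-volume account applies backpressure instead of growing memory without limit. This is only an illustrative sketch under assumed names (`downstream_send`, the pool size, the queue bound); it is not Mezmo's implementation.

```python
import queue
import threading

# Hypothetical sketch of "connection pool + traffic shaping": a bounded queue
# provides backpressure instead of unbounded in-memory buffering, and a fixed
# pool of workers fans log lines out across several downstream connections.
POOL_SIZE = 4
buffer = queue.Queue(maxsize=10_000)  # bounded: a full queue blocks producers

def downstream_send(worker_id: int, line: str) -> None:
    # Placeholder for writing to one pooled downstream connection.
    pass

def worker(worker_id: int) -> None:
    while True:
        line = buffer.get()
        if line is None:          # sentinel tells the worker to stop
            break
        downstream_send(worker_id, line)
        buffer.task_done()

workers = [threading.Thread(target=worker, args=(i,), daemon=True) for i in range(POOL_SIZE)]
for t in workers:
    t.start()

# Producer side: block (rather than grow memory) when the buffer is full.
for n in range(100):
    buffer.put(f"log line {n}")

buffer.join()                     # wait until all queued lines are forwarded
for _ in workers:
    buffer.put(None)
```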
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.