Get notified about any outages, downtime, or incidents for Mezmo and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for Mezmo.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Mezmo:
Component | Status |
---|---|
Log Analysis | Active |
Alerting | Active |
Archiving | Active |
Livetail | Active |
Log Ingestion (Agent/REST API/Code Libraries) | Active |
Log Ingestion (Heroku) | Active |
Log Ingestion (Syslog) | Active |
Search | Active |
Web App | Active |
Pipeline | Active |
Destinations | Active |
Ingestion / Sources | Active |
Processors | Active |
Web App | Active |
View the latest incidents for Mezmo and check for official updates:
Description:
**Dates:** Start Time: Monday, November 22, 2021, at 19:01 UTC; End Time: Tuesday, November 23, 2021, at 02:04 UTC; Duration: 7:03:00
**What happened:** Newly submitted logs were not immediately available for Alerting, Searching, Live Tail, Graphing, and Timelines. Some accounts (about 25%) were affected more than others. For all accounts, the ingestion of logs was not interrupted and no data was lost.
**Why it happened:** Upon investigation, we discovered that the service which parses all incoming log lines was working very slowly. This service is upstream of all our other services, such as alerting, live tail, archiving, and searching; consequently, all those services were also delayed. We isolated the slow parsing to the specific content of certain log lines. These log lines exposed an inefficiency in our line parsing service that resulted in exponential growth in the time needed to parse them; this in turn created a bottleneck that delayed the parsing of other log lines. The inefficiency had been present for some time, but went undetected until one account started sending a large volume of these problematic lines.
**How we fixed it:** The line parsing service was updated to use a new algorithm that avoids the worst-case behavior of the original and improves line-parsing performance in general. From then on, the parsing service just needed time to process the backlog of logs sent to us by customers. Likewise, the downstream services (alerting, live tail, archiving, searching) needed time to process the logs now being sent to them by the parsing service. The recovery was quicker for about 75% of our customers and slower for the other 25%.
**What we are doing to prevent it from happening again:** The new parsing methodology has improved our overall performance significantly. We are also actively pursuing further optimizations.
Status: Postmortem
Impact: Minor | Started At: Nov. 22, 2021, 7:01 p.m.
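The postmortem does not name the exact parsing construct, but the failure mode it describes (parse time growing exponentially for certain log lines while other lines queue behind them) is characteristic of backtracking text parsers, for example regular expressions with nested quantifiers. The following is a minimal, purely illustrative Python sketch of that failure mode and of the kind of rewrite that avoids it; the patterns and inputs are assumptions, not Mezmo's actual code:

```python
import re
import time

# Illustrative only: nested quantifiers make this pattern backtrack
# exponentially on lines that *almost* match ("catastrophic backtracking").
SLOW_PATTERN = re.compile(r"^(a+)+$")

# An equivalent pattern with no nested quantifiers: it accepts the same
# strings but is checked in a single linear pass.
FAST_PATTERN = re.compile(r"^a+$")

def time_match(pattern, line):
    """Return how long it takes to (fail to) match one log line."""
    start = time.perf_counter()
    pattern.match(line)
    return time.perf_counter() - start

if __name__ == "__main__":
    # Each extra character roughly doubles the slow pattern's work, which is
    # how a handful of unusual log lines can stall an entire parsing queue.
    for n in (18, 20, 22):
        line = "a" * n + "b"   # almost matches, then fails at the last char
        print(f"len={n:2d}  slow={time_match(SLOW_PATTERN, line):.4f}s  "
              f"fast={time_match(FAST_PATTERN, line):.6f}s")
```

Linear-time alternatives (non-backtracking regex engines such as RE2, or hand-written single-pass parsers) bound the worst case regardless of input content, which matches the "avoids the worst-case behavior" goal described above.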
Description:
**Start Time:** Monday, November 8, 2021, at 23:28 UTC
**End Time:** Tuesday, November 9, 2021, at 00:16 UTC
**Duration:** 0:48:00
**What happened:** Our Web UI returned the error message “This site can’t be reached” when users tried to log in or load pages. The ingestion of logs was unaffected.
**Why it happened:** The node our web service was running on had a failure in its network management software and became unreachable. Furthermore, the web service was running on only a single node, which is atypical: it usually runs on multiple nodes at once to improve performance and allow for redundancy. Both conditions were necessary for the Web UI to become unavailable.
**How we fixed it:** We moved the web service to another node with functioning network management software, which made the Web UI available again. Later, we restarted the unreachable node, which restored it to normal usage.
**What we are doing to prevent it from happening again:** We expect both necessary conditions (the failure of the network management software and the fact that the web service was running on a single node) to be resolved by an already planned migration of our entire service to a new cloud-based environment. We are currently building monitoring of the availability of our Web UI so we can learn of any future failures as soon as possible.
Status: Postmortem
Impact: None | Started At: Nov. 9, 2021, midnight
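The postmortem above mentions building monitoring for Web UI availability. Here is a minimal Python sketch of what an external availability probe can look like; the URL, timeout, and up/down criteria are illustrative assumptions, not Mezmo's actual monitoring:

```python
import urllib.error
import urllib.request

# Hypothetical probe target and timeout; Mezmo's real monitoring is not public.
WEB_UI_URL = "https://app.logdna.com/"
TIMEOUT_SECONDS = 10

def check_web_ui(url=WEB_UI_URL, timeout=TIMEOUT_SECONDS):
    """Probe the Web UI once and return (is_up, detail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # The server answered successfully; report it as reachable.
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 500).
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS or connection failure: the "This site can't be reached" case.
        return False, str(exc)

if __name__ == "__main__":
    up, detail = check_web_ui()
    print(f"Web UI {'up' if up else 'DOWN'}: {detail}")
```

In practice a probe like this would run on a schedule from several locations and alert an on-call engineer only after consecutive failures, so a single transient network blip does not page anyone.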
Description:
**Start Time:** Thursday, October 28, 2021, at 16:56:52 UTC
**End Time:** Thursday, October 28, 2021, at 22:17:24 UTC
**Duration:** 5:20:32
**What happened:** Email notifications of all kinds, including those from alerts, were delayed for about 5 hours. Notifications sent by Slack and Webhooks were not affected.
**Why it happened:** Our email service provider (Sparkpost) experienced an incident that caused delays for all emails from the LogDNA service. We rely on this service to deliver email of all kinds, including notifications for alerts. Email messages were delayed and queued until our email service provider was able to recover. More information on the incident can be found on Sparkpost’s status page: [https://status.sparkpost.com/incidents/bwl8dr6gwmts?u=ydzrh5x205pf](https://status.sparkpost.com/incidents/bwl8dr6gwmts?u=ydzrh5x205pf)
**How we fixed it:** No remedial action was possible by LogDNA. We waited until Sparkpost, our email service provider, resolved the incident.
**What we are doing to prevent it from happening again:** For this type of incident, LogDNA cannot take proactive preventive measures.
Status: Postmortem
Impact: Major | Started At: Oct. 28, 2021, 6:13 p.m.
Description:
**Start Time:** Thursday, October 7, 2021, at 17:52 UTC
**End Time:** Thursday, October 7, 2021, at 18:46 UTC
**Duration:** 0:54:00
**What happened:** Our Web UI returned the error message “This site can’t be reached” when some users tried to log in or load pages. The ingestion of logs was unaffected.
**Why it happened:** The Telia carrier service in Europe experienced a major network routing outage caused by a faulty configuration update. The routing policy contained an error that impacted traffic to our service hosting provider, Equinix Metal. The Washington, DC data center that houses our services was impacted. During this incident, the [app.logdna.com](http://app.logdna.com) site was unreachable for some customers, depending on their location.
**How we fixed it:** No remedial action was possible by LogDNA. We waited until Equinix Metal, our service hosting provider, resolved the incident.
**What we are doing to prevent it from happening again:** For this type of incident, LogDNA cannot take proactive preventive measures.
Status: Postmortem
Impact: Major | Started At: Oct. 7, 2021, 5:52 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.