Outage and incident data over the last 30 days for Mezmo.
OutLogger tracks the status of these components for Mezmo:
| Component | Status |
|---|---|
| Log Analysis | Active |
| Alerting | Active |
| Archiving | Active |
| Livetail | Active |
| Log Ingestion (Agent/REST API/Code Libraries) | Active |
| Log Ingestion (Heroku) | Active |
| Log Ingestion (Syslog) | Active |
| Search | Active |
| Web App | Active |
| Pipeline | Active |
| Destinations | Active |
| Ingestion / Sources | Active |
| Processors | Active |
| Web App | Active |
View the latest incidents for Mezmo and check for official updates:
Description: **Dates:** Start Time: Friday, February 26, 2021, at 06:43 UTC End Time: Friday, February 26, 2021, at 20:42 UTC **What happened:** The insertion of newly submitted logs stopped entirely for all accounts for about 3 hours. Logs were still available in Live Tail but not for searching, graphing, and timelines. The ingestion of logs from clients was not interrupted and no data was lost. For more than 95% of newly submitted logs, log processing returned to normal speeds within 3 hours. All logs submitted during the 3-hour pause were available again about 30 minutes later. For less than 5% of newly submitted logs, log processing returned to normal speeds gradually, and logs submitted during the 3-hour pause also gradually became available. This impact was limited to about 12% of accounts. The incident was closed when logs from all time periods for all accounts were entirely available. **Why it happened:** Our service ran out of a set of resources that manage pre-sharding on the clusters that store logs, an operation that ensures new logs are promptly inserted into the clusters. This happened because of several simultaneous changes to our infrastructure that didn’t account for the need for more resources, particularly on clusters with a large number of shards relative to their overall storage capacity. The insertion of new logs slowed down and the backlog of unprocessed logs grew. Eventually, the portion of our service that processes new logs was unable to keep up with demand. **How we fixed it:** We restarted the portion of our service that processes newly submitted logs. During the recovery, we prioritized restoring logs submitted in the last day. 95% of accounts were fully recovered after 3.5 hours. **What we are doing to prevent it from happening again:** We’ve increased the scale of the set of resources that ensure logs are processed promptly by adding more servers for these resources to run on. We’ve also added alerting for when these resources are reaching their limit.
Status: Postmortem
Impact: Minor | Started At: Feb. 26, 2021, 6:43 a.m.
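The prevention step in this postmortem comes down to capacity alerting: page before the pre-sharding resource pool is exhausted, not after. The sketch below is a minimal, hypothetical illustration of that idea in Python; the `Cluster` fields, the 80% threshold, and the cluster names are assumptions for the example, not Mezmo's actual tooling or metrics.

```python
# Hypothetical sketch: warn when a cluster's pre-sharding resource pool nears
# its limit. All names, fields, and thresholds are illustrative assumptions.
from dataclasses import dataclass

ALERT_THRESHOLD = 0.80  # alert at 80% utilization, well before exhaustion

@dataclass
class Cluster:
    name: str
    shards_in_use: int    # shards currently allocated by pre-sharding
    shard_capacity: int   # total shards the resource pool can manage

def check_preshard_headroom(clusters: list[Cluster]) -> list[str]:
    """Return warnings for clusters whose pre-sharding pool is nearly exhausted."""
    warnings = []
    for c in clusters:
        utilization = c.shards_in_use / c.shard_capacity
        if utilization >= ALERT_THRESHOLD:
            warnings.append(
                f"{c.name}: pre-sharding pool at {utilization:.0%} "
                f"({c.shards_in_use}/{c.shard_capacity} shards)"
            )
    return warnings

if __name__ == "__main__":
    fleet = [
        Cluster("logs-cluster-a", shards_in_use=410, shard_capacity=480),
        Cluster("logs-cluster-b", shards_in_use=120, shard_capacity=480),
    ]
    for w in check_preshard_headroom(fleet):
        print("ALERT:", w)  # a production check would page on-call instead
```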
Description: **Dates:** Start Time: Thursday, January 14, 2021, at 19:42 UTC End Time: Thursday, January 14, 2021, at 20:27 UTC Duration: 0:45:00 **What happened:** Our WebUI became unavailable and ingestion of new logs stopped for 45 minutes. Logs were automatically resent later and ingested successfully for customers using our ingestion client agent. **Why it happened:** The certificate used by all our services expired. Consequently, all API calls to our service failed, which caused our WebUI to fail and ingestion of new logs to stop. **How we fixed it:** We renewed the certificate and applied it to all affected services. Our WebUI became responsive again and ingestion resumed. Since no logs had been ingested for about 45 minutes, our service had a moderately large backlog to process. As it caught up, users experienced delays in searching, graphing, and timelines for newly submitted logs. **What we are doing to prevent it from happening again:** We’re tightening our internal notifications of upcoming expiration dates for all certificates our service relies upon.
Status: Postmortem
Impact: Critical | Started At: Jan. 14, 2021, 8:05 p.m.
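The prevention step here is routine certificate-expiry monitoring. Below is a minimal sketch of such a check using Python's standard `ssl` and `socket` modules; the hostnames and the 30-day warning window are placeholder assumptions, not Mezmo's endpoints or policy.

```python
# Hypothetical sketch: warn when a TLS certificate is close to expiring.
# Hostnames and the warning window are illustrative assumptions.
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # notify well before the expiration date

def days_until_expiry(host: str, port: int = 443) -> int:
    """Return the number of days until the host's TLS certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (not_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    for host in ["api.example.com", "logs.example.com"]:  # placeholder hosts
        remaining = days_until_expiry(host)
        if remaining <= WARN_DAYS:
            print(f"WARNING: certificate for {host} expires in {remaining} days")
```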
Description: **Dates:** The incident was opened on December 17, 2020 - 23:29 UTC. Our service was fully operational by December 18, 2020 - 12:30 UTC. The incident was officially closed on December 20, 2020 - 03:49 UTC. **What happened:** All services were unavailable for about eight hours. For an additional four hours, services were available but there were significant delays in searching, graphing, and timelines for newly submitted logs. Additionally, all logs submitted during the first six hours of the incident were never processed by our service and were unavailable in the UI, even after our service was fully operational. **Why it happened:** Our hosting provider had a major power failure that lasted almost five hours. The hardware that our service runs on was unavailable and none of our services could operate. More details: [https://status.equinixmetal.com/incidents/pfgmgy1fnjcp](https://status.equinixmetal.com/incidents/pfgmgy1fnjcp) **How we fixed it:** Once our provider was back online, we gradually restarted all our services. This took time and manual intervention because all our services had been taken down ungracefully by the outage. Around December 18, 2020 - 07:54 UTC, services became operational and logs began to be ingested again. Since no logs had been ingested for about eight hours, our service had a large backlog to process. As it caught up, users experienced delays in searching, graphing, and timelines for newly submitted logs. The backlog was fully processed around December 18, 2020 - 12:30 UTC and services were once again fully operational. Logs submitted during the first six hours of the incident (around December 17, 2020, 23:00 UTC to December 18, 2020, 5:00 UTC) remained unavailable in the UI. Normally, if our service is temporarily unavailable, logs can be resubmitted and successfully processed. In this case, the sudden loss of power brought down our services ungracefully, abruptly interrupting write operations as we processed logs. This resulted in partial and corrupt writes, which made our service unable to determine, for the resubmitted logs, where log lines began. In effect, this made logs resubmitted from that six-hour period unreadable and unable to be processed. The incident was kept open as we made attempts to read and process these logs, but these efforts were ultimately unsuccessful. After the incident was closed, we developed the means to restore archives of these logs to all customers with version 3 of archiving enabled. The restoration of archives is expected to begin during the week of January 18th. **What we are doing to prevent it from happening again:** We are developing changes to how we write logs so that in a similar event our service will not lose track of the start of log lines and will still be able to read and process them.
Status: Postmortem
Impact: Minor | Started At: Dec. 17, 2020, 11:29 p.m.
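The remediation describes making log writes recoverable after a torn write, so that a reader can still find where later log lines begin. One common technique is to frame each record with a length prefix and checksum; the Python sketch below illustrates that general approach under those assumptions. It is not a description of Mezmo's actual on-disk format.

```python
# Hypothetical sketch: frame each log line as [length][crc32][payload] so a
# torn (partial) write after a power loss only costs the final record, not the
# reader's ability to locate where later log lines begin.
import struct
import zlib

HEADER = struct.Struct(">II")  # 4-byte length + 4-byte CRC32, big-endian

def append_record(path: str, line: bytes) -> None:
    """Append one framed log line to the file."""
    with open(path, "ab") as f:
        f.write(HEADER.pack(len(line), zlib.crc32(line)) + line)

def read_records(path: str):
    """Yield intact records; stop at the first truncated or corrupt frame."""
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                return  # clean end of file or torn header
            length, crc = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                return  # torn or corrupt record: drop it, keep everything before
            yield payload

if __name__ == "__main__":
    append_record("demo.log", b'{"level":"info","msg":"hello"}')
    for rec in read_records("demo.log"):
        print(rec.decode())
```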
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.