Get notified about any outages, downtime or incidents for Superhuman and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Superhuman.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Outlogger tracks the status of these components for Superhuman:
| Component | Status |
| --- | --- |
| Email sending | Active |
| Google Apps Gmail | Active |
| HubSpot APIs | Active |
| Salesforce | Active |
| Superhuman | Active |
View the latest incidents for Superhuman and check for official updates:
Description: According to Google (and our metrics), everything should be back to normal. Thank you for bearing with us. Emails that were sent during the outage, and which you didn't cancel or retry yourself, have now been retried and sent successfully. If you have further questions, please get in touch: [email protected].
Status: Resolved
Impact: Minor | Started At: March 13, 2019, 2:07 a.m.
Description: This maintenance was completed successfully. We're monitoring things closely, but everything appears to be back to normal. Please reach out to me if you have any questions: [email protected]. Conrad
Status: Resolved
Impact: None | Started At: March 2, 2019, 7:23 a.m.
Description: Hello all, Yesterday, Superhuman was down for nearly two hours due to a failure with our database. I am deeply sorry for this. We know that email is mission critical, and that this much downtime is unacceptable.

During the downtime, emails could not be sent and it was not possible to log into Superhuman. If you were already logged in, you could still receive emails; if you were not logged in, you could not log in.

The failure was due to two simultaneous issues:

1. Our database was running low on disk space.
2. One of the availability zones that our database runs in was unable to provision more disk space.

For our database, we use Google Cloud SQL in High Availability mode, along with the built-in feature to “automatically increase disk space”. We failed to realize two important things about this setup:

1. The automatic disk space increase is very conservative. Based on current load, it would only allocate enough space for a few additional hours at peak traffic.
2. Increasing disk space is an operation that requires both availability zones to be active.

We spoke with Google Cloud Support, who explained all of this in detail, and we then decided to temporarily disable high availability so that we could resize the primary database.

This is the timeline of events:

- 09:40. The auto-scaler detected we had less than 25 GB of free space and started to increase capacity, but this failed.
- 12:03. Our database ran out of disk space.
- 12:03-12:11. We tried to manually increase disk space and fail over to another zone, but both attempts failed.
- 12:11. We opened a ticket with Google Cloud Support.
- 12:59. We were on the phone with Google, who provided a detailed explanation of the issue.
- 13:30. We disabled high availability on our database and resized it in the working zone.
- 13:34. Our database was back up again.
- 13:34-13:50. Clients began to reconnect and send email.
- 13:59. Normal operations resumed, though our database is not highly available for the time being.

As a result of this incident, we are going to make several changes:

1. Tonight, we will re-enable high availability on our database. This will cause roughly 10 minutes of downtime, but we will do it at our lowest-traffic time: 11:50 p.m. PST.
2. We have built our own database auto-scaler that triggers well before the built-in auto-scaler.
3. We have added alerting on database disk-utilization metrics so that we can pre-empt any similar failures.
4. We will fix the client so that if the backend is unexpectedly down, it does not log you out, and you can continue to read and process email.
5. We are going to practice failing over to our secondary read replica. This will help if we are ever again in a situation where both the primary and its replica are not functioning.

Again, I am truly sorry that this happened. These steps will ensure that we do not have a similar incident in the future. If you have any questions, please just ask: [email protected]. Conrad, CTO
Status: Postmortem
Impact: Critical | Started At: Feb. 28, 2019, 8:08 p.m.
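The postmortem above mentions building a custom auto-scaler and alerting on disk-utilization metrics, but does not include Superhuman's actual implementation. The following is a minimal, hypothetical sketch of that kind of check: the `get_disk_utilization` and `current_disk_size_gb` helpers are assumed placeholders (in practice they would read the Cloud SQL `disk/utilization` metric and the instance's provisioned size), and the resize step shells out to the `gcloud sql instances patch --storage-size` command.

```python
import subprocess

# Hypothetical illustration of the remediation described in the postmortem:
# alert on disk utilization early and grow the Cloud SQL disk well before
# the built-in auto-scaler would. Not Superhuman's actual implementation.

INSTANCE = "my-cloudsql-instance"   # assumed instance name
THRESHOLD = 0.70                    # act at 70% disk utilization
GROWTH_STEP_GB = 100                # how much extra disk to provision


def get_disk_utilization(instance: str) -> float:
    """Return current disk utilization in the range 0.0-1.0.

    Placeholder: in practice this would query the Cloud Monitoring metric
    cloudsql.googleapis.com/database/disk/utilization for the instance.
    """
    raise NotImplementedError


def current_disk_size_gb(instance: str) -> int:
    """Placeholder: look up the instance's currently provisioned disk size."""
    raise NotImplementedError


def resize(instance: str, new_size_gb: int) -> None:
    # `gcloud sql instances patch --storage-size` increases the disk size.
    subprocess.run(
        ["gcloud", "sql", "instances", "patch", instance,
         f"--storage-size={new_size_gb}GB"],
        check=True,
    )


def check_and_scale() -> None:
    utilization = get_disk_utilization(INSTANCE)
    if utilization >= THRESHOLD:
        new_size = current_disk_size_gb(INSTANCE) + GROWTH_STEP_GB
        print(f"Disk at {utilization:.0%}; resizing to {new_size} GB")
        resize(INSTANCE, new_size)


if __name__ == "__main__":
    check_and_scale()
```

Run on a schedule (for example, every few minutes from a cron job), a check like this would page or resize long before the conservative built-in threshold the postmortem describes.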
Description: This incident has been resolved.
Status: Resolved
Impact: None | Started At: Feb. 8, 2019, 5:19 a.m.
Description: We’ve now fixed the underlying issue, and notified all users whose emails did not send.
Status: Resolved
Impact: Major | Started At: Dec. 11, 2018, 11:20 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.