Last checked: 6 minutes ago
Get notified about any outages, downtime, or incidents for ShipHawk and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for ShipHawk.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now
OutLogger tracks the status of these components for ShipHawk (a sketch for polling these statuses programmatically follows the table):
Component | Status |
---|---|
ShipHawk Application | Active |
ShipHawk Website | Active |
Carrier/3PL Connectors | Active |
DHL eCommerce | Active |
FedEx Web Services | Active |
LTL / Other Carrier Web Services | Active |
UPS Web Services | Active |
USPS via Endicia | Active |
USPS via Pitney Bowes | Active |
ShipHawk APIs | Active |
Shipping APIs | Active |
WMS APIs | Active |
ShipHawk Application | Active |
WMS | Active |
ShipHawk Instances | Active |
sh-default | Active |
sh-p-2 | Active |
System Connectors | Active |
Acumatica App | Active |
Amazon Web Services | Active |
Magento | Active |
Oracle NetSuite SuiteApp | Active |
Shopify App | Active |
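If ShipHawk's status page is hosted on Atlassian Statuspage (an assumption, based only on the status.shiphawk.com link in the postmortem below), the same component states can be polled programmatically via the public components endpoint. A minimal sketch, with the URL treated as an assumption:

```python
# Minimal sketch: poll component states from a Statuspage-hosted status page.
# Assumes status.shiphawk.com is an Atlassian Statuspage instance exposing the
# public /api/v2/components.json endpoint; adjust the URL if it is not.
import requests

STATUS_URL = "https://status.shiphawk.com/api/v2/components.json"  # assumed endpoint

def fetch_component_states(url: str = STATUS_URL) -> dict[str, str]:
    """Return a mapping of component name -> current status (e.g. 'operational')."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    payload = response.json()
    return {c["name"]: c["status"] for c in payload.get("components", [])}

if __name__ == "__main__":
    for name, status in fetch_component_states().items():
        print(f"{name}: {status}")
```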
View the latest incidents for ShipHawk and check for official updates:
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: Oct. 22, 2021, 5 p.m.
Description: **Incident summary**

During an internal process that archives data, we noticed that disk usage was beginning to increase and decided to upgrade the volume proactively. Due to internal AWS optimization processes, the upgrade created slowness in the system, which later led to the incident. We promoted a replica database to restore the service, and service was restored at 11:45am PST.

## **Leadup**

* 9:30am PST - we started an internal process that archives data
* 10:30am PST - internal monitoring systems alerted on rapidly increasing disk usage
* 10:35am PST - the volume attached to the database servers was upgraded

This change resulted in degraded database performance.

## **Fault**

Due to internal AWS optimization processes, the volume upgrade created slowness in the system, which later led to the incident starting at 10:42am PST.

## **Impact**

Customers hosted on shared instances were not able to use the system from 10:42am PST to 11:45am PST. Affected services:

* Web Portal
* Workstations
* ShipHawk API

## **Detection**

The incident was detected by the automated monitoring system and was reported by multiple customers.

## **Response**

After receiving the alerts from the monitoring system, the engineering team connected with ShipHawk Customer Success and described the level of impact. The incident notification was posted to [https://status.shiphawk.com/](https://status.shiphawk.com/).

## **Recovery**

Three steps were performed to recover the service:

* the primary database node was disabled
* the database replica was promoted to primary
* the OLD primary node hostname was pointed to the NEW primary node by updating DNS records

## **Timeline**

All times are in PST.

**10/15/2021:**

* 10:00am - an internal process that archives data started
* 10:30am - internal monitoring systems alerted on rapidly increasing disk usage
* 10:35am - the volume attached to the primary database node was upgraded
* 10:42am - the database performance degraded
* 10:43am - the monitoring system alerted on multiple errors and API unresponsiveness
* 10:50am - the engineering team began an investigation of the incident
* 11:20am - the root cause was understood and the team created an action plan
* 11:30am - the primary node was disabled and the replica was promoted to primary
* 11:40am - the OLD primary node hostname was pointed to the NEW primary node by updating DNS records
* **11:45am - the service was fully restored**
* 1:30pm - a new database replica was created and the sync process started

**10/16/2021:**

* 2:30pm - the new database replica sync process finished

## **Root cause identification: The Five Whys**

1. The application had an outage because the database performance degraded.
2. The database performance degraded because the volume attached to the primary database node was upgraded.
3. The volume was upgraded because disk usage increased rapidly.
4. Disk usage increased rapidly because we ran a data archiving process that used more disk than expected.
5. The problem was not identified during testing because the data archiving process was tested in an environment with a different primary/replica database configuration.

## **Root cause**

The difference between the configurations of the test and production systems meant that an inefficiency in the data archiving process was missed.

## **Lessons learned**

* The test environment requires configuration changes to more closely resemble production
* The data archiving process should start more slowly
* The internal process for promoting replica databases to primary needs to be faster
Status: Postmortem
Impact: Critical | Started At: Oct. 15, 2021, 6:20 p.m.
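The Leadup and Detection sections of the postmortem above describe automated monitoring alerting on rapidly increasing disk usage. As a rough illustration only (ShipHawk's actual monitoring stack is not described), a disk-usage check of this kind can be as simple as:

```python
# Rough illustration of a disk-usage alert check, not ShipHawk's actual monitoring.
# The threshold, mount point, and alert mechanism are placeholders.
import shutil

ALERT_THRESHOLD = 0.80  # alert once the volume is 80% full (placeholder value)

def check_disk_usage(path: str = "/") -> None:
    usage = shutil.disk_usage(path)      # total, used, free in bytes
    used_fraction = usage.used / usage.total
    if used_fraction >= ALERT_THRESHOLD:
        # In a real system this would page on-call or post to an alerting service.
        print(f"ALERT: {path} is {used_fraction:.0%} full")
    else:
        print(f"OK: {path} is {used_fraction:.0%} full")

if __name__ == "__main__":
    check_disk_usage()
```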
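The Recovery section of the same postmortem lists three steps: disable the primary database node, promote the replica, and repoint the old primary's hostname via DNS. The postmortem does not name the database engine or DNS provider; the sketch below assumes a PostgreSQL primary/replica pair and AWS Route 53, with placeholder hostnames and IDs, purely to illustrate the shape of such a failover.

```python
# Hypothetical failover sketch based on the recovery steps above. PostgreSQL and
# Route 53 are assumptions; hostnames, zone ID, and credentials are placeholders.
import boto3
import psycopg2

OLD_PRIMARY = "db-primary.internal.example.com"  # placeholder hostname
NEW_PRIMARY = "db-replica.internal.example.com"  # placeholder hostname (the replica)
HOSTED_ZONE_ID = "Z0000000EXAMPLE"               # placeholder Route 53 hosted zone

# Step 1: take the old primary out of service (mechanism depends on the stack;
# here traffic is redirected away from it by the DNS change in step 3).

# Step 2: promote the replica to primary (pg_promote() exists in PostgreSQL 12+).
conn = psycopg2.connect(host=NEW_PRIMARY, dbname="postgres", user="admin")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("SELECT pg_promote(true, 60);")  # wait up to 60s for promotion
conn.close()

# Step 3: point the old primary's hostname at the new primary by updating DNS.
route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Point old primary hostname at the promoted replica",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": OLD_PRIMARY,
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": NEW_PRIMARY}],
            },
        }],
    },
)
```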
Description: PrintNode services are all operating normally. See https://www.printnode.com/en/status for details.
Status: Resolved
Impact: Minor | Started At: Aug. 10, 2021, 5:05 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.