Last checked: a minute ago
Get notified about any outages, downtime or incidents for ShipHawk and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for ShipHawk.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now

OutLogger tracks the status of these components for ShipHawk:
Component | Status
---|---
ShipHawk Application | Active |
ShipHawk Website | Active |
Carrier/3PL Connectors | Active |
DHL eCommerce | Active |
FedEx Web Services | Active |
LTL / Other Carrier Web Services | Active |
UPS Web Services | Active |
USPS via Endicia | Active |
USPS via Pitney Bowes | Active |
ShipHawk APIs | Active |
Shipping APIs | Active |
WMS APIs | Active |
ShipHawk Application | Active |
WMS | Active |
ShipHawk Instances | Active |
sh-default | Active |
sh-p-2 | Active |
System Connectors | Active |
Acumatica App | Active |
Amazon Web Services | Active |
Magento | Active |
Oracle NetSuite SuiteApp | Active |
Shopify App | Active |
View the latest incidents for ShipHawk and check for official updates:
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: Dec. 6, 2021, 4:30 p.m.
Description:

## **Incident summary**

Between 6:30 am and 3:30 pm PST, several customers experienced slowness of the application.

## **Leadup**

In preparation for the peak season, we provisioned additional servers for the anticipated volume. Our customers collectively generated larger order, shipment and rate request volumes than we expected. Additionally, FedEx, UPS and other carrier APIs responded slowly to requests made by our system. The combination of these issues slowed down ShipHawk API response times for some customers.

## **Fault**

With the load higher than expected, API response times slowed down. The automated load balancer marked some of the slower servers as unhealthy, which pushed more load onto the healthy servers and slowed the system down even further. The engineering team decided to add more servers to handle the extra load, but the added resources did not help: adding new resources for rating consumed many more database connections, which resulted in errors and did not relieve the performance degradation.

## **Impact**

ShipHawk users experienced slowness of the service from 6:30 am PST until 3:30 pm PST. Some API requests failed with timeouts, and syncing with external systems was delayed. A total of 9 urgent support cases were submitted to ShipHawk during the impact window.

## **Detection**

The issue was first detected by monitoring systems at 6:30 am PST and was then reported by customers at 6:42 am PST.

## **Response**

Customers were notified about the slowness via our status page at 6:44 am PST. We responded to the incident with all possible urgency and ultimately made the changes needed to solve the problem, while continuing to process volumes similar to Black Friday and Cyber Monday through the end of the week.

## **Recovery**

We needed to add more servers to process the extra API requests, but doing so created too many connections to the database. The solution was to implement a database connection pooling system that allowed us to optimize database connection usage (see the connection-pool sketch below). Around 3:00 pm PST, the new connection pool system was activated and we were able to add more resources to process API requests and background jobs. That resolved the slowness at 3:30 pm PST. To further reduce the chances of another incident, we set up redundant connection poolers and provisioned more resources to production throughout the night. That proved effective the next day (Tuesday 11/30), when ShipHawk handled a similar API load and response times remained stable throughout.

## **Timeline**

All times in PST.

**Monday, 29 November**

* 6:30 am - monitoring systems alerted on an increase in average API response time and an increased number of "499 Client Closed Request" errors
* 6:32 am - engineering team started investigating the slowness
* 6:42 am - customers reported slowness of Item Fulfillments sync and overall application slowness
* 6:44 am - Status Page was updated with details about the incident
* 7:30 am - API load balancer reconfigured to prevent a cascade effect in which the load balancer removed slow instances from the pool, adding more load to healthy instances and making them slow or unhealthy too (see the load-balancer sketch below)
* 8:00 am - application servers reconfigured, with more resources moved from backend services to API services to better match the type of load
* 9:00 am - existing servers upgraded to more powerful EC2 instances; extra servers provisioned to handle the extra load
* 10:00 am - monitoring systems detected errors related to very high use of database connections, which prevented us from provisioning more servers
* 11:00 am - decision made to configure a new database connection pooling system to mitigate the database connection issue and allow provisioning more resources
* 3:00 pm - new database connection pooling system installed and configured
* 3:30 pm - confirmed that the incident was resolved

**Tuesday, 30 November**

* 12:00 am - 4:30 am - additional application and background processing servers added for redundancy

## **Root cause identification: The Five Whys**

1. The application had degraded performance because of added load on the API and slow carrier response times.
2. The system could not automatically absorb the added load because database connections were exhausted.
3. Because we added extra resources and did not expect this to cause an issue with database connections.
4. Because we did not have load tests that would have identified this.
5. Because we had not previously felt this kind of testing was necessary until we reached this level of scale.

## **Root cause**

Suboptimal use of database connections led to issues with application scaling. The team did not have an immediate solution because the issue had not been replicated in testing.

## **Lessons learned**

* We need more application load testing in place.
* Carrier API response slowness can cause slowness in the application.
* Customers with highly volatile API usage should be isolated from other multi-tenant users.

## **Corrective actions**

1. Introduce new load testing processes.
2. Implement a better automated scaling system for peak load periods.
3. Prioritize solutions that mitigate response time delays caused by carrier response time delays.
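The postmortem does not say which pooling technology was used. As an illustration only, the sketch below shows application-level connection pooling in Node.js with the `pg` library; the library choice, pool sizes, and the `rate_requests` table are assumptions for this example, not ShipHawk's actual implementation. The idea is the same on any stack: cap and reuse database connections instead of opening one per request or per newly added server.

```javascript
// Illustrative sketch only: application-side connection pooling with the
// node-postgres (`pg`) library. Pool sizes and the `rate_requests` table are
// assumptions, not a description of ShipHawk's actual stack.
const { Pool } = require('pg');

// Without pooling, each request (or each newly added server process) opens its
// own connections, so scaling out multiplies connections until the database
// starts rejecting them. A shared pool caps connections and queues callers.
const pool = new Pool({
  max: 20,                        // hard cap on open connections per process
  idleTimeoutMillis: 30_000,      // close connections idle for 30 s
  connectionTimeoutMillis: 5_000, // fail fast rather than piling up waiters
});

async function countRecentRateRequests(since) {
  // pool.query() borrows a connection, runs the statement, and returns the
  // connection to the pool automatically.
  const { rows } = await pool.query(
    'SELECT count(*) AS n FROM rate_requests WHERE created_at >= $1',
    [since]
  );
  return Number(rows[0].n);
}
```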
Status: Postmortem
Impact: Minor | Started At: Nov. 29, 2021, 2:50 p.m.
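The 7:30 am timeline entry above describes reconfiguring the API load balancer so that slow-but-working instances were no longer drained from the pool, overloading the remaining healthy ones. The postmortem does not say how this was done; as an illustration only, the sketch below relaxes health-check thresholds on an AWS ALB target group via the AWS SDK. The use of an ALB target group and the specific threshold values are assumptions, not ShipHawk's actual configuration.

```javascript
// Illustrative sketch only: relaxing ALB target-group health checks so that
// briefly slow instances are not removed from the pool, which is what fed the
// cascade described above. Target-group setup and values are assumptions.
const {
  ElasticLoadBalancingV2Client,
  ModifyTargetGroupCommand,
} = require('@aws-sdk/client-elastic-load-balancing-v2');

const elb = new ElasticLoadBalancingV2Client({ region: 'us-east-1' });

async function relaxHealthChecks(targetGroupArn) {
  // Give slow instances more time and more failed checks before they are
  // marked unhealthy, so load is not shifted onto the remaining servers.
  await elb.send(new ModifyTargetGroupCommand({
    TargetGroupArn: targetGroupArn,
    HealthCheckTimeoutSeconds: 10,  // allow slower health-check responses
    HealthCheckIntervalSeconds: 30,
    UnhealthyThresholdCount: 5,     // require more consecutive failures
    HealthyThresholdCount: 2,
  }));
}
```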
Description:

## **Incident summary**

ShipHawk NetSuite SuiteApp users ran into an issue with item fulfillments and orders syncing between NetSuite and ShipHawk. When NetSuite users saved an Item Shipment record, the following error was shown: `TypeError: ItemFulfillment.find is not a function`. This was first reported at 9:12 pm PST on Wednesday 11/17 and affected all customers using ShipHawk bundle versions >= 2021.6.0. The issue was caused by a change NetSuite made to the processing of NApiVersion 2.1 scripts, which ShipHawk bundles 2021.6.0+ use, and the incident lasted until NetSuite reverted the change at 11:30 am PST on Sunday 11/21.

## **Leadup**

On 11/17, NetSuite changed how they process scripts with NApiVersion 2.1, without notice, to fix a known and unrelated defect (NetSuite defect #647251). When this happened, ShipHawk SuiteApp bundles 2021.6.x and higher could no longer sync orders or item fulfillments between NetSuite and ShipHawk.

## **Fault**

The ShipHawk bundle could not load its dependencies correctly; therefore, it was not able to call the static functions it requires to work properly, and the code raised an exception: `TypeError: ItemFulfillment.find is not a function [at Object.afterSubmit (/SuiteBundles/Bundle 161164/ShipHawk (2)/event_scripts/shiphawk-update-fulfillment-event-script.js:55:35)]`.

## **Impact**

Orders and Item Fulfillments could not sync between NetSuite and ShipHawk. This incident affected all NetSuite customers using ShipHawk bundles 2021.6.x and 2021.7.x. A total of 12 urgent cases were submitted to ShipHawk during the impact window.

## **Detection**

The incident was first reported at 9:12 pm PST on Wednesday 11/17. More reports were submitted starting at 4:21 am PST on Thursday 11/18.

## **Response**

During this incident, the ShipHawk customer success and engineering teams worked around the clock to keep impacted customers informed, identify the root cause and search for workarounds. ShipHawk and NetSuite engineering resources worked together to identify the issues and move toward a resolution. NetSuite discovered two defects (defects #651122 and #651305), which they ultimately resolved. ShipHawk identified both near-term and long-term options to mitigate this in the future, both of which would have materially delayed resolution. As such, ShipHawk Engineering decided the best path was to collaborate with NetSuite as they reverted the changes introduced on 11/17, because this was determined to be the fastest way to get joint customers operational.

## **Recovery**

Case #4491650 was submitted to NetSuite Support, and as a result, NetSuite created two defects that were escalated to U2 Critical priority: **Defect 651122** `SuiteScript > RESTLet Script > TypeError: Class constructor CounterEntry cannot be invoked without 'new'` and **Defect 651305** `SuiteScript > RESTLet Script > TypeError: Class constructor CounterEntry cannot be invoked without 'new'`. NetSuite ultimately reverted their changes to the processing of NApiVersion 2.1 scripts. After clearing cached files, impacted customers were able to sync orders and item fulfillments between ShipHawk and NetSuite.

## **Timeline**

All times are PST.

**Wednesday 11/17**

* 21:00 - NetSuite introduced a change to the NApiVersion scripts processor in order to fix defect #647251
* 21:12 - ShipHawk customers first reported that Orders and Item Fulfillments were not syncing

**Thursday 11/18**

* 4:21 - multiple customers started reporting the same issue
* 5:00 - the issue was verified by the ShipHawk CS team and passed to the Engineering team
* 6:21 - the incident notification was posted to the ShipHawk Status Page
* 6:23 - ShipHawk Engineering identified that the issue was happening only for customers on the latest bundle versions and that it was related to changes in how NetSuite processes NApiVersion 2.x scripts
* 6:23 - case #4491650 was submitted to NetSuite Support
* 10:23 - the NetSuite team notified us that they had reverted the changes, but some of them were still stuck in the server cache; there was a chance the issue might resolve itself once the partner's cache was flushed
* 13:02 - the ShipHawk Engineering team prepared a new bundle version, 2021.7.1, intended to reset cached files
* 13:30 - bundle 2021.7.1 was successfully tested and then pushed to some customer accounts, where the fix was confirmed
* 20:40 - the same issue was reported again

**Friday 11/19**

* 7:58 - the NetSuite team notified us about a critical defect, 651122: `SuiteScript > RESTLet Script > TypeError: Class constructor CounterEntry cannot be invoked without 'new'`
* 15:17 - defect 651122 was reported as fixed and deployed to all servers
* 22:30 - the ShipHawk team verified that the fix did not work, even after a cache refresh
* 22:31 - NetSuite Support case #4491650 was re-opened
* 15:23 - a new critical defect, 651305, was created in NetSuite: `SuiteScript > RESTLet Script > TypeError: Class constructor CounterEntry cannot be invoked without 'new'`

**Saturday 11/20**

* 23:56 - NetSuite pushed fixes to ShipHawk testing accounts

**Sunday 11/21**

* 10:30 - the NetSuite team confirmed that the fix was pushed to all accounts
* 11:02 - ShipHawk prepared the new bundle version 2021.7.2, intended to reset cached files
* 11:09 - the ShipHawk team verified the fix was working, and the CS team helped customers install the new bundle

## **Root cause identification**

1. Customers were not able to sync orders and Item Fulfillments from NetSuite to ShipHawk.
2. Because NetSuite changed how they process scripts with NApiVersion 2.1.
3. Because the change was deployed by NetSuite without notice, leaving no window to find and resolve such issues before customers were impacted.
4. Because some ShipHawk scripts with Public scope return classes, when they should use Same Account scope instead.
5. Because ShipHawk does not have an alternative order and/or item fulfillment sync process for customers using the SuiteApp.

## **Root cause**

The instability we saw in customer accounts was introduced because some of our SuiteApp scripts with Public scope return classes. After the incident was resolved, NetSuite advised us that this is only supported for Same Account scope. Had we used this alternative scope, it may have mitigated the issue (see the sketch below).

## **Lessons learned**

* NetSuite may deploy changes to SuiteApp developer tools without notice
* NetSuite recommends we change the scope of scripts in the bundle from Public to Same Account
* ShipHawk needs to investigate alternative integration methods
* ShipHawk needs to explore manual workarounds in the event an integration encounters an unplanned breaking change
* ShipHawk needs to explore alternative syncing strategies to further mitigate risk

## **Corrective actions**

* We will prioritize the effort to change the scope of scripts in the bundle from Public to Same Account, per NetSuite's recommendation
* We will explore redundant and alternative syncing strategies to reduce reliance on changes made by integration partners
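NetSuite's recommendation above is to change the bundle's custom module scope from Public to Same Account. As an illustration only, and not ShipHawk's actual bundle code, the sketch below shows what a SuiteScript 2.1 custom module that returns a class might look like with the recommended `@NModuleScope SameAccount` annotation; the module, class, and column names are hypothetical.

```javascript
/**
 * Illustrative sketch only - not ShipHawk's actual bundle code. A SuiteScript
 * 2.1 custom module that returns a class with a static helper and declares the
 * narrower Same Account scope NetSuite recommended instead of Public.
 * Module, class, and column names are hypothetical.
 *
 * @NApiVersion 2.1
 * @NModuleScope SameAccount
 */
define(['N/search'], (search) => {
  class ItemFulfillment {
    // Static lookup used by event scripts. When NetSuite's 11/17 change broke
    // loading of class-returning Public modules, calls like
    // ItemFulfillment.find(...) failed with
    // "TypeError: ItemFulfillment.find is not a function".
    static find(internalId) {
      return search.lookupFields({
        type: search.Type.ITEM_FULFILLMENT,
        id: internalId,
        columns: ['tranid', 'status'],
      });
    }
  }

  // Returning the class itself is the pattern the root cause above refers to;
  // with Same Account scope this pattern is supported.
  return ItemFulfillment;
});
```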
Status: Postmortem
Impact: Major | Started At: Nov. 18, 2021, 2:21 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.