Outage and incident data over the last 30 days for Zencargo.
Outlogger tracks the status of these components for Zencargo:
| Component | Status |
| --- | --- |
| Analytics Dashboards | Active |
| API | Active |
| app.zencargo.com | Active |
| sandbox.zencargo.com | Active |
View the latest incidents for Zencargo and check for official updates:
Description:

# Incident summary

Between Friday, May 15th 2020 at 19:55 and Saturday, May 16th at 17:01 we had 7 incidents on our application, totalling 4 hours and 56 minutes of downtime and affecting 5 users of our application. The incidents were caused by introducing a new background job processing technology which exceeded the memory on our application servers and made them unavailable to our customers (full downtime).

# Impact

5 customers were affected by a total downtime of 4 hours and 56 minutes (mostly during the night). There was no further impact in relation to this incident (no mentions by the team, no social media mentions, no calls to our KAM team). The event was triggered by a change to our background processing infrastructure on Friday, which caused a memory leak and interrupted our service at the following times:

* 19:55 to 20:01 (6min)
* 20:02 to 20:04 (2min)
* 20:07 to 20:31 (24min)
* 23:25 to 00:04 (39min)
* 03:25 to 03:30 (5min)
* 03:32 to 07:04 (3h32min)
* 16:53 to 17:01 (8min)

To put this in context, our total downtime over the last year has been 2 hours. We usually do better, and I am sorry for the impact this had on your business. We're doing everything to have your back in the future.

# Leadup

The change to the background processing infrastructure was part of simplifying our Kubernetes migration by getting rid of cron jobs and moving to [Sidekiq](https://sidekiq.org), which is considered the industry standard for background processing in the Ruby world. A bug in our code caused the same background jobs to run multiple times at once, duplicated both on the same host and simultaneously across two instances. The team started working on the event by first making sure that our application was reachable.

# Detection

We started a triple pairing session to debug the situation by checking our monitoring infrastructure. We tried to find clues about the issue, but because we couldn't log in to the two instances, we couldn't determine the root cause. We assumed it was either:

1. full disk issues due to logs
2. memory leaks
3. a network outage at AWS

We couldn't prove 1, but scheduled a review of the instance in 2 weeks. We couldn't prove 2 either; the last deployment had been 5 hours earlier, so we assumed the issue might not be related to the deployment. That left 3, which is out of our control, and we assumed it might have been a temporary issue. After an hour of investigation without results, we decided to monitor the newly deployed instances. We then saw something odd: a background process with no jobs running was consuming almost 80% of system memory:

```
> ps aux --sort -rss | head -n 2
USER    PID   %CPU %MEM VSZ     RSS     TTY STAT START TIME  COMMAND
webapp  17417 27.8 79.4 7838288 6490376 ?   Sl   11:51 37:28 sidekiq 5.2.7 current [0 of 5 busy]
```

We saw memory consumption of up to 92% of available memory by a single process, and we saw the same job running in parallel both on one machine and across multiple machines.

# Response

We started introducing additional metrics logging which gives us memory reporting for the application even when the application is no longer accessible. We reconfigured the background processes so that they don't run as multiple instances at the same time where it made sense (especially the ocean insights subscription task); a locking sketch is included after the corrective actions below. We also started kicking off a garbage collection pass to win back the memory allocated for a background job once the job finishes.
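To illustrate the garbage-collection step, one way to do it is a Sidekiq server middleware that triggers a GC pass once each job has finished. This is a minimal sketch under that assumption, not Zencargo's actual implementation; the middleware class name is hypothetical:

```ruby
require "sidekiq"

# Sketch only (class name is hypothetical): force a GC pass after each job so
# that memory allocated during the job is reclaimed by the long-running worker.
class GcAfterJobMiddleware
  def call(worker, job, queue)
    yield # run the actual job
  ensure
    GC.start(full_mark: true) # reclaim memory once the job has finished
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add GcAfterJobMiddleware
  end
end
```

Forcing a full GC after every job trades a little CPU for more predictable memory use; for very frequent, short jobs it can make sense to trigger it only every N jobs instead.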
# Recovery

We deployed the fixes mentioned under `Response`. Memory consumption for the background process we monitored has been stable at under 60% for more than 3 hours. We also discovered an inefficient background process that needs to be addressed in an upcoming cooldown cycle.

## Root cause

1. Moving background processes from a short-running process (cron job) to a long-running process (Sidekiq worker) has implications for the memory management of the Ruby process. This led to a memory leak that crashed the application server (whereas previously each cron job ran as a sub-process that freed its memory after the job finished).
2. Previously only one machine (the deployment lead) executed the cron jobs, so even if that machine had run out of memory, the other machine would have been unaffected.
3. The deployment of a major infrastructure piece was done on a Friday mid-day, and the issue surfaced outside business hours.
4. A lack of monitoring and logs for instance health (memory and disk space metrics) made the issue hard to reason about and hard to detect, which led to incorrect assumptions and a misidentification of the issue at hand.

## Backlog check

There was no ticket in our technical debt project about moving the background job infrastructure. The change was mostly triggered by our Kubernetes objective to simplify background job management, without a proper risk assessment or understanding of the existing background jobs. Sidekiq and its impact on the production environment were not well understood by us.

## Recurrence

We haven't seen the same root cause pop up in the past.

## Lessons learned

* The team was responsive and came together to solve the issue without any process being in place, which is a great sign that peer accountability and ownership are lived as values in the team.
* The lack of visibility into instance health and application state in our metrics hurt us, but we're happy that we're transitioning to Kubernetes, because this would have been caught in different ways:
  * background jobs would be managed by Kubernetes independently from the application instances, so a failing background job would not have taken our application down
* We're addressing all our outstanding monitoring shortcomings as part of the Kubernetes transition:
  * Prometheus, Grafana and Kibana stack
  * Istio telemetry (e.g. application metrics, improved log management, distributed tracing)
* PagerDuty for only one person no longer makes sense; we've scaled a lot and we're going to address this to make sure we have no single point of failure.

## Corrective actions

* Train all engineers on the manual auto-scaling rate limit increase that is already in place, so uptime can be restored pragmatically
* Set up PagerDuty devices for more people in the team (plus a rotation)
* Review the Sidekiq implementation and queueing strategy (see the sketch after this list)
* Review our memory leak strategy (monit or similar, to kill processes that take too much memory) until Kubernetes is in place
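As a rough illustration of the duplicate-run guard mentioned under Response and in the Sidekiq review above: a short-lived Redis lock can keep one job from running concurrently on the same or on different machines. This is a sketch only, with hypothetical class, key, and method names, assuming Sidekiq 5/6 with the redis-rb client rather than Zencargo's actual code:

```ruby
require "sidekiq"

# Sketch of a concurrency guard: skip the run if another instance of the same
# job already holds the Redis lock. Class, key and method names are hypothetical.
class OceanInsightsSubscriptionJob
  include Sidekiq::Worker
  sidekiq_options queue: :default, retry: 3

  LOCK_KEY = "locks:ocean_insights_subscription"
  LOCK_TTL = 30 * 60 # seconds; assumed upper bound for a single run

  def perform
    acquired = Sidekiq.redis do |conn|
      # SET NX EX: truthy only if no other run currently holds the lock.
      conn.set(LOCK_KEY, "1", nx: true, ex: LOCK_TTL)
    end
    return unless acquired

    begin
      sync_subscriptions # placeholder for the real work
    ensure
      Sidekiq.redis { |conn| conn.del(LOCK_KEY) }
    end
  end

  private

  def sync_subscriptions
    # Real job logic would go here.
  end
end
```

In practice a gem such as sidekiq-unique-jobs covers the same ground; whatever the mechanism, the lock TTL has to exceed the longest expected run, otherwise a second copy can start while the first is still working.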
Status: Postmortem
Impact: Critical | Started At: May 15, 2020, 7:25 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: March 19, 2020, 6:46 p.m.
Description: This incident has been resolved, functionality has been restored. Thanks for your patience
Status: Resolved
Impact: Minor | Started At: Feb. 17, 2020, 9:32 a.m.