Outage and incident data over the last 30 days for Zencargo.
Outlogger tracks the status of these components for Zencargo:
| Component | Status |
| --- | --- |
| Analytics Dashboards | Active |
| API | Active |
| app.zencargo.com | Active |
| sandbox.zencargo.com | Active |
View the latest incidents for Zencargo and check for official updates:
Description:

# Incident summary

Between Friday, May 15th 2020 at 19:55 and Saturday, May 16th at 17:01 we had 7 incidents on our application, totalling 4 hours and 56 minutes of downtime and affecting 5 users of our application. The incidents were caused by introducing a new background job processing technology which exceeded the memory on our application servers and made them unavailable to our customers (full downtime).

# Impact

5 customers were affected by a total downtime of 4 hours and 56 minutes (mostly during the night). There was no further impact in relation to this incident (no mentions by the team, no social media mentions, no calls to our KAM team). The event was triggered by a change to our background processing infrastructure on Friday, which caused a memory leak and interrupted our service at the following times:

* 19:55 to 20:01 (6min)
* 20:02 to 20:04 (2min)
* 20:07 to 20:31 (24min)
* 23:25 to 00:04 (39min)
* 03:25 to 03:30 (5min)
* 03:32 to 07:04 (3h32min)
* 16:53 to 17:01 (8min)

To put this in context, our total downtime over the last year has been 2 hours. We usually do better, and I am sorry for the impact this had on your business. We're doing everything to have your back in the future.

# Leadup

The change to the background processing infrastructure was part of simplifying our Kubernetes migration by getting rid of cron jobs and moving to [Sidekiq](https://sidekiq.org), which is considered the industry standard for background processing in the Ruby world. A bug in our code caused the same background jobs to run multiple times at once, duplicated both on the same host and simultaneously across two instances. The team started working on the event by first making sure that our application was reachable.

# Detection

We started a triple pairing session to debug the situation by checking our monitoring infrastructure. We tried to find clues about the issue, but because we couldn't log in to the two instances, we couldn't determine the root cause. We assumed it was either:

1. full disk issues due to logs
2. memory leaks
3. a network outage at AWS

We couldn't prove 1, but scheduled a review of the instance in 2 weeks. We couldn't prove 2 either; the last deployment had been 5 hours earlier, so we assumed the issue might not be related to the deployment. That left 3, which is out of our control, and we assumed it might have been a temporary issue. After an hour of investigation without results, we decided to monitor the newly deployed instances. We then saw something odd: a background process with no jobs running was consuming almost 80% of system memory:

```
> ps aux --sort -rss | head -n 2
USER    PID   %CPU %MEM VSZ     RSS     TTY STAT START TIME  COMMAND
webapp  17417 27.8 79.4 7838288 6490376 ?   Sl   11:51 37:28 sidekiq 5.2.7 current [0 of 5 busy]
```

We saw memory consumption of up to 92% of available memory by a single process, and we saw the same job running in parallel both on one machine and across multiple machines.

# Response

We started introducing additional metrics logging which gives us memory reporting for the application even when the application is no longer accessible. We reconfigured the background processes so that they don't run as multiple instances at the same time where it made sense (especially the ocean insights subscription task); a locking sketch is included after the corrective actions below. We also started kicking off a garbage collection pass to win back the memory allocated for a background job once the job finishes.
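To illustrate the garbage-collection step, one way to do it is a Sidekiq server middleware that triggers a GC pass once each job has finished. This is a minimal sketch under that assumption, not Zencargo's actual implementation; the middleware class name is hypothetical:

```ruby
require "sidekiq"

# Sketch only (class name is hypothetical): force a GC pass after each job so
# that memory allocated during the job is reclaimed by the long-running worker.
class GcAfterJobMiddleware
  def call(worker, job, queue)
    yield # run the actual job
  ensure
    GC.start(full_mark: true) # reclaim memory once the job has finished
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add GcAfterJobMiddleware
  end
end
```

Forcing a full GC after every job trades a little CPU for more predictable memory use; for very frequent, short jobs it can make sense to trigger it only every N jobs instead.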
# Recovery

We deployed the fixes mentioned under `Response`. Memory consumption for the background process we monitored has been stable at under 60% for more than 3 hours. We also discovered an inefficient background process that needs to be addressed in an upcoming cooldown cycle.

## Root cause

1. Moving background processes from a short-running process (cron job) to a long-running process (Sidekiq worker) has implications for the memory management of the Ruby process. This led to a memory leak that crashed the application server (whereas previously each cron job ran as a sub-process that freed its memory after the job finished).
2. Previously only one machine (the deployment lead) executed the cron jobs, so even if that machine had run out of memory, the other machine would have been unaffected.
3. The deployment of a major infrastructure piece was done on a Friday mid-day, and the issue surfaced outside business hours.
4. A lack of monitoring and logs for instance health (memory and disk space metrics) made the issue hard to reason about and hard to detect, which led to incorrect assumptions and a misidentification of the issue at hand.

## Backlog check

There was no ticket in our technical debt project about moving the background job infrastructure. The change was mostly triggered by our Kubernetes objective to simplify background job management, without a proper risk assessment or understanding of the existing background jobs. Sidekiq and its impact on the production environment were not well understood by us.

## Recurrence

We haven't seen the same root cause pop up in the past.

## Lessons learned

* The team was responsive and came together to solve the issue without any process being in place, which is a great sign that peer accountability and ownership are lived as values in the team.
* The lack of visibility into instance health and application state in our metrics hurt us, but we're happy that we're transitioning to Kubernetes, because this would have been caught in different ways:
  * background jobs would be managed by Kubernetes independently from the application instances, so a failing background job would not have taken our application down
* We're addressing all our outstanding monitoring shortcomings as part of the Kubernetes transition:
  * Prometheus, Grafana and Kibana stack
  * Istio telemetry (e.g. application metrics, improved log management, distributed tracing)
* PagerDuty for only one person no longer makes sense; we've scaled a lot and we're going to address this to make sure we have no single point of failure.

## Corrective actions

* Train all engineers on the manual auto-scaling rate limit increase that is already in place, so uptime can be restored pragmatically
* Set up PagerDuty devices for more people in the team (plus a rotation)
* Review the Sidekiq implementation and queueing strategy (see the sketch after this list)
* Review our memory leak strategy (monit or similar, to kill processes that take too much memory) until Kubernetes is in place
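As a rough illustration of the duplicate-run guard mentioned under Response and in the Sidekiq review above: a short-lived Redis lock can keep one job from running concurrently on the same or on different machines. This is a sketch only, with hypothetical class, key, and method names, assuming Sidekiq 5/6 with the redis-rb client rather than Zencargo's actual code:

```ruby
require "sidekiq"

# Sketch of a concurrency guard: skip the run if another instance of the same
# job already holds the Redis lock. Class, key and method names are hypothetical.
class OceanInsightsSubscriptionJob
  include Sidekiq::Worker
  sidekiq_options queue: :default, retry: 3

  LOCK_KEY = "locks:ocean_insights_subscription"
  LOCK_TTL = 30 * 60 # seconds; assumed upper bound for a single run

  def perform
    acquired = Sidekiq.redis do |conn|
      # SET NX EX: truthy only if no other run currently holds the lock.
      conn.set(LOCK_KEY, "1", nx: true, ex: LOCK_TTL)
    end
    return unless acquired

    begin
      sync_subscriptions # placeholder for the real work
    ensure
      Sidekiq.redis { |conn| conn.del(LOCK_KEY) }
    end
  end

  private

  def sync_subscriptions
    # Real job logic would go here.
  end
end
```

In practice a gem such as sidekiq-unique-jobs covers the same ground; whatever the mechanism, the lock TTL has to exceed the longest expected run, otherwise a second copy can start while the first is still working.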
Status: Postmortem
Impact: Critical | Started At: May 15, 2020, 7:25 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: March 19, 2020, 6:46 p.m.
Description: This incident has been resolved, functionality has been restored. Thanks for your patience
Status: Resolved
Impact: Minor | Started At: Feb. 17, 2020, 9:32 a.m.