Get notified about any outages, downtime or incidents for Deputy.com and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for Deputy.com.
OutLogger tracks the status of these components for Deputy.com:
Component | Status
---|---
Billing Services | Active |
Deputy Website | Active |
Login Services | Active |
POS Integration | Active |
Sandbox | Active |
Deputy - All regions | Active |
Deputy - AU | Active |
Deputy - UK | Active |
Deputy - USA | Active |
Third Party Components | Active |
HelloSign / HelloFax: HelloSign | Active
HelloSign / HelloFax: HelloSign API | Active
Pusher: Pusher.js CDN | Active
Pusher: Pusher REST API | Active
Pusher: WebSocket client API | Active
Twilio: Incoming SMS | Active
Twilio: Outgoing SMS | Active
Xero Ltd - API: Accounting API | Active
Xero Ltd - API: Payroll API | Active
Zuora: Production API Interface | Active
View the latest incidents for Deputy.com and check for official updates:
Description: This incident has been resolved.
Status: Resolved
Impact: Critical | Started At: Feb. 5, 2021, 12:30 a.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: Feb. 2, 2021, 2:51 a.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: Jan. 30, 2021, 7:20 a.m.
Description: # Details of the Deputy Outages in Australia on 25th January (and 29th March)

###### Deepesh Banerji, SVP Technology

At Deputy, it is our vision to build thriving workplaces in every community. Trust and transparency are key pillars that underpin a thriving workplace, which is why we are sharing with our valued customers some of the challenges we faced during our toughest outage to date (disclaimer: this will get technical!).

On the 25th of January, a system failure caused a full platform outage for Deputy customers in Australia between the hours of 8:39am and 2:30pm (6 hours). This quickly led to an investment in improving the underlying infrastructure of the Deputy platform. Our journey from outage to improvement is detailed below.

## **The events of Monday the 25th January**

January 25th is a unique and busy work day in Australia, especially for our customers. It's the day before a public holiday, it's the end of the month, and this year it fell on a Monday, when many of our customers export timesheets and run payroll simultaneously.

At 8:39am our automated system alerts triggered with _Alert: Heavy Response Times_. At the same time, our customer support team started receiving hundreds of customer chats indicating they had trouble accessing Deputy. The company triggered an incident at this time and updated our accompanying **status** page. Investigations began.

Our software engineering team hadn't released anything new that day, so no new code or infrastructure changes were present. Our code continued to pass all of our automated quality tests. Traffic to the login page had grown naturally through the morning, as expected.

Meanwhile, symptoms were surfacing. Our elastic servers kept adding and scaling more web servers to try to cope with increasing load. Digging one level deeper, our databases were seeing 10x-20x the usual load. Continuing to dig, Redis, our in-memory cache, which is normally used to drive high performance, was seeing an abnormally high amount of utilisation. It was at this point that we confirmed that Redis was the single point of failure: our scalable databases and elastic web servers were all waiting on our one Redis storage unit, resulting in a cascading failure.

_Traffic Pattern Morning of 25th. Traffic looked like normal patterns._
![](https://a.storyblok.com/f/64010/469x147/64703b2cf5/deputy-status-3.png)

_Behind the scenes, each web server started seeing significant over-utilisation (requests per instance)_
![](https://a.storyblok.com/f/64010/475x145/6ffacaa7d1/deputy-status-2.png)

_Meanwhile, databases were seeing significant load (connections per database)_
![](https://a.storyblok.com/f/64010/475x163/acbce081a8/deputy-status-1.png)

_**The root cause**: Redis CPU started showing signs of over-utilisation after 8:30am, which in turn caused database and web server utilisation to hit an unsustainable peak (utilisation % of Redis)_
![](https://a.storyblok.com/f/64010/475x131/d4770c898c/deputy-status-4.png)

By midday, we had provisioned a new version of Redis and effectively restarted all of our processes and systems, and by 2:30pm Deputy was again accessible to our customers. However, the Redis risk remained, lurking - provisioning a new version was a patch fix. In fact, it came back in a smaller, more controlled way on a few other occasions over the next few weeks.
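The cascading pattern described above, in which elastic web servers and databases all end up queueing behind one saturated cache, can be made concrete with a small sketch. The snippet below is illustrative only and is not Deputy's code; the hostname, timeout values and the `fetch_from_database` helper are placeholders. It shows how short socket timeouts on a redis-py client let a degraded cache fail fast, with a fallback read, instead of stalling every web worker behind it.

```python
import redis

# Hypothetical single shared cache endpoint (placeholder hostname).
# Short socket timeouts make a saturated cache fail fast instead of
# stalling every web worker that is waiting on it.
cache = redis.Redis(
    host="cache.example.internal",
    port=6379,
    socket_connect_timeout=0.25,  # seconds
    socket_timeout=0.25,
)


def fetch_from_database(key: str) -> str:
    """Placeholder for the authoritative (slower) database read."""
    return f"db-value-for-{key}"


def get_with_fallback(key: str) -> str:
    """Serve from cache when it is healthy; fall back to the database
    when the cache is slow or unreachable."""
    try:
        value = cache.get(key)
        if value is not None:
            return value.decode()
    except redis.exceptions.RedisError:
        # Cache timed out or errored: fall through rather than queueing
        # behind a saturated Redis instance.
        pass
    value = fetch_from_database(key)
    try:
        cache.set(key, value, ex=300)  # best-effort re-populate, 5-minute TTL
    except redis.exceptions.RedisError:
        pass
    return value
```

Note that an unconditional database fallback can simply move the overload onto the database tier, which is the same cascading load pattern shown in the charts above; that is one reason circuit breakers appear in the follow-up work listed later in this write-up.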
## **So What Happened With Redis?**

Deputy was using a non-scaling implementation of Redis across an entire region (i.e. Australia) as a caching solution. As our customer base has grown, this created a single point of failure, with workloads becoming heavy and concentrated. We had outgrown our existing Redis architecture, and it quickly became apparent that it was time to implement a more scalable solution.

To make an analogy, we had 1 cash register for a very, very busy supermarket. Even as the supermarket got to peak capacity, we still had 1 cash register. In the new architecture, we have unlimited cash registers!

Our team set to work, consulting with our AWS enterprise architecture team and working nights and weekends, to develop a scalable, distributed Redis. In this new architecture, our infrastructure now has 10x the Redis clusters to effectively spread and orchestrate workloads, and we continue to add new clusters as our customer base grows. In short, our infrastructure now reflects our requirements for today's customers and future-proofs us for our growth ambitions.

## **29th March, The False Start**

On Monday the 29th, we released the new scalable and distributed Redis to all customers, with the intent to resolve these issues once and for all. However, as irony would have it, this inadvertently led to another outage on the 29th of March, due to how the new system was tuned, which was quickly resolved.

## **19th April, All Systems Go!**

The outage on the 29th was a growing pain and a speed bump on the way to the full working solution that is in production now, handling usage elegantly and with ease! This incident was a key catalyst in the constant journey we've been on to improve system resilience and to systematically remove any single points of failure that may exist as our customer utilisation expands.

## **What Else Has Happened to Improve Our Uptime and Customer Experience?**

1. Redis has been reworked and re-architected
2. Increased monitoring, alerts and logs have been introduced in the application
3. Circuit breakers have been implemented to reduce the likelihood of cascading failures
4. Elastic computing scaling rules have been adjusted to better handle scale-up when required

## **Conclusion**

We understand this was an upsetting outage for our customers, especially on a payroll day before a public holiday. We responded quickly to correct the situation, and have systematically dealt with Redis scalability as the root cause. Thank you for your patience and understanding. We do not take for granted the trust you have placed in Deputy. We will continue on our journey to make Deputy highly available and your trusted partner!
Status: Postmortem
Impact: Critical | Started At: Jan. 24, 2021, 10:19 p.m.
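The re-architected, sharded Redis and the circuit breakers listed in the postmortem above map naturally onto client-side key sharding with per-shard failure tracking. The sketch below is a minimal, hypothetical illustration rather than Deputy's implementation: the class name, shard hostnames, thresholds and cooldowns are placeholders. Keys are hashed to pick a shard, and a shard that keeps failing is skipped for a cooldown period so one unhealthy cluster cannot drag down the rest.

```python
import hashlib
import time

import redis


class ShardedCache:
    """Illustrative key-sharded Redis client with a simple per-shard
    circuit breaker. Hostnames, thresholds and cooldowns are placeholders."""

    def __init__(self, shard_hosts, failure_threshold=5, cooldown_seconds=30):
        self.shards = [
            redis.Redis(host=host, port=6379, socket_timeout=0.25)
            for host in shard_hosts
        ]
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        # Per-shard breaker state: consecutive failures, and when the
        # breaker was opened (None means closed / healthy).
        self.failures = [0] * len(self.shards)
        self.opened_at = [None] * len(self.shards)

    def _shard_index(self, key: str) -> int:
        # Stable hash so the same key always lands on the same shard.
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % len(self.shards)

    def _breaker_open(self, i: int) -> bool:
        if self.opened_at[i] is None:
            return False
        if time.monotonic() - self.opened_at[i] >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and allow a trial request.
            self.opened_at[i] = None
            self.failures[i] = 0
            return False
        return True

    def _record_failure(self, i: int) -> None:
        self.failures[i] += 1
        if self.failures[i] >= self.failure_threshold:
            self.opened_at[i] = time.monotonic()

    def get(self, key: str):
        i = self._shard_index(key)
        if self._breaker_open(i):
            # Treat an unhealthy shard as a cache miss instead of piling
            # more requests onto it.
            return None
        try:
            value = self.shards[i].get(key)
            self.failures[i] = 0
            return value
        except redis.exceptions.RedisError:
            self._record_failure(i)
            return None


# Usage sketch with placeholder shard hostnames:
# cache = ShardedCache(["cache-1.internal", "cache-2.internal", "cache-3.internal"])
# value = cache.get("timesheet:12345")
```

A managed offering such as Redis Cluster handles key routing and failover itself; the point of the sketch is only that spreading keys across several clusters removes the single "cash register", and the breaker keeps one unhealthy shard from cascading into the rest of the platform.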
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: Jan. 7, 2021, 2:26 a.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.