Last checked: 8 minutes ago
Get notified about any outages, downtime, or incidents for Avochato and 1,800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for Avochato.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now

OutLogger tracks the status of these components for Avochato:
| Component | Status |
| --- | --- |
| API | Active |
| avochato.com | Active |
| Mobile | Active |
View the latest incidents for Avochato and check for official updates:
Description:

## What happened

High concurrent outbound message volume caused our production write database to run out of connections. Most queued processes took an extremely long time to finish, and page loads timed out for many users who tried to access the platform during the impact period.

## Impact

Pending messages, inbound messages, and broadcasts during this period may have remained queued, but were not dropped. Inbound calls to Avochato numbers during this period were often unable to connect or be forwarded properly. Upon resolution, inbound messages and queued work retried themselves and, in most identifiable cases, were received properly.

## Resolution

Our database automatically failed over to a read replica and was able to resume serving requests; however, we are investigating ways for this failover to happen sooner to prevent longer periods of inaccessibility. Our engineers identified the root cause, relating to message callback method prioritization, and we patched our production application servers with a fix for the root cause as well as new safeguards to prevent excess resource consumption during periods of extreme load. We are evaluating solutions to make our infrastructure more resilient while continuing to offer a best-in-class live inbox experience for customers of all sizes. As a team, we have committed to aggressively monitoring our platform's health and proactively deploying updates for bottlenecks detected in our current application. We appreciate the trust you place in our platform for communicating with those who matter most to you, and thank you for your patience during this busy time of the year.

Thank you for choosing Avochato,
Christopher Neale, CTO and Co-founder
Status: Postmortem
Impact: Major | Started At: Nov. 24, 2020, 10:41 p.m.
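The root cause above is write-database connection exhaustion under load. As a purely illustrative sketch (not Avochato's code; the `GuardedConnections` class, the `connect` factory, and all limits are hypothetical), a service can cap concurrent database checkouts and fail fast when saturated, so queued work is deferred and retried rather than timing out page loads:

```python
import threading
from contextlib import contextmanager

class PoolExhaustedError(RuntimeError):
    """Raised when no database connection slot is available within the timeout."""

class GuardedConnections:
    """Caps concurrent database connections and sheds load instead of queueing forever.

    Hypothetical sketch: `connect` is any DB-API style factory (e.g. psycopg2.connect).
    """

    def __init__(self, connect, max_conns=20, acquire_timeout=2.0):
        self._slots = threading.Semaphore(max_conns)  # limits concurrent checkouts
        self._connect = connect
        self._timeout = acquire_timeout

    @contextmanager
    def connection(self):
        # Fail fast when saturated so callers can re-enqueue the work and retry,
        # instead of piling up and timing out page loads as described above.
        if not self._slots.acquire(timeout=self._timeout):
            raise PoolExhaustedError("write database saturated; deferring work")
        conn = self._connect()
        try:
            yield conn
        finally:
            conn.close()
            self._slots.release()
```

A caller that catches `PoolExhaustedError` can re-enqueue the message job for a later retry, which mirrors the behaviour described above where queued work retried itself once the database recovered.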
Description:

## What happened

Our East Coast cloud infrastructure was routing requests to West Coast databases, sometimes making multiple trips for a single request. This caused delays for customers whose DNS automatically routed them to the East Coast, as well as for network requests from API servers in the East Coast region. Messages and application load times were delayed for customers closer to the East Coast region than the West Coast.

## Resolution

We altered the threshold for sending traffic to the East Coast data center. We have rolled back the networking changes to East Coast infrastructure, and systems have returned to normal.
Status: Postmortem
Impact: Minor | Started At: Nov. 23, 2020, 5:20 p.m.
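The fix described above was a routing-threshold change. As a minimal sketch of the idea only (nothing here reflects Avochato's actual infrastructure; the region names, latency figures, and `choose_region` helper are made up for illustration), a request can be kept in the region co-located with the primary database whenever the nearer edge would have to make slow cross-country database round trips:

```python
# Hypothetical region-selection helper: serve from the client's nearest region only
# when that region's measured round trip to the primary database stays under a threshold.
PRIMARY_REGION = "us-west"

def choose_region(client_region: str, db_latency_ms: dict[str, float],
                  threshold_ms: float = 40.0) -> str:
    """Return the region that should serve the request."""
    latency = db_latency_ms.get(client_region, float("inf"))
    # If the regional edge would make slow cross-country database trips (possibly
    # several per request), keep the request co-located with the database instead.
    return client_region if latency <= threshold_ms else PRIMARY_REGION

# Example: an East Coast edge with 70 ms trips to the West Coast database
# falls back to serving the request from us-west directly.
assert choose_region("us-east", {"us-east": 70.0, "us-west": 5.0}) == "us-west"
```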
Description:

**What Happened**

Starting in the afternoon, routine Conversation Management automation within the Avochato Platform began running a disproportionately large body of background work on the default priority queue. This was ultimately due to a combination of account-specific settings, infrastructure constraints, and the timing of load across the Avochato platform, and it led to an exponentially growing number of concurrent background jobs competing for all platform resources. The Avochato platform suffered growing latency in a series of waves, a short maintenance window of hard downtime, and another wave of latency as we addressed the root cause of the issue. All Avochato services were impacted.

**Ultimately, fixing the issue required putting the platform into maintenance mode while replacing hardware used in our cloud services.** To clarify, _this was not a planned or routine maintenance window, but the user experience was the same: app users would see a maintenance page (or an error page for some users) and be unable to access the inbox. This was done in the interest of time, and the process will be revised by the engineering team in the future._

During this period it was not clear where the runaway automation described above came from, but it caused the Avochato Platform to queue a new type of asynchronous job designed to push data to websockets. Because jobs and websockets use the same hardware, the influx consumed essentially 100% of memory: jobs that could not find an available websocket could not complete, and more and more jobs of that kind piled up waiting to publish to a websocket. The source of this issue relates to a platform upgrade deployed in previous weeks to reduce the turnaround time for users to send messages and receive notifications. While this worked functionally for our customer base, it moved the burden to a different part of the architecture in a way that scaled disproportionately under specific circumstances, and without proper limits on concurrent throughput. As a result, our platform was unable to process additional web requests (meaning high page load times) and queued a massive excess of background jobs in a short period (meaning delays in messages and a lack of real-time notifications, inbox updates, and so on). Additionally, the latency and eventual outage left our team unable to respond to many customers who reached out during the impacted period with the timeliness they have come to expect.

The Engineering team prepared and deployed a migration to move these new jobs from the default priority queue into a new lower-priority queue to constrain their impact. The patch was deployed via our usual high-availability deployment process, which takes one-third of our application servers offline at a time and reduces platform capacity during the deploy. In order to handle the overall volume of queued work and return to normal, Engineering also took emergency steps to replace the cloud computing instance storing the jobs with one twice its size, which could not be done without postponing the queued work while we switched the infrastructure. All efforts were made to avoid dropping background jobs, though ultimately not all jobs could be saved.

These emergency steps (during which Avochato switched into maintenance mode in order to purge the system of busy processes) led to a short period of hard downtime and the loss of queued jobs, including processing contact CSV uploads, creating broadcast audiences, sending messages, and displaying notifications. Once the necessary hardware was replaced, the root source of the resource-intensive automation continued to create excess jobs; however, engineers now had the headroom to reduce the noise, identify the source, and design a final resolution that treated the cause instead of the symptom. Another migration was prepared to make it easy for admins to turn off functionality for specific sources of automation. Once it was deployed, systems administrators were able to eliminate the source of resource-intensive automation once and for all, and new safeguards were put in place for taking expedient, atomic actions in the future without requiring hardware or software deployments. This returned our systems to normal as of yesterday evening.

**Next Steps**

Engineering has drafted and is prioritizing a series of TODOs around infrastructure points of failure, is implementing in-app indicators for when the system is under similar stress, and is working closely with any impacted accounts that ended up in a bad state due to the actions taken during this period. Infrastructure planning has been prioritized to reduce the burden on specific parts of our architecture and to prevent any single piece of infrastructure from bearing the multiple responsibilities that led to this failure. We are continuing to monitor platform latency and take proactive steps to keep unforeseen combinations of Avochato automation from ever impacting the core inbox experience.

We understand the level of trust you place in the Avochato Platform to communicate with those most important to you. On behalf of our team, thank you for your patience, and thank you for choosing Avochato,
Christopher Neale, CTO and co-founder
Status: Postmortem
Impact: Critical | Started At: Nov. 19, 2020, 10:46 p.m.
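The remediation above moved the websocket-push jobs off the default priority queue and capped their throughput. As a hedged illustration only (Avochato's job system is not public; Celery, the `realtime.push_to_websocket` task name, the broker URL, and the queue names below are stand-ins), the same pattern looks like this in a Python worker setup:

```python
# Illustrative only: route real-time websocket fan-out jobs to a dedicated
# low-priority queue so a surge of them cannot starve the default queue.
from celery import Celery

app = Celery("avochato_sketch", broker="redis://localhost:6379/0")  # hypothetical broker

app.conf.task_routes = {
    # Hypothetical task name for the websocket push jobs described above.
    "realtime.push_to_websocket": {"queue": "realtime_low"},
}

@app.task(name="realtime.push_to_websocket", acks_late=True)
def push_to_websocket(channel: str, payload: dict) -> None:
    """Placeholder body; a real task would publish `payload` to a socket broker here."""
    ...

# Run separate workers so the low-priority queue gets a hard concurrency cap and
# runaway automation is throttled instead of consuming all shared memory:
#   celery -A avochato_sketch worker -Q celery --concurrency=16
#   celery -A avochato_sketch worker -Q realtime_low --concurrency=4
```

Routing alone does not limit throughput; the concurrency caps on the dedicated workers are what keep a flood of low-priority jobs from competing with message delivery for shared resources.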
Description:

**What Happened**

An upgrade to the client library prevented calls from being initialized inside the context of our mobile applications. This was unfortunately not detected by our QA process and resulted in a regression for app users regardless of mobile app version.

**Actions Taken**

We have patched the initialization of the library in all clients, and we are re-evaluating the team's ability to QA mobile applications against staging environments.
Status: Postmortem
Impact: Major | Started At: Nov. 18, 2020, 5:40 p.m.
Description:

## What happened

A large spike in network requests, combined with a backlog of automated usage, led to the Avochato platform queueing HTTP requests for longer than average. The callbacks resulting from the spike in usage created a large backlog of work for our servers, causing page load times to spike and delays in sending messages. Subsequently, the load balancer for our platform ran out of available connections for HTTP requests as websocket escalations piled up while users refreshed their browsers during the period of degraded performance. This created a negative feedback loop of longer delays to process requests and connect to live updates, which in turn kept live updates for inboxes and conversations intermittent and caused HTTP requests to be dropped.

## Action items

Specific bottlenecks in our platform infrastructure's ability to broker websockets have been identified and fixes for them implemented. Additional updates to our asynchronous architecture are being planned and prioritized to prevent a similar incident in the future.
Status: Postmortem
Impact: Major | Started At: Oct. 28, 2020, 4:20 p.m.
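The feedback loop described above (refreshing browsers exhausting load-balancer connections while websocket upgrades pile up) is commonly broken with backpressure. A minimal, hypothetical sketch (the `MAX_SOCKETS` limit, `handle_upgrade`, and `serve_socket` are illustrative names, not Avochato's API): admit upgrades only while capacity remains and shed the rest with a fast retry hint instead of holding connections open.

```python
import asyncio

# Hypothetical backpressure gate, not Avochato's implementation.
MAX_SOCKETS = 10_000
_socket_slots = asyncio.Semaphore(MAX_SOCKETS)

async def handle_upgrade(serve_socket) -> str:
    """Admit the websocket upgrade if capacity remains; otherwise shed it immediately."""
    if _socket_slots.locked():
        # A quick 503 with a Retry-After hint avoids the feedback loop described
        # above, where refreshing browsers exhaust the load balancer's connections.
        return "503 Retry-After: 5"
    async with _socket_slots:
        await serve_socket()  # hold a slot only while the socket is being served
        return "200"
```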