Outage and incident data over the last 30 days for PandaDoc.
Outlogger tracks the status of these components for PandaDoc:
Component | Status |
---|---|
EU | Active |
API | Active |
Creating and editing documents | Active |
CRMs & Integrations | Active |
Mobile application | Active |
Public (recipient) view | Active |
Sending and opening documents | Active |
Signup | Active |
Uploading and downloading documents | Active |
Web application | Active |
Webhooks | Active |
Website | Active |
US & Global | Active |
API | Active |
Creating and editing documents | Active |
CRMs & Integrations | Active |
Mobile application | Active |
Public (recipient) view | Active |
Sending and opening documents | Active |
Signup | Active |
Uploading and downloading documents | Active |
Web application | Active |
Webhooks | Active |
Website | Active |
View the latest incidents for PandaDoc and check for official updates:
Description:

## A summary of what happened

At **14:01 PDT on Friday, April 7th**, our monitoring indicated that our Public API request rate had dropped and health checks were failing. The situation deteriorated rapidly: some of our API endpoints became unresponsive, which impacted the availability of the PandaDoc platform. We followed our protocol, immediately started our incident response procedure, rolled back recent updates, and put engineers on multiple investigation paths. After dismissing some initial theories, we understood that the issue was at the infrastructure level and began investigating it together with our cloud provider (AWS).

After a deep investigation that lasted several hours, we traced the issue to network problems: several pods on a specific Kubernetes node were experiencing intermittent low-level network issues that caused connection leaks (connections were repeatedly opened without being closed, or only partially closed). This eventually led to increased latency and memory consumption and caused some of our core services to enter a chain of crashes. As a consequence, the application and API were not available during the downtime. Once the root cause was identified, the broken machine was removed from the cluster and the system resumed normal operation. The issue was fully resolved by **01:23 PDT, April 8**.

## A deep dive - how we investigated the root cause

When the incident started, we noticed a spike in the number of connections in our database pool, with many API calls waiting for connections to be released before they could process incoming requests. We quickly determined that connections were not being released because a large number of uncommitted transactions were sitting idle. We then analyzed database locks and deadlocks, since those are the usual causes of this behavior, and wrote a hotfix for one of our API endpoints to reduce the number of processed events, expecting this would release connections faster.

Soon after, we understood that the database was not the bottleneck, although stalled transactions were still growing and connections in the pool were being taken and not released. A deeper analysis of API endpoint metrics revealed that external calls made inside transactions could be the culprit. Further investigation showed a common trait among the unresponsive API calls: they all interacted with our message queue (a RabbitMQ HA cluster). The RabbitMQ cluster had been running without disruption for the previous 1.5 years, and monitoring showed nothing suspicious. It did not seem like a likely cause, since queues process messages independently and asynchronously (which is why they are used to offload tasks for later execution), but we still decided to look into it more closely.

After analyzing the machines in the cluster and connecting to them directly, we saw that they were periodically shutting down and reloading, although this was not visible in the cluster monitoring on our Grafana dashboards, nor did we receive any alerts. Because the message queue was unresponsive, API calls sat waiting for a broker connection, which kept their database transactions open, which increased the number of blocked connections in the database connection pool, which in turn left other API requests waiting indefinitely for a new connection: a loop that caused a chain of failures.
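The failure mode described in the deep dive (a synchronous broker call held open inside a database transaction) can be illustrated with a minimal sketch. The libraries (psycopg2, pika), pool size, table, and queue names below are assumptions for illustration only, not PandaDoc's actual code:

```python
# Minimal sketch of the anti-pattern described above: publishing to
# RabbitMQ *inside* an open database transaction. If the broker stalls,
# the transaction stays uncommitted and the pooled connection is never
# returned, starving other requests. Names and libraries are illustrative
# assumptions, not PandaDoc's actual code.

import psycopg2.pool
import pika

db_pool = psycopg2.pool.SimpleConnectionPool(
    minconn=1, maxconn=20, dsn="dbname=app user=app"
)

def handle_request(document_id: int) -> None:
    conn = db_pool.getconn()          # one of only 20 pooled connections
    try:
        with conn:                    # opens a transaction, commits on exit
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE documents SET status = 'sent' WHERE id = %s",
                    (document_id,),
                )
                # External call while the transaction is still open.
                # If RabbitMQ is unresponsive, this call hangs, the UPDATE
                # stays uncommitted, and the pool slot is held the whole time.
                mq = pika.BlockingConnection(
                    pika.ConnectionParameters("rabbitmq.internal")
                )
                channel = mq.channel()
                channel.basic_publish(
                    exchange="",
                    routing_key="document.sent",
                    body=str(document_id),
                )
                mq.close()
    finally:
        # Only runs after the publish returns, so the connection is not
        # released until the broker call completes or fails.
        db_pool.putconn(conn)
```

With a pool of 20 connections, only a handful of hung broker calls are enough to exhaust the pool and leave every other API request waiting, which matches the chain of failures described above.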
We immediately started addressing the situation by scaling the cluster up vertically, upgrading the machines it runs on with more processing power and networking capacity. After the upgrade, we added additional monitoring metrics to the cluster. In parallel, we pursued an investigation into the probable root cause: intermittent networking issues on a Kubernetes node that were causing pods on that node to repeatedly open connections without closing them. We dug deeper and concluded that an underlying networking issue was the most probable root cause after we observed and correlated several facts:

* Metrics were randomly missing from our Prometheus monitoring across several systems, coinciding in time with the degradation of the RabbitMQ cluster metrics (the number of sockets started growing linearly).
* All pods on one particular Kubernetes node (added to the cluster on Friday morning) were having trouble connecting to other parts of the system (our NATS cluster). We also noticed error patterns in logs related to closed network connections and client timeouts, in numbers higher than normal. At the same time, the number of slow NATS consumers had been growing abnormally since the start of the incident.
* Most of the connections to the RabbitMQ nodes during the incident period were coming from pods residing on the faulty node.

Once the broken machine was removed from the cluster, the system resumed normal operation. To sum up: we consider the main cause of the incident to be a problem with an AWS EC2 instance provisioned as part of our EKS (managed Kubernetes) cluster during the normal process of a release. Network-related errors on that instance caused a number of connection issues on the RabbitMQ cluster, leading to a chain failure.

## What we have done and will be doing next

As our investigation wraps up, we want to highlight our continuous improvement mindset and provide clarity on what we are doing to improve our systems:

* We have improved the robustness and scale of our RabbitMQ cluster to reduce the likelihood of failure when the number of network connections grows, and reviewed the HA RabbitMQ setup and its replication settings.
* We have added additional logging and metrics to our RabbitMQ cluster, as well as early-detection alarms for any deviation in the cluster's network traffic patterns.
* We have engaged AWS in the investigation and resolution of this outage; AWS support is running its own investigation into the issue.
* We will make further improvements to our observability stack, reviewing which additional metrics we can add to improve the detection of underlying problems in AWS-managed services (e.g. EKS), reduce alerting noise, and ensure certain alerts are highlighted (RabbitMQ / failing pods).
* As an additional step to prevent this in the future, we are planning to review all the external calls in our API handlers and move them to a transactional outbox, so that transactions are not blocked when external services become unavailable (see the sketch after this list).
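A minimal sketch of the transactional outbox approach mentioned in the last point, assuming a relational `outbox` table and a separate relay process. The schema, names, and libraries (psycopg2, pika) are illustrative assumptions rather than PandaDoc's implementation:

```python
# Transactional outbox sketch, assuming an `outbox` table with columns
# (id, routing_key, payload, sent_at). Schema, names, and libraries are
# illustrative assumptions, not PandaDoc's implementation.

import psycopg2
import pika

def handle_request(conn, document_id: int) -> None:
    """Write the business change and the event in ONE local transaction.
    No external call is made while the transaction is open, so a broker
    outage can no longer exhaust the database connection pool."""
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE documents SET status = 'sent' WHERE id = %s",
                (document_id,),
            )
            cur.execute(
                "INSERT INTO outbox (routing_key, payload) VALUES (%s, %s)",
                ("document.sent", str(document_id)),
            )

def relay_once(conn, channel) -> None:
    """Separate relay process: publish pending outbox rows, then mark them
    as sent. Broker slowness only delays delivery here; it never blocks
    the API request path."""
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, routing_key, payload FROM outbox "
                "WHERE sent_at IS NULL ORDER BY id LIMIT 100 "
                "FOR UPDATE SKIP LOCKED"
            )
            for row_id, routing_key, payload in cur.fetchall():
                channel.basic_publish(
                    exchange="", routing_key=routing_key, body=payload
                )
                cur.execute(
                    "UPDATE outbox SET sent_at = now() WHERE id = %s",
                    (row_id,),
                )
```

The trade-off is at-least-once delivery (the relay may republish a row if it crashes between publishing and marking it sent), so consumers need to be idempotent, but an unavailable broker can no longer hold API transactions open.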
Status: Postmortem
Impact: Critical | Started At: April 7, 2023, 9:01 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Critical | Started At: March 31, 2023, 4:08 p.m.
Description: We're all set! If you continue to experience any issues with this, please reach out to us at [email protected].
Status: Resolved
Impact: Major | Started At: March 20, 2023, 5:15 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: March 16, 2023, 12:10 a.m.
Description: We're all set! Users can now create documents via API. If you continue to experience any issues with this, please reach out to us at [email protected].
Status: Resolved
Impact: Major | Started At: Feb. 13, 2023, 5:23 p.m.