Outage and incident data over the last 30 days for Qwilr.
OutLogger tracks the status of these components for Qwilr:
| Component | Status |
| --- | --- |
| Blueprint API | Active |
| Dashboard search | Active |
| Published Qwilr Pages | Active |
| Qwilr App | Active |
| Qwilr Website | Active |
View the latest incidents for Qwilr and check for official updates:
Description: Everything has been stable for the last day and we're not expecting any further problems now that the root cause has been fixed. If you do see any issues with your Qwilr account, reach out to [email protected].
Status: Resolved
Impact: None | Started At: June 11, 2019, 3:32 a.m.
Description: On Tuesday June 11th at approximately 10:30am AEST, Qwilr experienced serious issues with site reliability, and many users saw failures both in using the application and in delivering content to their customers. Qwilr's engineering team investigated and observed CPU spikes on some of our webserver instances, but nothing that should cause the 502 and 504 errors customers reported. Eventually we observed that some of our Node.js Docker Pods (we run in Kubernetes) were hitting 100% CPU, and further investigation showed that these processes were taking up to 30 minutes to process a single request. The cause turned out to be a very large payload sent to our API: code that was designed to run fast for small payloads did not handle this large one, filling the memory allocated to the Pod and driving its CPU to 100%. Compounding this, as a consequence of our recent infrastructure move from Rackspace to AWS, our Kubernetes Pods lacked readiness checks that would stop traffic being routed to them while unresponsive, so requests to these Pods timed out and returned 502s or 504s. By 6pm AEST on the 11th we had deployed a code fix that resolves the root cause and lets these Pods process such a large payload in roughly one tenth of the time, and we set up a readiness check to make the system more robust. We are also working with our API customers to agree on a sensible limit for payload sizes. As a result of this work we are confident that our system has been made more stable and resilient for the future.
Status: Postmortem
Impact: Major | Started At: June 11, 2019, 12:46 a.m.
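The postmortem above describes two mitigations: capping API payload sizes and adding a readiness check so Kubernetes stops routing traffic to unresponsive Pods. Qwilr has not published its implementation, so the following is only a minimal sketch of what those guards can look like in a Node.js service using Express; the framework choice, endpoint names, 1 MB limit, and port are all assumptions for illustration.

```typescript
import express from "express";

const app = express();

// Cap JSON request bodies so a single oversized payload cannot tie up a
// worker for minutes; body-parser rejects anything larger with HTTP 413.
// The 1 MB figure is illustrative, not Qwilr's actual limit.
app.use(express.json({ limit: "1mb" }));

// Readiness endpoint for a Kubernetes readinessProbe to poll. Traffic is
// only routed to the Pod while this responds promptly with 200.
app.get("/readyz", (_req, res) => {
  res.status(200).send("ok");
});

// Placeholder route standing in for the real API endpoints.
app.post("/api/pages", (_req, res) => {
  res.status(202).json({ received: true });
});

// Surface the payload-limit rejection explicitly rather than as a generic error.
app.use(
  (err: any, _req: express.Request, res: express.Response, next: express.NextFunction) => {
    if (err && err.type === "entity.too.large") {
      res.status(413).json({ error: "Request payload exceeds the allowed size" });
      return;
    }
    next(err);
  }
);

app.listen(3000);
```

On the Kubernetes side, the Deployment's readinessProbe would point an httpGet check at a path like /readyz, so a Pod pinned at 100% CPU stops receiving traffic instead of returning 502s and 504s.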
Description: The server issues have been resolved.
Status: Resolved
Impact: None | Started At: June 10, 2019, 4 a.m.
Description: Service has been restored.
Status: Resolved
Impact: None | Started At: June 10, 2019, 3:43 a.m.