Get notified about any outages, downtime, or incidents for Field Nation and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for Field Nation.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Field Nation:
| Component | Status |
|---|---|
| API | Active |
| Marketing Website | Active |
| Mobile App | Active |
| Out of the box Integration connectors | Active |
| Web App | Active |
View the latest incidents for Field Nation and check for official updates:
Description: This incident has been resolved.
Status: Resolved
Impact: None | Started At: Jan. 17, 2023, 9:58 p.m.
Description:

#### Summary

A failure of a central message queueing service, responsible for delivering event messages across the application services that make up our system, caused a widespread outage of the Field Nation platform. The outage lasted a significant amount of time: 3.5 hours. While our team believed they had mitigated the extent of the impact, they learned well into the incident that the mitigation was less effective than initially believed. We're sincerely sorry for the disruption and its duration. We understand how important Field Nation is to our service providers and buyers in getting work done, and we work hard to ensure technical issues don't get in the way of that. The incident was caused by a single degraded node. We are implementing additional monitors to more quickly identify such a problem in the future, and we are also improving our internal operating documentation to give better guidance for much quicker resolution of this kind of system failure.

#### What Happened

On Jan 9th at 09:17 we were alerted to a health issue with one of the three nodes that make up our central message queueing service. The alert indicated that the node did not reply to an automated health check and was likely down. This service enables the application services that make up our platform to be aware of events occurring within the platform and to queue work for background processing. Upon investigating the alert, the team was unable to find an issue; the node appeared to be functioning with healthy metrics. The alert resolved an hour later and the team assumed it was a false alarm.

At 10:30 a routine deployment was made for a minor update to one of our application services. Fifteen minutes later, at 10:45, our team received an alarm about high memory usage on the same message queueing service node that had earlier failed the health check. The team then observed that some key application services were reporting as unhealthy and the platform website was no longer loading. The team decided at 11:00 to roll back the changes deployed at 10:30; although there was nothing in the changes that related to the issue, the timing of the change appeared to correlate. After rolling back those changes there was no sign of improvement. It was then observed that the message queueing service was blocking connections from our application services.

At 11:15 we decided the most helpful way to mitigate some of the impact and put the platform in a partially operational state was to disable the use of the message queueing service entirely. This would sacrifice key functionality that depends on the service, such as report generation, routing work to providers, integration updates, and notification delivery, but it would restore the majority of the platform's operation. This change was made at 11:20 and the team confirmed the website was loading where it previously was not. After disabling the use of our message queueing service, it was observed that the service still listed connections from our application services even though those services were no longer connecting. The team then started work to close these orphaned connections in the hope that this would restore the health of the node. Due to the number of connections, the team had to spend time developing a script to perform this bulk connection-close operation.
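The postmortem does not name the message queueing technology or share the script itself. Purely for illustration, here is a minimal sketch of what such a bulk connection-close operation could look like, assuming a RabbitMQ-style broker with a management HTTP API; the endpoint, credentials, and node name below are hypothetical.

```python
# Hypothetical sketch only: bulk-close client connections that terminate on one
# degraded broker node, via a RabbitMQ-style management HTTP API. The host,
# credentials, and node name are placeholders, not Field Nation's actual setup.
import requests
from urllib.parse import quote

API = "http://mq-node-1.internal:15672/api"   # assumed management endpoint
AUTH = ("admin", "password")                  # assumed credentials
TARGET_NODE = "rabbit@mq-node-1"              # assumed name of the degraded node

def close_orphaned_connections() -> None:
    # List every client connection the cluster currently knows about.
    connections = requests.get(f"{API}/connections", auth=AUTH, timeout=30).json()
    for conn in connections:
        # Only touch connections that terminate on the degraded node.
        if conn.get("node") != TARGET_NODE:
            continue
        name = conn["name"]  # connection names contain spaces, so URL-encode them
        resp = requests.delete(
            f"{API}/connections/{quote(name, safe='')}",
            auth=AUTH,
            headers={"X-Reason": "closing orphaned connection during incident"},
            timeout=30,
        )
        print(f"closed {name}: HTTP {resp.status_code}")

if __name__ == "__main__":
    close_orphaned_connections()
```

The key design point in a sketch like this is filtering by the degraded node, so that connections to the healthy nodes in the cluster are left untouched.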
This effort encountered complications and couldn't be executed until 12:40, with the connection closing fully completing at 13:05. At 13:13 our support staff made our response team aware that, with message queueing disabled, we were in a significantly less functional state than we thought: no work order pages were loading. This was unexpected and meant the team had been operating on the assumption that the impact was more mitigated than it actually was. After closing the connections we re-enabled the use of the message queueing service at 13:37, suspecting we'd resolved the issue with the bad node. Upon re-enabling, we unfortunately saw early indicators that the message service was still not functioning properly, and we re-disabled its use. At 14:20 we decided to restart a node in the three-node cluster, and at 14:25 we once again tried re-enabling the use of the message queueing service. This time it operated correctly, and we resolved the incident at 14:49, confident the issue was resolved.

#### Future Prevention and Process Improvement

After the restoration of services, the team worked to identify the root cause of the issue. While we've spent a lot of time researching metrics of the message queueing service around the time the incident started, we've so far been unable to determine a correlation to any other metric or event. Our team is continuing to research, but believes this may have been a fluke occurrence on a single node of the cluster. We have established additional monitoring alerts that can clue us in earlier to signs of possible node degradation. Ultimately, when reviewing this incident, the final action taken to reach resolution is one that we realize could have been attempted much earlier. For quicker action, we are establishing standard operating procedures for safely dealing with an unhealthy node in the cluster. Had we had the confidence to take this action sooner, we would have significantly cut down on the length of this outage.
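The postmortem does not describe the monitoring stack, but as a rough sketch of the kind of node-degradation alert mentioned above, the following hypothetical check polls each queue node for a failed health check or high memory usage, the two early signals seen during this incident. The RabbitMQ-style endpoints, credentials, and threshold are all assumptions.

```python
# Hypothetical sketch of a node-degradation check: poll each queue node and flag
# it when its health check fails or memory usage crosses a threshold. URLs,
# credentials, and thresholds are placeholders, not Field Nation's actual config.
import requests

NODES = [
    "http://mq-node-1.internal:15672",
    "http://mq-node-2.internal:15672",
    "http://mq-node-3.internal:15672",
]
AUTH = ("monitor", "password")      # assumed read-only credentials
MEMORY_ALARM_RATIO = 0.8            # alert when a node uses >80% of its memory limit

def check_node(base_url: str) -> list[str]:
    """Return a list of human-readable problems found on one node."""
    problems = []
    try:
        # Per-node health check endpoint (assumed RabbitMQ-style API).
        health = requests.get(f"{base_url}/api/healthchecks/node", auth=AUTH, timeout=5)
        if health.status_code != 200 or health.json().get("status") != "ok":
            problems.append("health check failed")
    except requests.RequestException as exc:
        problems.append(f"health check unreachable: {exc}")
        return problems

    # Compare reported memory usage against each node's configured limit.
    nodes = requests.get(f"{base_url}/api/nodes", auth=AUTH, timeout=5).json()
    for node in nodes:
        used, limit = node.get("mem_used", 0), node.get("mem_limit", 1)
        if used / limit > MEMORY_ALARM_RATIO:
            problems.append(f"{node['name']} memory at {used / limit:.0%} of limit")
    return problems

if __name__ == "__main__":
    for url in NODES:
        for problem in check_node(url):
            # In a real deployment this would page the on-call engineer
            # rather than print to stdout.
            print(f"ALERT {url}: {problem}")
```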
Status: Postmortem
Impact: Critical | Started At: Jan. 9, 2023, 5:01 p.m.
Description: A connection issue was found between our platform and a primary database. The connection has been restored and the issue is now resolved.
Status: Resolved
Impact: Critical | Started At: Dec. 28, 2022, 5:20 a.m.
Description: The replication lag event has recovered.
Status: Resolved
Impact: Minor | Started At: Oct. 26, 2022, 7:59 p.m.
Description: All data jobs have caught up and updates should process normally.
Status: Resolved
Impact: Minor | Started At: Oct. 26, 2022, 6:01 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.