Get notified about any outages, downtime, or incidents for Field Nation and 1800+ other cloud vendors. Monitor 10 companies for free.
Outage and incident data over the last 30 days for Field Nation.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Field Nation:
| Component | Status |
|---|---|
| API | Active |
| Marketing Website | Active |
| Mobile App | Active |
| Out of the box Integration connectors | Active |
| Web App | Active |
View the latest incidents for Field Nation and check for official updates:
Description: This incident has been resolved.
Status: Resolved
Impact: None | Started At: Jan. 17, 2023, 9:58 p.m.
Description:

#### Summary

A failure of a central message queueing service, responsible for delivering event messages across the application services that make up our system, caused a widespread outage of the Field Nation platform. The outage lasted a significant amount of time: 3.5 hours. While our team believed they had mitigated the extent of the impact, they learned well into the incident that the mitigation was less effective than initially believed. We're sincerely sorry for the disruption and its duration. We understand how important Field Nation is to our service providers and buyers in getting work done, and we work hard to ensure technical issues don't get in the way of that. The incident was caused by a single degraded node. We are implementing additional monitors to more quickly identify such a problem in the future, and we are also improving our internal operating documentation to give better guidance for much quicker resolution of this kind of system failure.

#### What Happened

On Jan 9th at 09:17 we were alerted to a health issue with one of the three nodes that make up our central message queueing service. The alert indicated that the node did not reply to an automated health check and was likely down. This service enables the application services that make up our platform to be aware of events occurring within the platform and to queue work for background processing. Upon investigating the alert, the team was unable to find an issue; the node appeared to be functioning with healthy metrics. The alert resolved an hour later and the team assumed it was a false alarm.

At 10:30 a routine deployment was made for a minor update to one of our application services. Fifteen minutes later, at 10:45, our team received an alarm about high memory usage on the same message queueing service node that had earlier failed the health check. The team then observed that some key application services were reporting as unhealthy and the platform website was no longer loading. The team decided at 11:00 to roll back the changes deployed at 10:30; although there was nothing in the changes that related to the issue, the timing of the change appeared to correlate. After rolling back those changes there was no sign of improvement. It was then observed that the message queueing service was blocking connections from our application services.

At 11:15 we decided the most helpful way to mitigate some of the impact and put the platform in a partially operational state was to disable the use of the message queueing service entirely. This would sacrifice key functionality that depends on the service, such as report generation, routing work to providers, integration updates, and notification delivery, but it would restore the majority of the platform's operation. This change was made at 11:20 and the team confirmed the website was loading where it previously was not. After disabling the use of our message queueing service, it was observed that the service still listed connections from our application services even though those services were no longer connecting. The team then started work to close these orphaned connections in the hope that this would restore the health of the node. Due to the number of connections, the team had to spend time developing a script to perform this bulk connection-close operation.
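The postmortem does not name the message queueing technology or share the script itself. Purely for illustration, here is a minimal sketch of what such a bulk connection-close operation could look like, assuming a RabbitMQ-style broker with a management HTTP API; the endpoint, credentials, and node name below are hypothetical.

```python
# Hypothetical sketch only: bulk-close client connections that terminate on one
# degraded broker node, via a RabbitMQ-style management HTTP API. The host,
# credentials, and node name are placeholders, not Field Nation's actual setup.
import requests
from urllib.parse import quote

API = "http://mq-node-1.internal:15672/api"   # assumed management endpoint
AUTH = ("admin", "password")                  # assumed credentials
TARGET_NODE = "rabbit@mq-node-1"              # assumed name of the degraded node

def close_orphaned_connections() -> None:
    # List every client connection the cluster currently knows about.
    connections = requests.get(f"{API}/connections", auth=AUTH, timeout=30).json()
    for conn in connections:
        # Only touch connections that terminate on the degraded node.
        if conn.get("node") != TARGET_NODE:
            continue
        name = conn["name"]  # connection names contain spaces, so URL-encode them
        resp = requests.delete(
            f"{API}/connections/{quote(name, safe='')}",
            auth=AUTH,
            headers={"X-Reason": "closing orphaned connection during incident"},
            timeout=30,
        )
        print(f"closed {name}: HTTP {resp.status_code}")

if __name__ == "__main__":
    close_orphaned_connections()
```

The key design point in a sketch like this is filtering by the degraded node, so that connections to the healthy nodes in the cluster are left untouched.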
This effort encountered complications and couldn't be executed until 12:40, with the connection closing fully completing at 13:05. At 13:13 our support staff made our response team aware that, with message queueing disabled, we were in a significantly less functional state than we thought: no work order pages were loading. This was unexpected and meant the team had been operating on the assumption that the impact was more mitigated than it actually was. After closing the connections we re-enabled the use of the message queueing service at 13:37, suspecting we'd resolved the issue with the bad node. Upon re-enabling, we unfortunately saw early indicators that the message service was still not functioning properly, and we re-disabled its use. At 14:20 we decided to restart a node in the three-node cluster, and at 14:25 we once again tried re-enabling the use of the message queueing service. This time it operated correctly, and we resolved the incident at 14:49, confident the issue was resolved.

#### Future Prevention and Process Improvement

After the restoration of services, the team worked to identify the root cause of the issue. While we've spent a lot of time researching metrics of the message queueing service around the time the incident started, we've so far been unable to determine a correlation to any other metric or event. Our team is continuing to research, but believes this may have been a fluke occurrence on a single node of the cluster. We have established additional monitoring alerts that can clue us in earlier to signs of possible node degradation. Ultimately, when reviewing this incident, the final action taken to reach resolution is one that we realize could have been attempted much earlier. For quicker action, we are establishing standard operating procedures for safely dealing with an unhealthy node in the cluster. Had we had the confidence to take this action sooner, we would have significantly cut down on the length of this outage.
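The postmortem does not describe the monitoring stack, but as a rough sketch of the kind of node-degradation alert mentioned above, the following hypothetical check polls each queue node for a failed health check or high memory usage, the two early signals seen during this incident. The RabbitMQ-style endpoints, credentials, and threshold are all assumptions.

```python
# Hypothetical sketch of a node-degradation check: poll each queue node and flag
# it when its health check fails or memory usage crosses a threshold. URLs,
# credentials, and thresholds are placeholders, not Field Nation's actual config.
import requests

NODES = [
    "http://mq-node-1.internal:15672",
    "http://mq-node-2.internal:15672",
    "http://mq-node-3.internal:15672",
]
AUTH = ("monitor", "password")      # assumed read-only credentials
MEMORY_ALARM_RATIO = 0.8            # alert when a node uses >80% of its memory limit

def check_node(base_url: str) -> list[str]:
    """Return a list of human-readable problems found on one node."""
    problems = []
    try:
        # Per-node health check endpoint (assumed RabbitMQ-style API).
        health = requests.get(f"{base_url}/api/healthchecks/node", auth=AUTH, timeout=5)
        if health.status_code != 200 or health.json().get("status") != "ok":
            problems.append("health check failed")
    except requests.RequestException as exc:
        problems.append(f"health check unreachable: {exc}")
        return problems

    # Compare reported memory usage against each node's configured limit.
    nodes = requests.get(f"{base_url}/api/nodes", auth=AUTH, timeout=5).json()
    for node in nodes:
        used, limit = node.get("mem_used", 0), node.get("mem_limit", 1)
        if used / limit > MEMORY_ALARM_RATIO:
            problems.append(f"{node['name']} memory at {used / limit:.0%} of limit")
    return problems

if __name__ == "__main__":
    for url in NODES:
        for problem in check_node(url):
            # In a real deployment this would page the on-call engineer
            # rather than print to stdout.
            print(f"ALERT {url}: {problem}")
```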
Status: Postmortem
Impact: Critical | Started At: Jan. 9, 2023, 5:01 p.m.
Description: A connection issue was found between our platform and a primary database. The connection has been restored and the issue is now resolved.
Status: Resolved
Impact: Critical | Started At: Dec. 28, 2022, 5:20 a.m.
Description: The replication lag event has recovered.
Status: Resolved
Impact: Minor | Started At: Oct. 26, 2022, 7:59 p.m.
Description: All data jobs have caught up and updates should process normally.
Status: Resolved
Impact: Minor | Started At: Oct. 26, 2022, 6:01 p.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.