Get notified about any outages, downtime or incidents for Castle and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Castle.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Castle:
| Component | Status |
|---|---|
| Dashboard | Active |
| Legacy APIs | Active |
| Log API | Active |
| Risk and Filter APIs | Active |
View the latest incidents for Castle and check for official updates:
Description: At 2021-09-06 20:04 UTC we experienced an AWS hardware failure with one of our main databases, which led to 7 minutes of downtime impacting our APIs. During this time, the APIs were returning a 500 response code and no data was processed. The database in question is configured to be multi-node with automatic failover, but for unknown reasons the failover didn't happen as expected when the hardware fault occurred. Instead, a full backup had to be recreated, which led to the extended period of downtime. We're currently debugging this with AWS Support to make sure we can trust the resiliency of our platform. While the current setup should provide good redundancy, we're simultaneously looking into alternative options to prevent this from happening again.
Status: Resolved
Impact: None | Started At: Sept. 6, 2021, 8:04 p.m.
Description: Between 14:43 and 15:31 UTC, Castle experienced an infrastructure issue with our message queuing system that caused some customer event data to be lost. While risk scoring and inline responses were functioning normally, requests sent during the incident will not be visible or searchable in the Castle Dashboard. We're prioritizing efforts to add extra redundancy to our system to prevent this from happening again.
Status: Resolved
Impact: None | Started At: Aug. 30, 2021, 2:43 p.m.
Description: On Sunday, April 4th, 2021, beginning at 13:56 UTC, Castle's `/authenticate` endpoint was unavailable. Our teams promptly responded and service was restored at 14:09 UTC. We've conducted a full retrospective and root-cause analysis and determined that the original cause of the incident was the hardware failure (as confirmed by AWS Support) of an AWS host instance that contained Castle's managed cache service. This hardware failure caused an accumulation of timeouts, resulting in some app instances being marked unhealthy and automatically restarted in a loop. Although rare, we do expect occasional hardware-level failures, and our system is designed to be resilient to these failures whenever possible. In this case, the accumulated timeouts caused the system to behave in a way we have not seen before. We have re-prioritized our engineering team to implement '[circuit breaker](https://martinfowler.com/bliki/CircuitBreaker.html)'-style handling around cache look-ups, which will prevent subsequent cache layer failures from impacting synchronous endpoints like `/authenticate`. (An illustrative sketch of this pattern follows this incident entry.)
Status: Postmortem
Impact: Minor | Started At: April 4, 2021, 1:56 p.m.
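The 'circuit breaker' handling referenced in the postmortem above is a general resiliency pattern: after repeated cache failures, callers stop waiting on the cache for a cooling-off period and fall back to a degraded path instead, so a dying cache host cannot stall synchronous endpoints. The sketch below is a minimal, illustrative Python version; the class, thresholds, and the `cache_client` object are assumptions made for this example, not Castle's actual implementation.

```python
# Minimal circuit-breaker sketch (illustrative only). `CircuitBreaker`,
# `cache_client`, and the thresholds are assumptions for this example,
# not taken from Castle's codebase.
import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then short-circuits
    calls for `reset_timeout` seconds so a failing cache layer cannot keep
    stalling synchronous request handling."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While open, skip the protected call entirely until the timeout elapses.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: allow a single trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        return result


breaker = CircuitBreaker()


def lookup_device_profile(cache_client, device_id):
    # A cache lookup that degrades to `None` instead of timing out repeatedly,
    # letting the caller recompute or proceed without the cached value.
    return breaker.call(
        lambda: cache_client.get(device_id),  # may raise or time out if the cache host fails
        lambda: None,
    )
```

Tuning the failure threshold and reset timeout trades freshness of cached data against how quickly synchronous endpoints are shielded from a misbehaving cache layer.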
Description: On March 30, 2021, Castle's API became degraded during three distinct windows of time:

* 12:02 UTC - 12:45 UTC
* 12:59 UTC - 13:41 UTC
* 14:48 UTC - 15:25 UTC

During this time, some Castle API calls failed, including calls to our synchronous `authenticate` endpoint. The Castle dashboard was up but, because the API was unavailable, was not rendering data. Service was fully restored as of 15:25 UTC, and some data generated from requests to our asynchronous `track` and `batch` endpoints during the incident was recovered from queues and subsequently processed.

As we communicated to all active customers yesterday, we take this sort of incident very seriously, and want to share some of the factors that led to it. The root cause of the incident was a failure of one of our primary data clusters. This is a multi-node, fault-tolerant commercial solution, and a complete cluster failure is extremely rare. Castle's infrastructure team responded immediately to the incident and found an unbounded memory leak that caused each node to shut down simultaneously. Over the course of the incident, we learned this memory leak was exacerbated by a specific class of background job that actually began running a day prior but did not begin leaking memory for some time.

When the incident began, we detected the issue and immediately restarted the cluster. A full 'cold start' of the entire cluster takes around 40 minutes, and this accounts for the first downtime window. After the cluster restarted, our fault-tolerant job scheduling system attempted to run the jobs again, which caused the cluster to require full cold restarts twice more as we worked to clear out the job queue and replicas. At this time, we believe the reason for the memory leak is a bug in our data cluster provider's software - we have been able to successfully reproduce the issue in a test environment and have a high-priority case open with their support team. In the meantime, we have audited all active background job systems to ensure performance-affecting jobs are temporarily disabled or worked around.

Once again, we apologize for the impact of this interruption. Please feel free to contact us at [email protected] if you have any further questions.
Status: Postmortem
Impact: Critical | Started At: March 30, 2021, 12:09 p.m.