Get notified about any outages, downtime or incidents for Files.com and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Files.com.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Files.com:
Component | Status |
---|---|
Australia Region | Active |
Background Jobs, including Sync and Webhooks | Active |
Canada Region | Active |
Core Services / API | Active |
EU (Germany) Region | Active |
Files Tools | Active |
FTP/FTPS | Active |
Japan Region | Active |
Remote Server Integrations (Sync and Mount) | Active |
SFTP | Active |
Singapore Region | Active |
UK Region | Active |
USA Region | Active |
WebDAV | Active |
Web Interface | Active |
View the latest incidents for Files.com and check for official updates:
Description: On May 12th, 2023, at AM/PM PST, [Files.com](http://Files.com) received customer reports of elevated DNS errors, which resulted in an incident being declared. The Incident Management Team \(IMT\) convened and immediately began investigation.
[Files.com](http://Files.com) released an initial Status Page posting on May 12th, 2023, at 3:45 PM PST stating: _**“Reports of Elevated DNS Errors:** We are investigating reports of DNS errors on the_ [_Files.com_](http://Files.com) _service._ _This is intermittently affecting some logins for all services._ _We will provide additional details as they become available. Customers with urgent questions are encouraged to contact our Customer Support team by email. Thank you for your patience.”_
The elevated DNS errors were resolved on May 12th, 2023, at 4:16 PM PST, returning the platform to full functionality. [Files.com](http://Files.com) released a resolution Status Page posting on May 12th, 2023, at 4:23 PM PST stating _“All services have been restored and are operating normally._ _We resolved a DNS issue resulting in some intermittent errors on accessing_ [_Files.com_](http://Files.com) _sites. Users without the site name cached were potentially affected from approximately 2:25 p.m. PST to 4:16 p.m. PST. This issue did not affect anyone with dedicated IP addresses._ _We will follow up with an Incident Report within ten \(10\) business days including the root cause and steps taken to address the root cause. If you need additional support, please do not hesitate to contact our Customer Support team by email or phone. Thanks for your support while we resolved this issue.”_
This incident occurred during the deployment of changes to our corporate domain registrations as part of the post-mortem/resolution process for the incident that occurred on May 5. 
As discussed in the RCA for that incident, we moved the registration records for all domain names owned by [Files.com](http://Files.com) to CSC Domains, an enterprise and security-focused domain name registrar, for the purpose of mitigating domain name registrar risk. During the process of the domain transfer, the nameservers for one of our domain names were inadvertently entered incorrectly into the new registrar. As a result, DNS lookups for certain domains resulted in failure. This issue only affected a subset of our customers, and did not affect any customers using custom domain names or custom IP addresses. Once we diagnosed the problem, we were able to call CSC Domains and get the matter resolved immediately. As of now, all domains owned by [Files.com](http://Files.com) are managed by CSC Domains, and we do not expect any further registrar-related incidents to occur in the future. We greatly appreciate your patience and understanding as we resolved this issue. If you need additional assistance or continue to experience issues, please contact our Customer Support team.
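The failure mode described above (nameservers entered incorrectly at the new registrar) is the kind of thing a pre-cutover comparison can catch. The following is a minimal sketch, not Files.com's or CSC Domains' actual tooling: it assumes you already hold two plain dictionaries, one with the nameservers each domain is supposed to use and one with the nameservers actually entered at the registrar. The function names and the example hostnames are hypothetical.

```python
def normalize(ns: str) -> str:
    """Lower-case a nameserver hostname and drop any trailing dot."""
    return ns.strip().lower().rstrip(".")


def nameserver_mismatches(expected, registered):
    """Return (missing, unexpected) nameservers for one domain."""
    want = {normalize(ns) for ns in expected}
    have = {normalize(ns) for ns in registered}
    return sorted(want - have), sorted(have - want)


def audit_transfer(expected_by_domain, registered_by_domain):
    """Map each domain whose registrar entry differs to its mismatches."""
    problems = {}
    for domain, expected in expected_by_domain.items():
        # A domain with no registrar entry at all counts as fully missing.
        registered = registered_by_domain.get(domain, [])
        missing, unexpected = nameserver_mismatches(expected, registered)
        if missing or unexpected:
            problems[domain] = {"missing": missing, "unexpected": unexpected}
    return problems
```

Running `audit_transfer` over every transferred domain before the change goes live would surface a mistyped nameserver as a non-empty result for that domain, while case differences and trailing dots are ignored.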
Status: Postmortem
Impact: None | Started At: May 12, 2023, 10:45 p.m.
Description: On May 8th and May 9th, 2023, [Files.com](http://Files.com) received multiple automated alerts and customer reports of intermittent issues with the [Files.com](http://Files.com) platform, which resulted in an incident being declared. The Incident Management Team \(IMT\) convened and immediately began investigation.
[Files.com](http://Files.com) released an initial Status Page posting on May 8th, 2023, at 5:12 PM PST stating: _**“SFTP, FTP/FTPS, WebDAV Service Degraded:** FTP/FTPS, SFTP, WebDAV only: We are investigating elevated error rates on these services on_ [_Files.com_](http://Files.com) _in all regions._ _This incident does not impact other network services such as API, AS2, and others._ _We will provide additional details as they become available. Customers with urgent questions are encouraged to contact our Customer Support team by email. Thank you for your patience.”_
[Files.com](http://Files.com) released a resolution Status Page posting on May 8th, 2023, at 5:37 PM PST stating _“All services have been restored and are operating normally._ _Users connecting to accounts with a custom namespace, an ExaVault host key, a custom host key, or an enforced IP whitelist experienced authentication errors. Logins were impacted between 1:34 p.m. PST and 5:33 p.m. PST. Other users may have experienced elevated error rates as well._ _We will follow up with an Incident Report within ten \(10\) business days including the root cause and steps taken to address the root cause. If you need additional support, please do not hesitate to contact our Customer Support team by email or phone. Thanks for your support while we resolved this issue.”_
Customers continued reporting other intermittent issues with the platform, which resulted in a second incident being declared on May 9th, 2023, at 6:47 AM PST. 
The IMT convened and immediately began investigation. The intermittent issues with the [Files.com](http://Files.com) platform were resolved on May 9th, 2023, at 8:07 AM PST, returning the platform to full functionality.
This incident occurred due to a complex set of circumstances with times that vary by region. This narrative will focus on the overall story of what happened. On May 5, [Files.com](http://Files.com) experienced an incident that resulted in a 3\+ hour service outage. Prior to that, on May 3, [Files.com](http://Files.com) conducted a successful upgrade of certain regional proxy servers in certain regions from Intel architecture to ARM architecture as part of our overall transition from Intel to ARM across all of our services. As we explained in the RCA of the May 5 incident, our Incident Management Team originally misidentified the root cause of that incident as being related to the new ARM servers and made the decision to roll back from our new ARM servers to the old Intel servers in certain regions on May 5. Unfortunately, that rollback was not correctly performed.
We make use of AWS \(Amazon Web Services\) EC2 \(Elastic Compute Cloud\) for all of our compute resources on [Files.com](http://Files.com). Both the Intel and ARM servers being discussed run inside AWS EC2. The EC2 networking backplane suffers from a long-standing bug that we have long been aware of where migrating an IP from one server to another can result in erroneous data reported by EC2 to our instances. In short, if you live migrate an IP on EC2 from one server to another, EC2 can report to both servers that they still “own” the IP. Because of this bug, we have a complicated procedure for migrating IPs from one server to another. This procedure is highly automated and provides that we always fully shut down servers after IPs are moved off of them. This procedure works around the EC2 bug. 
When we performed the rollback from ARM to Intel servers on May 5, we failed to fully follow our procedure and did not fully shut down the ARM servers. They were “disabled” using a softer disabling mechanism, but at some point they rebooted and once they rebooted, EC2 began to report conflicting information about which server “owned” the IPs related to this incident. In our architecture, servers report their internal and external IP list to our central routing system on a regular schedule. As a result of the two sets of servers reporting conflicting information, our routing systems began to oscillate routing traffic between the Intel and ARM servers every few minutes, and only one set of servers would work at a given time.
The root cause of this incident was our failure to follow our own procedure during the transition between ARM and Intel servers. A major contributing factor was our failure to detect a situation where IP addresses appear to oscillate between multiple servers. Another contributing factor is the AWS EC2 bug that results in incorrect IP address information being reported to instances.
As a result of this incident, we have conducted remedial training with all of our Infrastructure team to re-train them on the procedure to migrate IPs from one server to another. We have additionally added new protection to our routing system that will detect a situation where IP addresses oscillate between servers and raise an alarm when that happens in the future. Furthermore, we have improved our internal synthetic monitoring systems with the ability to detect the situation that occurred during this incident and treat it as a failure.
On a more general note, we have added a considerable amount of sophistication to our monitoring and routing systems as a result of the several incidents that occurred in May, and we are adding more. 
These improvements amount to over 5,000 lines of code and we are optimistic that they will reduce the frequency and impact of incidents in the future. We greatly appreciate your patience and understanding as we resolved these issues. If you need additional assistance or continue to experience issues, please contact our Customer Support team.
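The oscillation check described in this report can be illustrated with a small sketch. This is not Files.com's routing code: it assumes a simplified model in which each server periodically reports the IPs it believes it owns and the most recent report wins, and the class name, thresholds, and server labels are all hypothetical.

```python
from collections import defaultdict, deque


class OwnershipFlapDetector:
    """Flags IPs whose reported owner changes too often within a time window."""

    def __init__(self, max_changes=2, window_seconds=600):
        self.max_changes = max_changes
        self.window_seconds = window_seconds
        self._last_owner = {}
        self._changes = defaultdict(deque)  # ip -> timestamps of owner changes

    def report(self, ip, server_id, timestamp):
        """Record that server_id claims ip at timestamp; return True if flapping."""
        previous = self._last_owner.get(ip)
        self._last_owner[ip] = server_id
        changes = self._changes[ip]
        if previous is not None and previous != server_id:
            changes.append(timestamp)
        # Forget owner changes that are older than the window.
        while changes and timestamp - changes[0] > self.window_seconds:
            changes.popleft()
        return len(changes) > self.max_changes
```

With the defaults, a single legitimate migration (one owner change) never trips the detector, while two servers alternately claiming the same IP every few minutes would return `True` on the third change inside ten minutes, which is the point at which an alarm could be raised.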
Status: Postmortem
Impact: Major | Started At: May 9, 2023, 12:12 a.m.
Description: On May 8th, 2023, at 1:39 PM PST, [Files.com](http://Files.com) received automated alerts that SFTP was entirely down in the US East region, which resulted in an incident being declared. The Incident Management Team \(IMT\) convened and immediately began investigation.
[Files.com](http://Files.com) released an initial Status Page posting on May 8th, 2023, at 1:47 PM PST stating: _**“SFTP Entirely Down – US East Region \(Primary\):** SFTP only: We are investigating a major outage of the SFTP service on_ [_Files.com_](http://Files.com) _in our primary USA region._ _This incident does not impact other network services such as API, FTP, WebDAV, AS2, and others._ _If you have an urgent need to access_ [_Files.com_](http://Files.com)_, we recommend using FTP in lieu of SFTP. If you must connect via SFTP, you should be able to immediately connect \(and access your existing files and account\) using the hostname of our Canada region, which is_ [_app-ca-central-1.files.com_](http://app-ca-central-1.files.com)_._ _We will provide additional details as they become available. Customers with urgent questions are encouraged to contact our Customer Support team by email. Thank you for your patience.”_
The SFTP outage in the US East region was resolved on May 8th, 2023, at 1:47 PM PST, returning the platform to full functionality. [Files.com](http://Files.com) released a resolution Status Page posting on May 8th, 2023, at 1:51 PM PST stating _“All services have been restored and are operating normally._ _We have resolved a major outage of the SFTP service on_ [_Files.com_](http://Files.com) _in our primary USA region. This incident did not impact other network services such as API, FTP, WebDAV, AS2, and others. The SFTP service was down from 1:34 p.m. 
to 1:47 p.m., with a total downtime of 13 minutes, but only in the primary USA region._ _If you previously moved any workloads to another region in response to this incident, you are cleared to move those regional workloads back to the USA region._ _We will follow up with an Incident Report within ten \(10\) business days including the root cause and steps taken to address the root cause. If you need additional support, please do not hesitate to contact our Customer Support team by email or phone. Thanks for your support while we resolved this issue.”_
This incident occurred during a time period that also contained multiple other incidents, some of which are overlapping. This report focuses specifically on the symptoms described here, but many customers who experienced this incident also experienced one of the other incidents.
This incident had two distinct parts and root causes. First, [Files.com](http://Files.com) deployed a change to its SFTP server as part of our overall project to dramatically improve the logging and handling of errors on SFTP. The deployment of that change crashed our SFTP servers in several of our smaller regions due to an “out of memory” condition. Our SFTP server is developed in Java, and anyone familiar with Java can tell you how sensitive Java can be to memory configuration settings. We immediately identified the issue with the Java memory settings and pushed a change to Chef, our infrastructure configuration management system, to tweak the SFTP memory settings and resolve the initial crash. The root cause of this first part was [Files.com](http://Files.com)’s failure to monitor Java runtime parameters such as memory usage to defend against an out of memory condition. We have added additional monitoring around Java memory usage and are optimistic that this situation will be avoided in the future. 
One benefit of the [Files.com](http://Files.com) architecture as compared with many of our peers is that on [Files.com](http://Files.com), SFTP is a completely isolated subsystem, so this incident did not impact other network services such as FTP, AS2, WebDAV, or API. Unfortunately, when we deployed the configuration change via Chef, we inadvertently deployed an unrelated configuration change at the same time that had been previously merged but not deployed to the SFTP servers. This is due to the fact that we use one unified Chef repository for server configuration where certain recipes can be shared by different server types. That configuration change introduced an error into the upstream communication with our API, resulting in inability to connect via SFTP for certain customers. After investigating the issue, we were able to identify the bad configuration change and revert it. The root cause of the second part is [Files.com](http://Files.com)’s failure to operate adequate change management procedures to prevent an unintended change from being deployed. Our incident management team was quite disappointed to learn about the chain of events that led to this incident. We have already improved our internal synthetic monitoring systems with the ability to detect the situation that occurred during this incident and alert on it immediately. Additionally, as a result of this incident, we are implementing major changes to our change management procedures designed to prevent this sort of configuration management error from happening again. Those changes are fairly complicated and will require a great deal of internal development. As such, they will likely not be deployed until the middle of Q3. It is our goal to have them implemented before our next SOC 2 Type II observation period \(which runs from Q2-Q3 2023\) and documented in our next SOC 2 Type II report. 
On a more general note, we have added a considerable amount of sophistication to our monitoring and routing systems as a result of the several incidents that occurred in May, and we are adding more. These improvements amount to over 5,000 lines of code and we are optimistic that they will reduce the frequency and impact of incidents in the future. We hope to share more about the improvements in our next SOC 2 Type II report. We greatly appreciate your patience and understanding as we resolved this issue. If you need additional assistance or continue to experience issues, please contact our Customer Support team.
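The second root cause (a previously merged but undeployed change riding along with an urgent fix) suggests a pre-deploy guard. The sketch below is only an illustration under stated assumptions, not the change management process Files.com describes building and not a Chef feature: it assumes each server type has a list of pending change identifiers that a deploy would ship, and that the operator supplies the identifiers they intend to deploy. All names are hypothetical.

```python
def unexpected_changes(pending, approved):
    """Return pending change IDs that were not approved for this deploy."""
    approved_set = set(approved)
    return [change for change in pending if change not in approved_set]


def check_deploy(server_type, pending_by_server_type, approved):
    """Raise if deploying to server_type would ship unapproved changes."""
    pending = pending_by_server_type.get(server_type, [])
    extras = unexpected_changes(pending, approved)
    if extras:
        raise RuntimeError(
            f"deploy to {server_type} would include unapproved changes: {extras}"
        )
    return pending
```

In the scenario from this report, the pending list for the SFTP servers would have contained both the memory-settings fix and the unrelated shared-recipe change, so approving only the former would stop the deploy and name the extra change.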
Status: Postmortem
Impact: Critical | Started At: May 8, 2023, 8:47 p.m.