Last checked: 5 minutes ago
Outage and incident data over the last 30 days for RebelMouse.
OutLogger tracks the status of these components for RebelMouse:
Component | Status |
---|---|
AWS ec2-us-east-1 | Active |
AWS elb-us-east-1 | Active |
AWS RDS | Active |
AWS route53 | Active |
AWS s3-us-standard | Active |
AWS ses-us-east-1 | Active |
Braintree API | Active |
Braintree PayPal Processing | Active |
CDN | Active |
Celery | Active |
Content Delivery API | Active |
Discovery | Active |
EKS Cluster | Active |
 | Active |
Fastly Amsterdam (AMS) | Active |
Fastly Hong Kong (HKG) | Active |
Fastly London (LHR) | Active |
Fastly Los Angeles (LAX) | Active |
Fastly New York (JFK) | Active |
Fastly Sydney (SYD) | Active |
Full Platform | Active |
Google Apps Analytics | Active |
Logged In Users | Active |
Media | Active |
Mongo Cluster | Active |
Pharos | Active |
RabbitMQ | Active |
Redis Cluster | Active |
Sentry Dashboard | Active |
Stats | Active |
Talaria | Active |
 | Active |
WFE | Active |
View the latest incidents for RebelMouse and check for official updates:
Description:
# Chronology of the incident
At 16:27 UTC, we detected a significant load on our servers. By 16:43 UTC, we identified that the CoreDNS server was suffering performance degradation due to the scaling out of applications within our Kubernetes cluster. The situation was further complicated by performance degradation in our MongoDB database at 16:55 UTC, caused by an excessive number of open connections initiated by the scaling applications. An emergency meeting was convened at 17:04 UTC, and the source of the excessive load on the DNS servers was identified at 17:16 UTC. Measures were immediately taken to optimize DNS queries across the Kubernetes cluster by reducing the number of DNS clients, which mainly involved halting non-essential services. These measures led to an initial recovery of performance at 17:30 UTC, and a fix was subsequently developed for the CoreDNS configuration, identified as the root cause of the issues. Unfortunately, at 19:16 UTC, a restart of CoreDNS caused performance degradation on the editorial clusters and revealed that one of the MongoDB replica set instances was unavailable. The restart triggered a cache purge, which exposed the full extent of the MongoDB performance degradation. We identified that the MongoDB issue in turn significantly affected the performance of our CoreDNS systems, further complicating the situation. Recognizing the severity of the situation, we immediately launched a recovery process for the MongoDB replica set. As we progressed with damage control, a preliminary attempt was made to reinstate the affected services. Despite our efforts, the reactivation led to significant setbacks, notably impacting the overall performance of the editorial web platform. However, the websites for end users and crawlers maintained their functionality and continued to operate with no major degradation. To reinforce operational stability, we opted to keep the service offline pending a comprehensive investigation and resolution of the underlying issues with the MongoDB database. These measures facilitated a full recovery of the MongoDB system by 21:10 UTC. Post recovery, we continued to monitor the situation for a set period before cautiously reactivating services, which marked the end of the active incident.
# The impact of the incident
While the websites for end users and crawlers functioned without meaningful disruption, the incident resulted in partial performance degradation of the editorial clusters and of non-essential services such as automations and JavaScript runtimes.
# The underlying cause
The incident was triggered by a combination of factors: an aggressive web crawler, a surge in cache invalidations due to layout updates, and a suboptimal CoreDNS configuration.
# Actions taken & Preventive Measures
We reconfigured the CoreDNS setup, significantly increasing the service's capacity. As a preventive measure, we will update our in-house cache logic to spread the cache revalidation process over time and prevent request spikes to the origins. (A minimal sketch of this jittered-revalidation idea follows this record.)
Status: Postmortem
Impact: None | Started At: May 29, 2024, 5:09 p.m.
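The following is a minimal sketch of the preventive measure described above: spreading cache revalidation over time by jittering expiration timestamps so that entries cached at the same moment do not all hit the origin at once. RebelMouse's in-house cache logic is not public, so the constants and helper names below (`BASE_TTL_SECONDS`, `JITTER_FRACTION`, `expiry_timestamp`) are illustrative assumptions, not their actual implementation.

```python
import random
import time

# Illustrative sketch only: the real cache logic is not public, so these
# numbers are assumptions chosen for the example.
BASE_TTL_SECONDS = 300        # nominal cache lifetime (assumed)
JITTER_FRACTION = 0.2         # spread expirations across +/-20% of the TTL

def expiry_timestamp(now: float | None = None) -> float:
    """Return an expiration time with random jitter, so entries cached at the
    same moment (e.g. after a layout update) revalidate at different times
    instead of producing a synchronized spike of requests to the origin."""
    now = time.time() if now is None else now
    jitter = random.uniform(-JITTER_FRACTION, JITTER_FRACTION) * BASE_TTL_SECONDS
    return now + BASE_TTL_SECONDS + jitter

def is_stale(entry_expiry: float, now: float | None = None) -> bool:
    """True if the cached entry should be revalidated against the origin."""
    now = time.time() if now is None else now
    return now >= entry_expiry
```

Without the jitter, a mass cache invalidation (such as the layout-update surge named as a contributing cause) expires many entries simultaneously and concentrates the revalidation load into a single spike; the jitter spreads that load across the TTL window.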
Description:
**Chronology of the incident**
* Apr 25, 2024, 05:12 PM UTC: RebelMouse received an alert from internal monitoring systems about a significantly increased error rate.
* Apr 25, 2024, 05:12 PM UTC: The DevOps team started to check the systems.
* Apr 25, 2024, 05:23 PM UTC: RebelMouse published a status portal message about performance degradation.
* Apr 25, 2024, 05:26 PM UTC: The problem was identified as an overload of Talaria (Smart Cache Service).
* Apr 25, 2024, 05:42 PM UTC: Traffic was rerouted to bypass Talaria. This restored performance for end users.
* Apr 25, 2024, 06:00 PM UTC: Configuration changes were applied to increase the resources for Talaria.
* Apr 25, 2024, 06:09 PM UTC: Talaria was re-enabled.
* Apr 26, 2024, 01:06 PM UTC: The incident was marked as resolved.

**The impact of the incident**
The incident resulted in performance degradation, leading to periods of unavailability for public pages and delays in publishing content.

**The underlying cause**
An increased amount of traffic overloaded Talaria.

**Actions taken & Preventive Measures**
We reviewed the configuration of the Talaria service, added resources to it, and optimized the autoscaling rules. Our autoscaling system operates on preset rules designed to accommodate anticipated loads; as traffic patterns shift over time, those rules must be periodically reviewed and adjusted. (A sketch of such a scaling rule follows this record.)
Status: Postmortem
Impact: Minor | Started At: April 25, 2024, 5:23 p.m.
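As a rough illustration of the kind of preset autoscaling rule the postmortem says must be reviewed as traffic shifts, here is a minimal sketch. The per-replica capacity and replica bounds are invented for the example and are not Talaria's real configuration.

```python
import math

# Hypothetical constants: Talaria's actual autoscaling rules are not public.
REQUESTS_PER_REPLICA = 500   # assumed sustainable load per replica (req/s)
MIN_REPLICAS = 3
MAX_REPLICAS = 40

def desired_replicas(current_request_rate: float) -> int:
    """Scale the replica count proportionally to observed traffic, clamped to
    a configured range. If traffic patterns shift, the constants above must be
    revisited; that is the periodic review the postmortem describes."""
    wanted = math.ceil(current_request_rate / REQUESTS_PER_REPLICA)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))
```

The failure mode in this incident is visible in the clamp: once traffic exceeds what `MAX_REPLICAS` (or the stale `REQUESTS_PER_REPLICA` estimate) can absorb, the rule stops adding capacity and the service overloads until the limits are raised.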
Description:
## Chronology of the incident
* Mar 27, 2024, 01:02 PM UTC: RebelMouse received an alert from internal monitoring systems about a slightly increased error rate.
* Mar 27, 2024, 01:07 PM UTC: The DevOps team checked the systems and noticed a short load spike; the error rate had already returned to normal.
* Mar 27, 2024, 02:12 PM UTC: RebelMouse received an alert from internal monitoring systems about a slightly increased error rate.
* Mar 27, 2024, 02:14 PM UTC: RebelMouse team members observed a temporary degradation in performance across certain services and promptly reported an incident.
* Mar 27, 2024, 02:22 PM UTC: A dedicated incident resolution team was assembled and began an investigation.
* Mar 27, 2024, 02:44 PM UTC: Significant traffic anomalies were identified, prompting the allocation of extra resources to the cluster handling that traffic.
* Mar 27, 2024, 02:53 PM UTC: The incident resolution team transitioned into monitoring mode.
* Mar 27, 2024, 04:27 PM UTC: RebelMouse received an alert from internal monitoring systems about a significantly increased error rate.
* Mar 27, 2024, 04:30 PM UTC: The incident resolution team decided to fully reroute the suspicious traffic to an independent cluster.
* Mar 27, 2024, 04:57 PM UTC: The suspicious traffic was isolated in the independent cluster.
* Mar 27, 2024, 05:03 PM UTC: RebelMouse published the status portal message.
* Mar 27, 2024, 05:04 PM UTC: The incident resolution team shifted into monitoring mode and began exploring potential enhancements in case the issue recurred.
* Mar 27, 2024, 05:48 PM UTC: RebelMouse received client reports of performance degradation, along with alerts from monitoring systems.
* Mar 27, 2024, 05:58 PM UTC: The root cause of the problem was identified.
* Mar 27, 2024, 06:05 PM UTC: The fix was implemented.
* Mar 28, 2024: An independent cluster was established specifically for editorial traffic to safeguard it from potential disruptions caused by other services.

## The impact of the incident
The incident resulted in intermittent performance degradation, leading to periods of unavailability for editorial tools.

## The underlying cause if known
The `Broken Links` service shared endpoints with critical editorial tools such as the `Entry Editor` and the `Posts Dashboard`. Periodically, this service generated long-running requests, causing health checks to fail and Kubernetes to deem the pods unhealthy. Kubernetes then terminated these pods and recreated them, making the affected services temporarily unavailable during the restart. (A simplified sketch of this failure mode follows this record.)

## Actions taken & Preventive Measures
An independent cluster was established specifically for editorial traffic to safeguard it from potential disruptions caused by other services.
Status: Postmortem
Impact: Minor | Started At: March 27, 2024, 5:02 p.m.
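The failure mode described in the underlying cause can be reproduced with a deliberately simplified, single-threaded HTTP server: while a long-running request is in flight, the health endpoint cannot answer, so a liveness probe with a short timeout would mark the pod unhealthy and trigger a restart. The endpoint paths and timings below are hypothetical, not RebelMouse's actual routes.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Minimal single-threaded server illustrating the failure mode: while one
# long-running request (a hypothetical /broken-links scan) is being handled,
# the /healthz endpoint cannot respond, so a probe with a short timeout fails.

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        elif self.path == "/broken-links":
            time.sleep(30)  # simulate a long-running link scan
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"done")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # HTTPServer handles one request at a time, so /healthz calls queue behind
    # the slow /broken-links request, mirroring how shared endpoints led to
    # failed health checks and pod restarts.
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```

Separating the slow workload onto its own deployment (the independent cluster described in the postmortem) keeps the health-check path responsive for the critical editorial tools.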
Description:
## **Chronology of the incident**
Feb 8, 2024, 4:20 PM EST – An increase in the error rate was observed.
Feb 8, 2024, 4:25 PM EST – Monitoring systems detected anomalies, prompting the RebelMouse team to initiate an investigation.
Feb 8, 2024, 5:00 PM EST – Error rates surged significantly.
Feb 8, 2024, 5:16 PM EST – The RebelMouse team officially categorized the incident as Major and communicated it through the Status Portal.
Feb 8, 2024, 5:30 PM EST – The root cause was pinpointed: new instances could not be launched within the EKS cluster.
Feb 8, 2024, 6:00 PM EST – The RebelMouse team mitigated the issue by updating the network configuration and manually launching the required instances to restore system performance.
Feb 8, 2024, 8:51 PM EST – RebelMouse initiated a support request regarding the AWS services outage.
Feb 8, 2024, 9:10 PM EST – Systems reconfiguration was completed, and the team entered monitoring mode.
Feb 8, 2024, 10:10 PM EST – The incident was officially resolved.
Feb 10, 2024, 2:30 AM EST – AWS confirmed an issue with the EKS service in the us-east-1 region during the specified period and that services had been restored.

## **The impact of the incident**
Multiple key services hosted in the AWS us-east-1 region for RebelMouse were impacted, leading to partial unavailability.

## **The underlying cause if known**
The root cause was identified as a networking issue within AWS, specifically affecting the EKS service in the us-east-1 region. AWS acknowledged the issue, and its team actively worked on resolving it.

## **Actions taken**
RebelMouse engineering teams were engaged as soon as the problem was identified. They worked diligently to resolve the issue as quickly as possible while keeping customers updated on the situation.

## **Preventive Measures**
We have recognized the importance of enhancing our strategies for handling potential networking issues. Going forward, we will look for opportunities to mitigate these challenges by implementing extensive caching systems and boosting our redundant caching capacity. (A sketch of a stale-on-error caching approach follows this record.)
Status: Postmortem
Impact: Minor | Started At: Feb. 8, 2024, 10:16 p.m.
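One common way to realize the "extensive caching" preventive measure is to serve stale cached responses when the origin is unreachable, so a regional networking issue degrades freshness rather than availability. The sketch below assumes a simple in-process cache; the function names, TTL, and error handling are illustrative assumptions, not RebelMouse's actual architecture.

```python
import time

# Illustrative stale-on-error cache: serve previously cached pages when the
# origin (e.g. a backend in an affected region) cannot be reached.
CACHE: dict[str, tuple[float, str]] = {}   # path -> (stored_at, body)
TTL_SECONDS = 60

def fetch_from_origin(path: str) -> str:
    """Placeholder for the real origin request; assumed to raise on failure."""
    raise ConnectionError("origin unreachable")

def get(path: str) -> str:
    cached = CACHE.get(path)
    if cached and time.time() - cached[0] < TTL_SECONDS:
        return cached[1]                    # fresh cache hit
    try:
        body = fetch_from_origin(path)
        CACHE[path] = (time.time(), body)
        return body
    except ConnectionError:
        if cached:
            return cached[1]                # serve stale rather than fail
        raise
```

Redundant caching capacity, as mentioned in the preventive measures, extends the same idea across regions so that cached copies remain reachable even when one region has networking problems.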