Outage and incident data over the last 30 days for RebelMouse.
Outlogger tracks the status of these components for RebelMouse:
Component | Status |
---|---|
AWS ec2-us-east-1 | Active |
AWS elb-us-east-1 | Active |
AWS RDS | Active |
AWS route53 | Active |
AWS s3-us-standard | Active |
AWS ses-us-east-1 | Active |
Braintree API | Active |
Braintree PayPal Processing | Active |
CDN | Active |
Celery | Active |
Content Delivery API | Active |
Discovery | Active |
EKS Cluster | Active |
Fastly Amsterdam (AMS) | Active |
Fastly Hong Kong (HKG) | Active |
Fastly London (LHR) | Active |
Fastly Los Angeles (LAX) | Active |
Fastly New York (JFK) | Active |
Fastly Sydney (SYD) | Active |
Full Platform | Active |
Google Apps Analytics | Active |
Logged In Users | Active |
Media | Active |
Mongo Cluster | Active |
Pharos | Active |
RabbitMQ | Active |
Redis Cluster | Active |
Sentry Dashboard | Active |
Stats | Active |
Talaria | Active |
WFE | Active |
View the latest incidents for RebelMouse and check for official updates:
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: July 12, 2024, 10:20 a.m.
Description:

# Chronology of the incident

* 12:41 UTC — we received a notification about slow performance.
* 13:40 UTC — as the cluster was rapidly scaling towards its maximum capacity, we began disconnecting non-critical services, because we knew most of them would put indirect load on the already fast-scaling main application cluster.
* 13:47 UTC — the scale-out caused an excessive number of connections to MongoDB.
* 13:49 UTC — one of the MongoDB replica set instances failed; we identified that the nofile limit (kernel parameter) had been reached.
* 13:52 UTC — started the MongoDB replica set instance recovery process.
* 15:01 UTC — applied increased nofile and nproc limits to all MongoDB servers.

# The impact of the incident

While the website functioned without meaningful disruption for end users and crawlers, the incident resulted in partial performance degradation of the editorial clusters and non-essential services, such as automations and JavaScript runtimes.

# The underlying cause

Because of increased demand on main cluster capacity, the MongoDB replica set hit the nofile and nproc limits.

# Actions taken & Preventive Measures

* Reconfigured the MongoDB cluster, doubling the nofile and nproc limits (a minimal sketch of inspecting these limits follows this entry).
* Reconfigured the MongoDB arbiter to reduce CoreDNS load by 5x.
Status: Postmortem
Impact: None | Started At: May 30, 2024, 1:52 p.m.
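The fix described in the postmortem above was to raise the nofile (open files) kernel limit on the MongoDB hosts, typically done host-wide via ulimit, /etc/security/limits.conf, or a systemd override. As a minimal sketch only, the Python snippet below shows how a single process can inspect and, within its hard ceiling, raise its own nofile limit; the target value is illustrative and not RebelMouse's actual setting.

```python
# Sketch: inspect and raise the per-process open-file limit (nofile),
# the kernel parameter the postmortem identifies as exhausted.
# This affects only the current process; the real remediation was host-wide.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"nofile: soft={soft} hard={hard}")

TARGET = 64000  # illustrative value, not the actual limit used in the incident fix
# An unprivileged process may raise its soft limit only up to the hard limit.
if hard == resource.RLIM_INFINITY or TARGET <= hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (TARGET, hard))
    print(f"raised soft nofile limit to {TARGET}")
else:
    print(f"hard limit {hard} is below {TARGET}; a host-level change is required")
```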
Description: This incident has been resolved.
Status: Resolved
Impact: None | Started At: May 29, 2024, 7:39 p.m.
Description:

# Chronology of the incident

At 16:27 UTC, we detected a significant load on our servers. By 16:43 UTC, it was identified that the CoreDNS server was suffering performance degradation due to the scaling out of applications within our Kubernetes cluster. This situation was further complicated by a performance degradation in our MongoDB database at 16:55 UTC, caused by an excessive number of open connections initiated by the scaling applications.

An emergency meeting was escalated at 17:04 UTC, and the source of the excessive load on the DNS servers was identified at 17:16 UTC. Measures were immediately taken to optimize DNS queries across the Kubernetes cluster by reducing the number of DNS clients, which mainly involved halting non-essential services. These measures led to an initial recovery of performance at 17:30 UTC, and subsequently a fix was developed for the CoreDNS configuration, identified as the root cause of the issues.

Unfortunately, at 19:16 UTC, a restart of CoreDNS led to performance degradation on the editorial clusters and unveiled the unavailability of one of the MongoDB replica set instances. The restart caused a cache purge, which highlighted the magnitude of the MongoDB performance degradation. We identified that this issue with MongoDB had a significant impact on the performance of our CoreDNS systems, further complicating the situation.

Recognizing the severity of the situation, we immediately launched a recovery process for the MongoDB replica set. As we progressed with damage control, a preliminary attempt was made to reinstate the halted services. Despite our efforts, the reactivation led to significant setbacks, notably impacting the overall performance of the editorial web platform. However, it's crucial to note that the websites for end users and crawlers maintained their functionality and continued to operate as expected with no major degradation. To reinforce our commitment to operational stability, we opted to keep the service offline pending a comprehensive investigation and resolution of the underlying issues with the MongoDB database. These measures facilitated a full recovery of the MongoDB system by 21:10 UTC. Post recovery, we continued to monitor the situation for a period before cautiously reactivating services, which signaled the end of the active incident.

# The impact of the incident

While the website functioned without meaningful disruption for end users and crawlers, the incident resulted in partial performance degradation of the editorial clusters and non-essential services, such as automations and JavaScript runtimes.

# The underlying cause

The incident was triggered by a combination of factors, including an aggressive web crawler, a surge in cache invalidations due to layout updates, and a suboptimal configuration of CoreDNS.

# Actions taken & Preventive Measures

Reconfigured the CoreDNS setup, significantly increasing the service capacity. As a preventive measure, we are going to update our in-house cache logic to spread the cache revalidation process over time and prevent request spikes to "origins" (a minimal sketch of this approach follows this entry).
Status: Postmortem
Impact: None | Started At: May 29, 2024, 5:09 p.m.
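The preventive measure named above is to spread cache revalidation over time so that entries invalidated together do not all hit the origins at the same moment. The Python sketch below illustrates one common way to do this, adding random jitter to cache TTLs; the names and values (BASE_TTL, fetch_from_origin) are hypothetical and not RebelMouse's in-house cache implementation.

```python
# Sketch: jittered cache TTLs to spread revalidation in time and avoid
# request spikes ("stampedes") against the origins after mass invalidation.
import random
import time

BASE_TTL = 300         # nominal cache lifetime in seconds (illustrative)
JITTER_FRACTION = 0.2  # spread expirations across +/-20% of the base TTL

cache: dict[str, tuple[float, str]] = {}  # key -> (expires_at, value)

def jittered_ttl() -> float:
    """Return a TTL randomized around BASE_TTL so entries expire at staggered times."""
    return BASE_TTL * (1 + random.uniform(-JITTER_FRACTION, JITTER_FRACTION))

def get(key: str, fetch_from_origin) -> str:
    """Serve from cache while fresh; otherwise revalidate against the origin."""
    now = time.time()
    entry = cache.get(key)
    if entry and entry[0] > now:
        return entry[1]
    value = fetch_from_origin(key)
    cache[key] = (now + jittered_ttl(), value)
    return value

if __name__ == "__main__":
    # Hypothetical origin fetch; in production this would be a backend request.
    print(get("homepage", lambda key: f"rendered page for {key}"))
```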