Get notified about any outages, downtime or incidents for Bitmovin and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Bitmovin.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
Sign Up Now

OutLogger tracks the status of these components for Bitmovin:
| Component | Status |
| --- | --- |
| Bitmovin Dashboard | Active |
| Player Licensing | Active |
| Analytics Service | Active |
| Analytics Ingress | Active |
| Export Service | Active |
| Query Service | Active |
| Bitmovin API | Active |
| Account Service | Active |
| Configuration Service | Active |
| Encoding Service | Active |
| Infrastructure Service | Active |
| Input Service | Active |
| Manifest Service | Active |
| Output Service | Active |
| Player Service | Active |
| Statistics Service | Active |
View the latest incidents for Bitmovin and check for official updates:
Description: All systems are fully operational again and all incoming data that was buffered during the outage is fully available again. We will conduct a thorough RCA and post a postmortem explaining the incident as well as the actions derived from it to prevent similar issues in the future.
Status: Resolved
Impact: Major | Started At: Nov. 6, 2024, 8:10 a.m.
Description: The issue is fully resolved and we suggest that customers re-run any exports that failed (see the sketch after this entry).
Status: Resolved
Impact: Minor | Started At: Oct. 30, 2024, 5:25 p.m.
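
For the export incident above, one way to act on the re-run suggestion is to list recent analytics export tasks through the Bitmovin REST API and re-create any that ended in an error state. The sketch below is a minimal illustration, not an official tool: the `/analytics/exports` path, the response envelope, the task fields copied into the new request, and the `ERROR` status value are assumptions based on common Bitmovin API conventions, so verify them against the official API reference before use.

```python
# Hypothetical sketch: re-submit failed Bitmovin analytics exports.
# Endpoint path, field names, and status values are assumptions; check
# the official Bitmovin API reference before relying on this.
import os
import requests

API_BASE = "https://api.bitmovin.com/v1"
HEADERS = {"X-Api-Key": os.environ["BITMOVIN_API_KEY"]}  # standard Bitmovin auth header


def list_export_tasks(limit: int = 100) -> list[dict]:
    """Fetch recent analytics export tasks (assumed endpoint and envelope)."""
    resp = requests.get(
        f"{API_BASE}/analytics/exports",
        headers=HEADERS,
        params={"limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]["items"]


def rerun_failed_exports() -> None:
    """Re-create any export task whose status indicates failure."""
    for task in list_export_tasks():
        if task.get("status") != "ERROR":  # assumed status value for failed exports
            continue
        # Re-submit the export with the same query window and output target.
        payload = {
            key: task[key]
            for key in ("name", "licenseKey", "start", "end", "output")
            if key in task
        }
        resp = requests.post(
            f"{API_BASE}/analytics/exports", headers=HEADERS, json=payload, timeout=30
        )
        resp.raise_for_status()
        new_id = resp.json()["data"]["result"]["id"]
        print(f"re-submitted export {task.get('id')} as {new_id}")


if __name__ == "__main__":
    rerun_failed_exports()
```

Set the `BITMOVIN_API_KEY` environment variable before running the script.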
Description: All analytics data is now backfilled and the system is fully operational again.
Status: Resolved
Impact: Major | Started At: Oct. 27, 2024, 8:40 p.m.
Description:

## Summary
A component in charge of provisioning infrastructure resources became overloaded, which caused long queue times and, in some cases, scheduling errors on AWS. To stabilize the system we scaled down job processing and then gradually scaled it up again in a controlled way.

## Date
The issue occurred on September 12, 2024, between 12:04 and 15:03. All times are in UTC.

## Root Cause
An unusual spike in encoding job submissions that was not smoothed out by our scheduling algorithm overloaded the component responsible for requesting instances for encoding job processing on AWS. The component could not handle the amount of work or recover on its own, affecting all other jobs on AWS.

## Implications
Encoding jobs that were started remained in the queued state. Some jobs failed to start and transitioned to the error state with a "Scheduling failed" message.

## Remediation
The engineering team quickly identified the affected component as the cause of the long queue times and "Scheduling failed" errors. The load on this component was reduced by delaying the processing of encoding jobs, which allowed it to recover. Once it had recovered, job processing was ramped back up to normal operations. The reduction in job processing also delayed non-AWS encoding jobs.

## Timeline
* 12:04 - The monitoring systems alerted the engineering team about an overloaded system component, and the team started investigating.
* 12:15 - The engineering team closely monitored the impacted component to assess the impact.
* 12:32 - The engineering team started investigating different approaches to let the impacted system recover.
* 13:30 - The engineering team identified that customer job processing on AWS was impacted and reduced the number of jobs processed in the system.
* 14:00 - The component recovered and the engineering team started to scale up encoding job processing again.
* 14:24 - Full processing capacity was restored and the system continued to process the queued jobs normally.
* 15:03 - The engineering team continued closely monitoring the systems.

## Prevention
Following the initial investigation, the engineering team will take the following actions to prevent a similar overload of this component in the future:
* Scale the underlying database to a larger instance type
* Improve the scheduling algorithm to smooth out peak load patterns
* Review data access patterns to avoid high load on the component

Finally, the specific scenario that led to the overload will be simulated in a separate environment to validate that the prevention measures work as expected. (A minimal sketch of this kind of load smoothing follows this entry.)
Status: Postmortem
Impact: Major | Started At: Sept. 12, 2024, 1:31 p.m.
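
The postmortem above attributes the overload to a spike in encoding job submissions that the scheduling algorithm did not smooth out, and both the remediation (temporarily delaying job processing) and the prevention plan (smoothing peak load patterns) amount to throttling how fast work reaches the provisioning component. The sketch below is a generic token-bucket style throttle that illustrates the idea; the class and job names are hypothetical, and this is not Bitmovin's actual scheduler.

```python
# Illustrative token-bucket throttle that smooths bursts of encoding-job
# provisioning requests before they reach an instance-provisioning backend.
# Hypothetical sketch; not Bitmovin's actual scheduling algorithm.
import collections
import time


class SmoothedScheduler:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec          # steady-state provisioning requests per second
        self.capacity = burst             # how much burst the provisioner can absorb
        self.tokens = float(burst)        # current provisioning budget
        self.last = time.monotonic()
        self.queue = collections.deque()  # jobs waiting for a provisioning slot

    def submit(self, job_id: str) -> None:
        """Queue a job instead of provisioning immediately; the queue absorbs spikes."""
        self.queue.append(job_id)

    def _refill(self) -> None:
        """Accrue tokens at the configured rate, capped at the burst capacity."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def dispatch(self) -> list[str]:
        """Release at most as many queued jobs as the current token budget allows."""
        self._refill()
        released = []
        while self.queue and self.tokens >= 1.0:
            self.tokens -= 1.0
            released.append(self.queue.popleft())
        return released


if __name__ == "__main__":
    scheduler = SmoothedScheduler(rate_per_sec=5, burst=10)
    # Simulate an unusual spike: 100 jobs submitted at once.
    for i in range(100):
        scheduler.submit(f"encoding-job-{i}")
    # The provisioner now sees a steady ~5 requests/second instead of a 100-job burst.
    while scheduler.queue:
        batch = scheduler.dispatch()
        if batch:
            print(f"provisioning {len(batch)} job(s), starting with {batch[0]}")
        time.sleep(0.5)
```

The design choice here is that a sudden burst of submissions fills a queue rather than hitting the provisioning component directly, and jobs are then released at a steady rate the downstream component is known to handle.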
Description: Between 11:15 and 13:45, Bitmovin Analytics data collection experienced load balancing issues in our European datacenter. The load balancer began excessively auto-scaling our fleet of instances and subsequently performed a full traffic shift to the US datacenter. Following the traffic shift, the instances in Europe were terminated and restarted, causing some requests on those instances to be lost and not written to our database. We stabilized the system by modifying the load balancing behavior and are currently investigating the root cause of this incident. We apologize for the inconvenience and will post a full Root Cause Analysis (RCA) once the investigation is complete, along with corrective actions taken to prevent similar issues in the future.
Status: Resolved
Impact: None | Started At: June 14, 2024, 9:15 a.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.