Get notified about any outages, downtime or incidents for Bitmovin and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Bitmovin.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Bitmovin:
Component | Status |
---|---|
Bitmovin Dashboard | Active |
Player Licensing | Active |
Analytics Service | Active |
Analytics Ingress | Active |
Export Service | Active |
Query Service | Active |
Bitmovin API | Active |
Account Service | Active |
Configuration Service | Active |
Encoding Service | Active |
Infrastructure Service | Active |
Input Service | Active |
Manifest Service | Active |
Output Service | Active |
Player Service | Active |
Statistics Service | Active |
View the latest incidents for Bitmovin and check for official updates:
Description: Between 11:15 and 13:45, Bitmovin Analytics data collection experienced load balancing issues in our European datacenter. The load balancer began excessively auto-scaling our fleet of instances and subsequently performed a full traffic shift to the US datacenter. Following the traffic shift, the instances in Europe were terminated and restarted, causing some requests on those instances to be lost and not written to our database. We stabilized the system by modifying the load balancing behavior and are currently investigating the root cause of this incident. We apologize for the inconvenience and will post a full Root Cause Analysis (RCA) once the investigation is complete, along with corrective actions taken to prevent similar issues in the future.
Status: Resolved
Impact: None | Started At: June 14, 2024, 9:15 a.m.
Description: The incident has been resolved and all data was successfully backfilled as of 16:55 on May 21, 2024.
Status: Resolved
Impact: Major | Started At: May 21, 2024, 2:22 p.m.
Description:
## Summary
A manual cleanup routine stalled and held a lock on certain database tables that are needed to manage encoding jobs. The related API endpoints returned HTTP 500 errors during that time, so customers depending on those endpoints (either directly via the API or indirectly via the dashboard) could not use them. After the cause was identified and fixed, the affected endpoints returned to normal operation.
## Date
The issue occurred on December 1, 2023, between 07:40 and 08:15. All times in UTC.
## Root Cause
A routine manual cleanup procedure acquired locks on certain database tables and then stalled, so the locks could not be released. Services depending on this database were impacted and unable to process API requests.
## Implications
Customers were not able to start encodings, some encoding jobs had longer than expected turnaround times, and the affected API requests targeting the encoding endpoint returned HTTP 500 errors.
## Remediation
The faulty database operation was identified and terminated (a sketch of this kind of lock diagnosis follows this entry).
## Timeline
07:40 - Internal alerts notified the team about failures.
07:50 - The team began investigating.
08:00 - The faulty component was identified. The team began investigating the involved operations.
08:15 - The faulty operation was identified and terminated. The affected service recovered.
08:20 - The team kept monitoring and verified proper operation of the service.
## Prevention
The cleanup process has been updated to no longer use the procedure that caused this incident. The team will analyze the procedure in detail to understand why it locked the database and stalled, and will add measures to prevent it from stalling. Once the updated procedure is safe again, the team will resume using it for the required maintenance tasks.
Status: Postmortem
Impact: Major | Started At: Dec. 1, 2023, 8:04 a.m.
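To make the remediation above more concrete, here is a minimal sketch of how a blocking database session could be identified and terminated. It assumes a PostgreSQL database and the psycopg2 driver purely for illustration; the incident report does not name the database engine, and this is not Bitmovin's actual tooling.

```python
# Illustration only: assumes PostgreSQL; the incident report does not
# name the database engine or the tooling that was actually used.
import psycopg2

conn = psycopg2.connect("dbname=encoding")  # hypothetical connection string
conn.autocommit = True

with conn.cursor() as cur:
    # Find sessions that currently block other sessions, i.e. hold locks
    # that queries coming from the API are waiting on.
    cur.execute("""
        SELECT DISTINCT blocking.pid, blocking.state, blocking.query
        FROM pg_stat_activity AS blocked
        JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON TRUE
        JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid
    """)
    for pid, state, query in cur.fetchall():
        print(f"blocking pid={pid} ({state}): {query[:80]}")
        # Terminating the stalled session releases its locks so that the
        # encoding-related endpoints can reach the tables again.
        cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
```

In practice one would inspect the blocking query before terminating it rather than killing every blocker automatically.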
Description:
## Summary
Bitmovin’s engineering team observed failing encoding jobs configured to run on Azure and was informed of suspicious activity flagged in the Microsoft Azure subscription used for Bitmovin Managed Encoding in Azure regions. Launching infrastructure on this subscription was deactivated without prior notification. This prevented Bitmovin from launching new compute infrastructure, causing encoding jobs to fail with “Scheduling failed” error messages. Encoding jobs configured to run in other cloud regions, such as AWS or Google, were not affected at any time. Customers were instructed to fall back to AWS and Google cloud regions, and Bitmovin moved all compute to an alternative Azure subscription to unblock customers running encoding jobs in Azure regions. Microsoft later acknowledged that the detection, and therefore the resource block on Bitmovin’s Azure subscription, was incorrect.
## Date
The issue occurred between September 11, 2023, 14:12 and September 13, 17:09. All times in UTC.
## Root Cause
Microsoft Azure’s suspicious-activity detector incorrectly identified Bitmovin’s request for additional resources as suspicious, leading to the deactivation of Bitmovin’s main subscription. Microsoft has since determined that the detection logic was too stringent in looking at abuse patterns, adjusted it, and applied further quality controls to avoid resource blocks being applied incorrectly. Because of the incorrectly applied resource block, Bitmovin’s scheduling logic received missing-capacity errors when requesting new instances in the main Azure subscription, which surfaced as “Scheduling failed” error messages for customers running encoding jobs in Azure regions.
## Implications
Workloads scheduled by customers using Managed Encoding in Azure could not be processed; those encoding jobs immediately transitioned to the error state. Other cloud vendors were not affected. The Cloud Connect feature for Azure infrastructure was partially and temporarily impacted.
## Remediation
Affected customers were notified and advised to change their encoding job configuration to use another cloud provider. Customer communication was handled directly by the Bitmovin Customer Experience team and via the status page. The Bitmovin engineering team switched the managed Azure workloads to an alternative subscription that was not affected by the resource blocks.
## Timeline
Sep 11, 14:12 - The engineering team observed failing encoding jobs configured to run on Azure and started investigating.
Sep 11, 14:16 - The engineering team identified a resource block on the Bitmovin Azure subscription as the root cause of the failures.
Sep 11, 15:30 - A support case was opened with the Azure partner and escalated via partner contacts.
Sep 11, 15:40 - The Bitmovin support team started contacting customers running in Azure regions and advised them to switch their encoding job configuration to an alternative cloud provider.
Sep 12, 07:00 - The engineering team started working on switching the Azure encoding workloads to another Azure subscription.
Sep 12, 10:04 - The engineering team updated the scheduling logic to make a limited set of Azure regions available again for encoding workloads on the prepared Azure subscription. Turnaround times were longer than usual as it was not yet running at full capacity.
Sep 12, 16:09 - The remaining Azure regions were made available for Azure encoding workloads using the same strategy.
Sep 12, 21:17 - The Azure support ticket to remove the resource blocks was escalated manually via Microsoft directly.
Sep 13, 13:15 - Engineering rolled out an update that restored normal turnaround times for Azure encoding jobs.
Sep 13, 17:09 - The Bitmovin incident was resolved; all Bitmovin customers could run encoding jobs on Azure again.
Sep 14, 08:00 - The engineering team began working with the partner and Microsoft to reactivate the original Azure subscription, to understand why it was disabled, and to find a way to prevent this in the future.
Sep 14, 17:30 - Microsoft provided Bitmovin with an RCA stating that newly added compromise-detection logic was too stringent and that a response analyst inaccurately validated the subscription, leading to the resource blocks being applied incorrectly.
## Prevention
The engineering team will work with Microsoft Azure and the partner to prevent such situations in the future. The team will keep the Azure subscription failover in place as a temporary solution and adapt tooling to make switching between Azure subscriptions easier (an illustrative failover sketch follows this entry). Microsoft confirmed it has adjusted the incorrect detection and applied further quality controls to avoid resource blocks being applied incorrectly.
Status: Postmortem
Impact: Major | Started At: Sept. 11, 2023, 3:15 p.m.
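The prevention notes above mention keeping a subscription failover in place. The sketch below illustrates that idea only in outline: a scheduler tries each configured subscription in order and reports a scheduling failure only when all of them reject the request. The names `CapacityError` and `request_instances` and the subscription identifiers are hypothetical; this is not Bitmovin's scheduling code or API.

```python
# Hypothetical sketch of subscription failover; not Bitmovin's scheduler.
class CapacityError(Exception):
    """Raised when the provider cannot or will not grant instances
    (missing capacity, resource block, ...)."""

def request_instances(subscription: str, region: str, count: int) -> list[str]:
    """Placeholder for the cloud provider API call; returns instance IDs."""
    raise NotImplementedError

def schedule(job_id: str, subscriptions: list[str], region: str, count: int) -> list[str]:
    """Try each configured subscription in order; only report
    'Scheduling failed' when every subscription rejects the request."""
    last_error: Exception | None = None
    for subscription in subscriptions:
        try:
            return request_instances(subscription, region, count)
        except CapacityError as exc:
            last_error = exc  # blocked or out of capacity: try the next one
    raise RuntimeError(f"Scheduling failed for job {job_id}: {last_error}")
```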
Description: We continued to monitor the situation; everything is working as expected and the error rates in our monitoring are at normal levels. As the incident was rather short, our recommended retry behavior for 5xx errors on the API (see the sketch after this entry) should have kept the impact minimal. The root cause was a database migration that added a column to a very large table, which locked the table for a few minutes. This table lock prevented most encoding-related API calls from executing successfully. With the reduction of the database size, which will start in July, this will no longer happen. We have additionally raised awareness in the team to pay extra attention when modifying very large database tables.
Status: Resolved
Impact: Major | Started At: June 22, 2023, 11:12 a.m.
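The update above refers to the recommended retry behavior for 5xx errors on the API. A minimal client-side sketch of that pattern, using the Python requests library with exponential backoff, is shown below; the URL, headers, and parameter values are illustrative and not taken from the incident report.

```python
# Minimal sketch of retrying transient 5xx responses with exponential backoff.
import time
import requests

def post_with_retry(url: str, json: dict, headers: dict,
                    attempts: int = 5, base_delay: float = 1.0) -> requests.Response:
    """POST and retry on 5xx responses with exponential backoff."""
    for attempt in range(attempts):
        response = requests.post(url, json=json, headers=headers, timeout=30)
        if response.status_code < 500:
            return response  # success, or a client error worth surfacing as-is
        # Server-side error: wait 1s, 2s, 4s, ... and try again.
        time.sleep(base_delay * 2 ** attempt)
    return response  # still failing after the last attempt
```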
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.