Outage and incident data over the last 30 days for OpenAI.
OutLogger tracks the status of these components for OpenAI:
Component | Status |
---|---|
API | Active |
ChatGPT | Active |
Labs | Active |
Playground | Active |
View the latest incidents for OpenAI and check for official updates:
Description: Between 6:34 AM PT and 8:40 AM PT today, we experienced a service disruption affecting ChatGPT users, resulting in higher-than-usual error rates and delays. Our team identified and addressed the issue by increasing system capacity. All services are now operating normally. This incident is now resolved.
Status: Resolved
Impact: Major | Started At: May 22, 2024, 2:01 p.m.
Description: On May 21, 2024, from 11:25 am PT to 12:26 pm PT, a large portion of requests to ChatGPT gpt-4o free tier were failing with 5xx status codes. At 11:30 am PT, engineers noticed that the configured routing information for gpt-4o no longer had any backing services and took immediate action to reapply the expected configuration to the system. First, rate limits were decreased for free tier traffic to prevent overwhelming the few instances that would be backing gpt-4o. Traffic was then slowly dialed back up as the system performing the reconfiguration gradually added more of the backing services to the routing configuration. Due to the nature of the systems, this configuration process ramps up traffic in a staged manner and takes several minutes between steps to configure more capacity for a service, which is why the incident lasted much longer than the time it took to begin the mitigation.
The root cause was later determined to be a code bug in a new code path that enables draining all workloads in a cluster, an operation that was being attempted for the first time. An error in the logic led to a misconfiguration that resulted in 100% loss of the services that back gpt-4o free tier, spanning multiple clusters, instead of impacting just a single cluster. This meant there were no backing services configured to answer requests for gpt-4o free tier traffic, in line with the symptoms observed by the triaging engineer.
The core bug causing the incident has been fixed. Further hardening is being undertaken to add inertia to the cluster drain process, so as to avoid the same level of catastrophic loss, and to warn operators in advance before large actions are taken. The systems are also being adapted to better explain specifically what actions will be performed when enacting such large operations. Additionally, the team is making it easier to undo changes quickly; the difficulty of doing so prevented us from reverting the issue sooner.
We know that outages to the ChatGPT service affect our customers. While we came up short here, we are committed to preventing such incidents in the future.
Status: Postmortem
Impact: Major | Started At: May 21, 2024, 6:40 p.m.
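The hardening described in the postmortem above (previewing what a drain will do, and refusing actions that would remove too much capacity) can be illustrated with a minimal sketch. This is not OpenAI's implementation: the `Backend` record, the `plan_drain` and `check_drain` functions, the `DrainRefused` exception, and the 50% threshold are all hypothetical names and values chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    service: str  # hypothetical service label, e.g. "gpt-4o-free"
    cluster: str  # cluster hosting this backend


class DrainRefused(Exception):
    """Raised when a drain would remove too much of a service's capacity."""


def plan_drain(backends, cluster):
    """Dry run: list the backends that draining `cluster` would remove."""
    return [b for b in backends if b.cluster == cluster]


def check_drain(backends, cluster, max_loss_fraction=0.5):
    """Validate a drain plan before applying it (threshold is an assumption).

    Returns the plan if it is safe; raises DrainRefused otherwise.
    """
    to_remove = plan_drain(backends, cluster)

    # Guard 1: a single-cluster drain must never touch another cluster.
    if any(b.cluster != cluster for b in to_remove):
        raise DrainRefused("plan touches backends outside the target cluster")

    # Guard 2: no service may lose more than the allowed share of its backends.
    totals = Counter(b.service for b in backends)
    losses = Counter(b.service for b in to_remove)
    for service, lost in losses.items():
        fraction = lost / totals[service]
        if fraction > max_loss_fraction:
            raise DrainRefused(
                f"{service} would lose {fraction:.0%} of its backends"
            )
    return to_remove
```

An operator-facing tool built this way would print the returned plan for review before applying it, so that a drain affecting every backend of a service is rejected before any routing change is made.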
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: May 17, 2024, 1:18 p.m.