
Is there a Qubole outage?

Qubole status: Systems Active

Last checked: a minute ago

Get notified about any outages, downtime or incidents for Qubole and 1800+ other cloud vendors. Monitor 10 companies for free.

Subscribe for updates

Qubole outages and incidents

Outage and incident data over the last 30 days for Qubole.

There have been 0 outages or incidents for Qubole in the last 30 days.


Tired of searching for status updates?

Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!

Sign Up Now

Components and Services Monitored for Qubole

OutLogger tracks the status of these components for Qubole:

Cluster Operations Active
Command Processing Active
Notebooks Active
QDS API Active
Quantum Active
Qubole Community & Support Portal Active
Qubole Scheduler Active
Site Availability Active

Latest Qubole outages and incidents.

View the latest incidents for Qubole and check for official updates:

Updates:

  • Time: May 16, 2022, 11:18 p.m.
    Status: Resolved
    Update: The issue with degradation in performance on API has been resolved. Customers should be able to execute their jobs and workloads now.
  • Time: May 16, 2022, 10:15 p.m.
    Status: Identified
    Update: The issue with degradation in performance on API has been resolved. Customers should be able to execute their jobs and workloads now.
  • Time: May 16, 2022, 7:04 p.m.
    Status: Identified
    Update: The available space in the RStore MySQL database is reaching its limit and causing degraded performance. To resolve the issue we are in the process of the following: 1. The quickest resolution is for AWS to convert our filesystem on classic from 32-bit to 64-bit. This will expand the MySQL limit on data file size from 2TB to 64TB. 2. For AWS to complete this we need to create a read replica for them to convert while the current rStore stays running. 3. Once the conversion is complete we can cut over to the new instance on the 64-bit file system. 4. AWS estimates about 8 hrs to do the conversion; the read replica is currently being created. Once we have that, we will have an estimate for a fix to the performance degradation.
  • Time: May 16, 2022, 5:09 p.m.
    Status: Investigating
    Update: We are seeing a degradation in performance on API and are looking into it currently, we will have an update in the next hour.
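
The May 16 updates above attribute the slowdown to a MySQL data file approaching the 2TB limit of a 32-bit filesystem. As a rough, non-authoritative illustration of how an operator might watch growth toward a limit like that, here is a minimal Python sketch that lists the largest tables via information_schema; the PyMySQL driver, hostname, and credentials are placeholder assumptions, not details taken from the incident.

```python
# Hypothetical sketch: report the largest tables in a MySQL instance so growth
# toward a hard size limit (here ~2 TB, as cited in the updates) is visible early.
# Driver, host, and credentials are placeholder assumptions.
import pymysql

TWO_TB = 2 * 1024 ** 4  # 2 TiB in bytes

conn = pymysql.connect(
    host="rstore.example.internal",  # placeholder host
    user="readonly",
    password="secret",
    database="information_schema",
)
try:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT table_schema, table_name,
                   data_length + index_length AS total_bytes
            FROM information_schema.tables
            WHERE table_type = 'BASE TABLE'
            ORDER BY total_bytes DESC
            LIMIT 20
            """
        )
        for schema, table, total_bytes in cur.fetchall():
            size = int(total_bytes or 0)
            pct = 100.0 * size / TWO_TB
            print(f"{schema}.{table}: {size / 1024**3:.1f} GiB ({pct:.1f}% of 2 TiB)")
finally:
    conn.close()
```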

Updates:

  • Time: May 12, 2022, 7:50 a.m.
    Status: Resolved
    Update: The issue with the rStore database has been resolved. Customers should be able to execute their jobs and workloads now.
  • Time: May 12, 2022, 3:02 a.m.
    Status: Monitoring
    Update: The issue with the rStore database has been resolved. Customers should be able to execute their jobs and workloads now.
  • Time: May 12, 2022, 1:06 a.m.
    Status: Identified
    Update: We are still proceeding with the plan as outlined and on track to complete by 12:00 CST. We will update here if there are any changes.
  • Time: May 11, 2022, 10:10 p.m.
    Status: Identified
    Update: Upon further investigation and working with AWS support we have a new update and plan: 1. In working with AWS this afternoon, DevOps figured out that a table reached the MySQL 2TB limit. This table is a system table so we cannot delete data. 2. The cause is that multiple tables are writing to the same file. Good practice would have been to have a separate datafile for each table, which was not the case. 3. To fix they will: -Backup a handful of tables they are going to move data into their own files. -Drop those tables and recreate them with their own data files. -Restore the data to those tables which should move the data into their own data files and split it out of the data file with the 2TB limit thus freeing space. 4. This should defragment the database and free up space while decreasing the file size of the data file running into the limit. This will be a temporary measure to get back up and running. The process of testing and implementation should take the next 8 hrs or so depending on the data load. We estimate that by 12:00 CST to be complete and back up. The long term solution is to rebuild the entire database. That can be done offline and then cutover to it once it's ready, so no downtime would be involved. We have done similar updates in the other regions with no impact or downtime with customers.
  • Time: May 11, 2022, 8:04 p.m.
    Status: Identified
    Update: As per the last update, we are still in the process of moving the data.
  • Time: May 11, 2022, 3:08 p.m.
    Status: Identified
    Update: Latest Update: What caused the outage * The Rstore database had a table that filled up and also caused the disk space to fill up, which caused the database to not respond. Customers are not able to run jobs because of the unresponsive Rstore database. What has been done to resolve so far * Increased memory and storage on instance * The table was cleared but the disk space was not reclaimed and is still full. * Engaged AWS and determined that we cannot set the parameter for the table to autoscale because it has to be set upon creation. * Created a new instance from the old database with increased storage and memory. What’s Next * The new MySQL database is in place, and setup is complete. * Export data to S3 from prior DB, in progress. * Import Data from prior instance to new instance. Estimated ETA to complete the data load is 24hrs due to the size of the MySQL database (1TB+). We are working with AWS to identify any methods to decrease data load time. We will provide updates here if there is any change to the timeline.
  • Time: May 11, 2022, 10:13 a.m.
    Status: Identified
    Update: -Right now, the Task is Under Investigation. -Given the current RDS DB (MySQL) instance is using the deprecated major version (5.6.39) and the tablespace seems full even after applying the innodb_file_per_table=1. -The team is currently working to migrate the environment along with DB to a supported version of MySQL. We are continuing to investigate and will update accordingly.
  • Time: May 11, 2022, 3:50 a.m.
    Status: Identified
    Update: Latest updates: -Cleared the storage issues and the low memory on the longer running tunnels. -Updated the RDS memory from 5000 GB to 5500 GB in the production rstore RDS instance as well as the replicate production rstore. This takes about 6 hours as per Amazon document. We started it about 5PM CST, so around 11PM CST the updated instance with added memory size should be up and running After taking steps to free up storage the issue still exists and the storage is not being released. We are continuing to investigate and will update accordingly.
  • Time: May 11, 2022, 1:15 a.m.
    Status: Identified
    Update: We continue to work on clearing resources and expanding the limits in the rStore database. We should have an ETA shortly.
  • Time: May 10, 2022, 9:37 p.m.
    Status: Identified
    Update: We have identified a full table in the Rstore database that appears to be causing the issue. We are in the process of clearing that condition.
  • Time: May 10, 2022, 7:39 p.m.
    Status: Investigating
    Update: Several customers are experiencing issues when scheduling jobs. We are looking into the matter and will update shortly.
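
Two of the remediation steps described in the updates above are RDS-level operations: growing the instance's allocated storage (the 5000 GB to 5500 GB change) and creating a read replica that can be converted or rebuilt while the primary keeps serving. Purely as an illustration of what those calls look like against the AWS API, here is a hedged boto3 sketch; the instance identifiers, region, and sizes are placeholders rather than the actual production values.

```python
# Hypothetical sketch of the two RDS operations mentioned in the updates:
# (1) grow allocated storage, (2) create a read replica for an offline conversion/cutover.
# Identifiers, region, and sizes are placeholders, not the real runbook.
import boto3

rds = boto3.client("rds", region_name="us-east-1")  # region is an assumption

# 1. Increase allocated storage (the updates mention moving from 5000 GB to 5500 GB).
rds.modify_db_instance(
    DBInstanceIdentifier="rstore-prod",      # placeholder identifier
    AllocatedStorage=5500,                   # target size in GiB
    ApplyImmediately=True,
)

# 2. Create a read replica that can be converted while the primary keeps running,
#    then cut over once it is ready.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="rstore-prod-replica",
    SourceDBInstanceIdentifier="rstore-prod",
)

# Block until the replica reports as available before planning any cutover.
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="rstore-prod-replica"
)
print("replica available")
```

The per-table tablespace split (innodb_file_per_table) described in the May 11 update happens inside MySQL itself and is not shown here.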

Updates:

  • Time: May 3, 2022, 5:01 p.m.
    Status: Resolved
    Update: The degraded performance issue on wellness.qubole.com is resolved.
  • Time: May 3, 2022, 1:04 p.m.
    Status: Identified
    Update: DevOps continues to work on the resolution for the issue in the control plane for wellness.qubole environment. We should have the issue resolved within the next few hours. We will update accordingly.
  • Time: May 3, 2022, 4:10 a.m.
    Status: Identified
    Update: DevOps has identified the issue with an internal component and they are implementing a fix for this to resolve the issue at the earliest.
  • Time: May 3, 2022, 3:24 a.m.
    Status: Identified
    Update: DevOps has identified the issue with an internal component and they are implementing a fix for this to resolve the issue at the earliest.
  • Time: May 2, 2022, 10:05 p.m.
    Status: Investigating
    Update: We are facing the degraded performance in wellness.qubole environment and DevOps team is investigating the issue further

Updates:

  • Time: March 18, 2022, 11:28 a.m.
    Status: Resolved
    Update: The degraded performance issue on api.qubole.com is resolved.
  • Time: March 18, 2022, 5:10 a.m.
    Status: Identified
    Update: Thank you for your patience as we completed resolution of the issues pertaining to the Qubole API control plane. Customers' scheduled jobs should run successfully at this point. If you are still seeing issues, please communicate those to support. 1. We will also continue to monitor all job queues. Currently all jobs seem to be queuing and ending fine.
  • Time: March 17, 2022, 11:52 p.m.
    Status: Identified
    Update: Thank you for your patience as we completed resolution of the issues pertaining to the Qubole API control plane. Customers' scheduled jobs should run successfully at this point. If you are still seeing issues, please communicate those to support. 1. We will continue to monitor the bastion node connectivity. 2. We will also continue to monitor all job queues. Currently all jobs seem to be queuing and ending fine.
  • Time: March 17, 2022, 8:06 p.m.
    Status: Identified
    Update: Thank you for your patience as we work to resolve the last remaining issues pertaining to the Qubole API control plane. We have solved all the technical issues except for two points noted in points 1) and 4) below. Most customers' scheduled jobs should run successfully at this point. If you are still seeing issues, you may be one of the customers mentioned in point 1) or you may be experiencing an issue mentioned in point 3). Please communicate with support if you are still facing issues. 1.The "thrift.transport.TTransport.TTransportException” coming from python when attempting to make connection to VPC subnets seems to have been fixed for the customers that were impacted. We are currently verifying with those customers and will continue to monitor the bastion node connectivity. 2.We continue to monitor all queue jobs. Currently all jobs seem to be queueing and ending fine.
  • Time: March 17, 2022, 4:48 p.m.
    Status: Identified
    Update: Thank you for your patience as we work to resolve the last remaining issues pertaining to the Qubole API control plane. We have solved all the technical issues except for two points noted in points 1) and 4) below. Most customers' scheduled jobs should run successfully at this point. If you are still seeing issues, you may be one of the customers mentioned in point 1) or you may be experiencing an issue mentioned in point 4). Please communicate with support if you are still facing issues. 1. The "thrift.transport.TTransport.TTransportException” coming from python when attempting to make connection to VPC subnets is still occurring on select accounts. This error seems to be coming due to connectivity issues from the customer side. This is preventing communication to the Bastion nodes. We have verified this for one customer and informed them. We have found 5 more customer instances and are confirming. We are working with these customers to resolve. 2. DevOps has completed clearing jobs stuck in the queue. They will continue to monitor. 3. We continue to monitor all queue jobs. Currently all jobs seem to be queueing and ending fine. 4. The team found a shared tunnel elastic IP which is dangling (not mapped to tunnel server). The Team mapped it to an active tunnel this seems to resolve some of the connectivity issues we were seeing in VPC environments.
  • Time: March 17, 2022, 1:58 p.m.
    Status: Identified
    Update: The team is working on connectivity between SQL and RDS, also checking the tunnels which are configured and failing. Few accounts are still facing encryption errors (intermittent issues ) but these are network connectivity specific issues and team is working on it. Also some tunnel server ips are added on the cluster feature page.
  • Time: March 17, 2022, 9:34 a.m.
    Status: Identified
    Update: Technical team is trying to resolve "thrift.transport.TTransport.TTransportException error associated with tunnels, get_metastore_for_account - Couldn't create encrypted channel to rds and Unable to connect to bastion node. Once these errors resolved, will correct issues with command execution and issues with clusters. In addition, the technical team is monitoring the scheduled jobs.
  • Time: March 17, 2022, 6:08 a.m.
    Status: Identified
    Update: Devops Team have shared some of the most recent findings and ongoing activities to resolve the issue. There were three issues reported by customers: Commands were getting stuck from UI Clusters were not starting Scheduled jobs were getting stuck There is a common root cause behind these problems. Investigation from the team suggests that due to recent VPC changes (moving from classic non-vpc to vpc) some of the tunnel configurations have been impacted. Hence encrypted channels are intermittently failing from Qubole's control plane to customers' data plane. Team is working to rectify this. Regarding scheduled jobs- Team has now cleared all the stuck jobs from last one week and is continuing to monitor the service.
  • Time: March 17, 2022, 12:01 a.m.
    Status: Identified
    Update: Devops Teams have shared some of the most recent findings and ongoing activities to resolve the issue. 1. The suspected cause of api.q outage appears to be moving Scheduler tier from classic (non-vpc) to vpc by DevOps on 8th March without assessing the risk as part of R60 rollout preparation. Since the scheduler stopped working from 8th March, other infra components failed, tunnel servers were affected. The R60 build was also put on api.q. that later led to cross-connection between different conduits. The architects are reviewing all code in case we need revert and/or make configuration changes. 2. DevOps is continuing to clear jobs stuck in the queue. They are doing this incrementally so as not to overwhelm the tunnel servers as they begin to run. 3. We determined tunnels are misconfigured causing performance issues. We are fixing and changing tunnel servers out. This should fix the tunnel issues 4. The issue with python connectivity in VPC environments has been resolved and we are monitoring. We are seeing intermittent connectivity issues from the tunnel servers to the metastore for various customers. We believe that once we finish addressing #3 that this will be resolved. 5. The scheduler does nothing but run the job schedule and does not execute code. The architects are currently comparing the code in the scheduler on API with the code on US, which is working fine, to see what the code differences are.
  • Time: March 16, 2022, 7:16 p.m.
    Status: Identified
    Update: Devops Teams have shared some of the most recent findings and ongoing activities to resolve the issue. 1. The suspected cause of api.q outage appears to be moving Scheduler tier from classic (non-vpc) to vpc by DevOps on 8th March as part of R60 rollout preparation. Since the scheduler stopped working from 8th March, other infra components failed, tunnel servers were affected. The R60 build was also put on api.q. That later led to cross-connection between different conduits. The architects are piecing together any code that was not reverted back and/or any configuration changes that need to be reverted. 2. DevOps is clearing jobs that were stuck in the queue. They are doing this incrementally so as not to overwhelm the tunnel servers as they begin to run. 3. We determined that due to over rotation of tunnels that all tunnels are misconfigured. This is being addressed and should fix the tunnel issues. 4. The connection issue with python is due to connectivity issues with customers using private subnet. They can use a non-VPC or a VPC. It appears to be only VPC connections. We have moved python expertise to triage and solve this issue. 5. The scheduler does nothing but run the job schedule and does not execute code. The architects are currently comparing the code in the scheduler on API with the code on US, which is working fine, to see what the code differences are.
  • Time: March 16, 2022, 2:57 p.m.
    Status: Identified
    Update: The Devops team has cleared all the stuck jobs which were in the submitted state from March 8th Onwards. The Team is currently monitoring the requeued jobs which were under processing. To debug the intermittent job failure issue, team has put loggers on the code which was giving error. Team is also analyzing the RDS logs to check if any configuration change is required.
  • Time: March 16, 2022, 11:33 a.m.
    Status: Identified
    Update: The commands submitted manually are working fine. The Devops team performed the checks on the Tunnels and Nodes. The team discovered multiple jobs which were stuck in Scheduler, the team is Clearing all the stuck jobs.
  • Time: March 16, 2022, 7:53 a.m.
    Status: Identified
    Update: A few more errors of cross-connection between different conduits got detected. The technical team is debugging further.
  • Time: March 16, 2022, 4:12 a.m.
    Status: Identified
    Update: DevOps team is still working on the root cause of the issue and trying to resolve it soon.
  • Time: March 16, 2022, 1:25 a.m.
    Status: Identified
    Update: Our DevOps found some underlying issue that is causing the failure of the intermittent job and looking to find the root cause. DevOps team is actively monitoring all queue jobs and the jobs seem to queuing and ending fine.
  • Time: March 15, 2022, 10:48 p.m.
    Status: Identified
    Update: Qubole has been experiencing periodic outages on api.qubole. We are working to resolve this. We have resolved most issues, but if you are still experiencing issues with scheduled jobs not starting and finishing file a ticket with support. Below is a list of what has been done so far as well as our next plan of action to stabilize the platform. 1. Issue: The memcache connectivity from worker nodes was lost Resolution: Fixed the issues with the configuration and replaced the worker nodes. This fixed the major issues of jobs getting stuck Status : Done 2. Issue: Some of the worker and discovery tier nodes were trying to connect to VPC based Redis server and were failing . Resolution: Team fixed this issue by pointing the Redis server DNS to non VPC based Redis and resolved further issues in jobs getting stuck Status : Done 3. Issue: Common tunnels and dedicated tunnels started giving error due to heavy load built up due to jobs pileup. Resolution: The common tunnels and dedicated tunnels were replaced with new tunnel machines to ease the traffic. Status : Done 4. Issue: Chef Run is failing to execute few shell commands in the scheduler node. Resolution: Trying to re-run the Chef client manually, Issue fixed. Status : Done 5. Issue: Python connection issue Resolution: There still seems to be an underlying issue which is causing the intermittent jobs failure. Status : In-Progress 6. Issue: Cleanup of stuck jobs and clusters Resolution: Various customers are still seeing intermittent issues. These are due to cleanup needing to be done on a customer basis. We are waiting for further response from these customers Status : In-Progress 7. Issue: 100% disc space full Resolution: We have rotated the worker, client and web app tier nodes which were facing space issues Status : Done 8. Issue: Cleared jobs stuck in processing since March 7th Resolution: Changes the status to canceled Status : Done 9. Issue: Found Python 3.8 errors Resolution: Found the error in logs Status : In-progress. 10.Issue: Monitoring all jobs Resolution: DevOps team is actively monitoring all queued jobs Status : In-progress. . Current Action Items: -Debug the remaining customer issues and resolve asap -Monitoring the application health and quickly replace the bad tunnels (temporary measure till we resolve issue #6) -Working closely with Qubole Support team to quickly address the customer concerns related to specific clusters and critical jobs
  • Time: March 15, 2022, 7:47 p.m.
    Status: Identified
    Update: Our DevOps found some underlying issue that is causing the failure of the intermittent job and looking to find the root cause. DevOps team is actively monitoring all queue jobs and the jobs seem to queuing and ending fine.
  • Time: March 15, 2022, 5:49 p.m.
    Status: Identified
    Update: Our DevOps found some underlying issue that is causing the failure of the intermittent job and looking to find the root cause. DevOps team is actively monitoring all queue jobs and some of the jobs seem to queuing and ending fine.
  • Time: March 15, 2022, 2:50 p.m.
    Status: Identified
    Update: Our DevOps found some underlying issue that is causing the failure of the intermittent job and looking to find the root cause. DevOps team is actively monitoring all queue jobs and the jobs seem to queuing and ending fine.
  • Time: March 15, 2022, 11:44 a.m.
    Status: Identified
    Update: DevOps Team has identified jobs stuck in processing. Team is working to clear the stuck jobs. Team has rotated the worker, client, and webapp tiers node which were facing some space issues.
  • Time: March 15, 2022, 7:56 a.m.
    Status: Identified
    Update: The support team and the DevOps team are actively working on this and we are very close to resolving this issue. On the back side we are replacing the bad node and trying to fix it as soon as possible.
  • Time: March 15, 2022, 5:52 a.m.
    Status: Identified
    Update: The support team and the DevOps team are actively working on this and we are very close to resolving this issue. On the back side we are replacing the bad node and trying to fix it as soon as possible.
  • Time: March 15, 2022, 3:30 a.m.
    Status: Identified
    Update: All the clusters are starting fine manually. Cluster start up fails intermittently when scheduled command triggers cluster to start. Now Devops is working on this particular issue.
  • Time: March 15, 2022, 12:55 a.m.
    Status: Identified
    Update: All the clusters are starting fine manually. Cluster start up fails intermittently when scheduled command triggers cluster to start. Now Devops is working on this particular issue.
  • Time: March 14, 2022, 9:50 p.m.
    Status: Identified
    Update: All the clusters are starting fine manually. Cluster start up fails intermittently when scheduled command triggers cluster to start. Now Devops is working on this particular issue.
  • Time: March 14, 2022, 7:15 p.m.
    Status: Identified
    Update: All the clusters are starting fine manually. Cluster start up fails intermittently when scheduled command triggers cluster to start. Now Devops is working on this particular issue.
  • Time: March 14, 2022, 4:21 p.m.
    Status: Identified
    Update: All the clusters are starting fine manually. Cluster start up fails intermittently when scheduled command triggers cluster to start. Now Devops is working on this particular issue
  • Time: March 14, 2022, 1:16 p.m.
    Status: Identified
    Update: All the clusters are starting fine manually. Cluster start up fails intermittently when scheduled command triggers cluster to start. Now Devops is working on this particular issue.
  • Time: March 14, 2022, 10:12 a.m.
    Status: Identified
    Update: DevOps team is actively working on the issue with individual customers and trying to resolve it at the earliest.
  • Time: March 14, 2022, 8:49 a.m.
    Status: Investigating
    Update: DevOps team has identified the cause of the issue as scheduler autoscaling that is contributing to the remaining intermittent issues. They are currently working to resolve it.
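
Several of the updates above trace stuck commands and cluster start failures back to tunnel and bastion connectivity (the thrift.transport.TTransport.TTransportException errors). As a generic, standard-library-only illustration of how to separate a plain network failure from an application-level one, here is a small Python sketch; the hostnames and ports are placeholders, not Qubole infrastructure details.

```python
# Hypothetical TCP reachability check for tunnel/bastion endpoints.
# Hostnames and ports below are placeholders.
import socket

ENDPOINTS = [
    ("bastion.example.internal", 22),   # placeholder bastion host (SSH)
    ("tunnel.example.internal", 3306),  # placeholder tunnel toward a metastore/RDS port
]

def reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"{host}:{port} unreachable: {exc}")
        return False

for host, port in ENDPOINTS:
    print(f"{host}:{port} -> {'ok' if reachable(host, port) else 'FAILED'}")
```

If the TCP connection succeeds but the Thrift handshake still fails, the problem is more likely in the tunnel or service configuration than in basic network reachability.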


Check the status of similar companies and alternatives to Qubole

Smartsheet
Systems Active

ESS (Public)
Systems Active

Cloudera
Systems Active

New Relic
Systems Active

Boomi
Systems Active

AppsFlyer
Systems Active

Imperva
Systems Active

Bazaarvoice
Issues Detected

Optimizely
Systems Active

Electric
Systems Active

ABBYY
Systems Active

Frequently Asked Questions - Qubole

Is there a Qubole outage?
The current status of Qubole is: Systems Active
Where can I find the official status page of Qubole?
The official status page for Qubole is here
How can I get notified if Qubole is down or experiencing an outage?
To get notified of any status changes to Qubole, simply sign up to OutLogger's free monitoring service. OutLogger checks the official status of Qubole every few minutes and will notify you of any changes. You can view the status of all your cloud vendors in one dashboard. Sign up here
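
If you prefer to script your own check instead of (or alongside) a dashboard, the sketch below polls a Statuspage-style status.json endpoint every few minutes and prints any change. The URL is a placeholder for the official Qubole status page, and the requests library is an assumption.

```python
# Hypothetical polling loop for a Statuspage-style status endpoint.
# Replace STATUS_URL with the official status page's API URL.
import time
import requests

STATUS_URL = "https://status.example.com/api/v2/status.json"  # placeholder
POLL_SECONDS = 300  # "every few minutes"

last_status = None
while True:
    try:
        payload = requests.get(STATUS_URL, timeout=10).json()
        current = payload.get("status", {}).get("description", "unknown")
    except (requests.RequestException, ValueError) as exc:
        current = f"check failed: {exc}"
    if current != last_status:
        print(f"status changed: {last_status!r} -> {current!r}")
        last_status = current
    time.sleep(POLL_SECONDS)
```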