Get notified about any outages, downtime or incidents for ConexED and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for ConexED.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for ConexED:
Component | Status
---|---
Admin Pages | Active |
Appointment Management System | Active |
Archiving System | Active |
Business Intelligence | Active |
Case Management | Active
Classrooms | Active |
ConexED University | Active |
Cranium Cafe | Active |
Document Library | Active |
Kiosk/Hub | Active |
Lobby & Chat Features | Active |
REST API and LTI | Active |
Student Support Directory | Active |
Whiteboard | Active |
View the latest incidents for ConexED and check for official updates:
Description: Additional capacity has been added and traffic has been re-routed to reduce load.
Status: Resolved
Impact: Minor | Started At: May 18, 2020, 8:31 p.m.
Description: The system is back online. The issue was caused by our in-memory Redis database hitting its memory capacity limit. We have added more capacity and alarms to prevent this from happening in the future.
Status: Resolved
Impact: Major | Started At: May 18, 2020, 4:38 p.m.
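The note above attributes this outage to an in-memory Redis instance reaching its memory limit. As a minimal illustration (not ConexED's actual tooling), the sketch below shows how one might watch Redis memory headroom and raise an alarm before the limit is hit, assuming a redis-py client; the 80% threshold and the print-based alert are hypothetical placeholders for a real alerting hook.

```python
import redis

ALERT_THRESHOLD = 0.80  # hypothetical: warn at 80% of configured maxmemory


def check_redis_headroom(host="localhost", port=6379):
    """Warn when a Redis instance approaches its configured maxmemory."""
    r = redis.Redis(host=host, port=port)
    info = r.info("memory")          # INFO MEMORY section as a dict
    used = info["used_memory"]
    limit = info.get("maxmemory", 0)

    if limit == 0:
        # No maxmemory set: the instance can grow until the host runs out of RAM.
        print("WARNING: maxmemory is unset; no hard capacity limit is enforced")
        return

    ratio = used / limit
    if ratio >= ALERT_THRESHOLD:
        print(f"ALERT: Redis at {ratio:.0%} of maxmemory ({used}/{limit} bytes)")
    else:
        print(f"OK: Redis at {ratio:.0%} of maxmemory")


if __name__ == "__main__":
    check_redis_headroom()
```

Run periodically (for example from cron or a monitoring agent), a check like this gives early warning before evictions or write failures begin, which is the kind of alarm the incident note describes adding.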
Description: This incident has been resolved.
Status: Resolved
Impact: Maintenance | Started At: April 13, 2020, 8:31 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: None | Started At: March 26, 2020, 7 p.m.
Description: **Summary** Users worldwide experienced 504 gateway errors between 2:00 PM MST and 4:30 PM MST on March 24th, 2020.

**Overview** Heavy system authentication usage caused a database table lock, which cascaded into service denials across the system. While user authentication was down, users already in meetings and whiteboards were able to keep using those portions of the system; however, if those users refreshed their browser, they were denied service because the authentication system could not be reached. The system was partially restored at 4:00 PM MST and fully restored at 6:00 PM MST after service dependency clean-up was manually implemented.

**Details** System peak usage is normally between 1:00 PM MST and 4:00 PM MST. Unprecedented levels of authentication requests caused a lock-up in one of our real-time session databases around 1:50 PM MST. As that database slowed to a crawl, load was automatically directed to read-only backup databases. These databases soon failed and caused a total lock-up of the system at 2:10 PM, resulting in 504 gateway errors. The issue was quickly identified, but could not be quickly resolved. As the NetOps team restored service, the authentication system was again quickly overwhelmed, crashed, and produced additional 504 gateway errors. A reboot of the authentication database proved successful at first, but as the flood of users re-entered, the authentication system locked up again. After examining portions of the authentication code, our engineering team found that under extremely heavy load an unnecessary recursive loop was called, locking up the system. This authentication code was rewritten, but it took time to test and deploy. The new code was tested around 3:30 PM and deployed at 4:00 PM. The new code is working: the systems are handling 2x the load with a 10x increase in performance. However, as the systems came back online at 4:00 PM, several dependencies were decoupled, causing inaccuracies in reporting what and where meetings were running. The meeting database had to be audited and manually updated system-wide. New meetings could be created, but some users may have experienced a "This meeting has ended" error during this period. This manual update was finished at 5:50 PM MST, and the system was fully restored at 6:00 PM MST.

**Mitigation** Although the incident's cause and solution were rapidly identified by our NetOps team, the issue was preventable and could have been identified and fixed sooner. Our teams have conducted an extensive retrospective analysis and have identified a number of actions that will prevent issues like this in the future and provide faster resolution:

• Adding DNS fail-safes to quickly route all traffic requesting authentication to a friendlier system outage page. Redirecting all traffic during an outage will let engineering work on a quiet system and resolve incidents faster.
• Removing several tightly coupled dependencies on the authentication system to allow the system to self-heal in critically overloaded situations.
• Further examining and refactoring the authentication code to be more performant and to handle critical loads gracefully.
• Adding to our automated alerts to provide faster, fail-safe notifications directly to our engineering teams.
• Improving auto-scaling on our authentication system so that adding capacity is automated and requires no manual intervention.
**Conclusion** We understand any challenges encountered while using ConexED are frustrating, and they impact your ability to serve your students, tutors, faculty, staff and teachers. We are taking what we have learned from this incident to improve how we detect, identify and handle these issues in the future. We are deeply sorry for the impact this had on your ConexED users.
Status: Postmortem
Impact: Critical | Started At: March 24, 2020, 7:39 p.m.
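The mitigation list in the postmortem above mentions decoupling dependencies so an overloaded authentication system can self-heal rather than being hammered by retries. The sketch below is a generic circuit-breaker pattern, not ConexED's implementation: after repeated failures it stops calling the dependency for a cool-down period, giving it room to recover. The failure and cool-down thresholds are hypothetical.

```python
import time


class CircuitBreaker:
    """Generic circuit breaker: after repeated failures, stop calling the
    dependency for a cool-down period so it can recover under load."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures   # hypothetical threshold
        self.reset_after = reset_after     # cool-down in seconds
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Circuit is open: fail fast instead of hitting the dependency.
                raise RuntimeError("circuit open: dependency is cooling down")
            # Cool-down elapsed; allow a trial call.
            self.opened_at = None
            self.failures = 0

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise

        self.failures = 0  # any success resets the failure count
        return result
```

Wrapping calls to a struggling service (for example, `breaker.call(authenticate, user)` with a hypothetical `authenticate` function) means callers back off automatically during an overload instead of amplifying it, which is the behavior the mitigation item describes.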
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.