Outage and incident data over the last 30 days for Pingboard.
View the latest incidents for Pingboard and check for official updates:
Description: Heroku has reported that their upstream provider has applied a fix. We will continue to monitor.
Status: Resolved
Impact: None | Started At: Nov. 8, 2023, 4:02 p.m.
Description: Earlier today there was an outage for much of the Pingboard application. We know how disruptive this can be, and we apologize for the impact on you and your operations. In addition to the following description of the cause of the incident, we have identified several next steps, outlined below, that describe our efforts to prevent similar issues in the future.

### What was the impact of this incident?

This was a partial outage of the Pingboard application lasting approximately 5 hours, from ~2AM to ~7AM Central time. The outage affected most pages of the Pingboard application. Unaffected areas included user profile pages, the org chart page, unauthenticated web pages, and most API endpoints. Users with already valid sessions also had more access and were able to navigate more of the site.

### Why was the site unavailable?

The Pingboard application uses a Redis database to support many of its functions. At approximately 2AM we lost connectivity to that database, at first intermittently, then completely. The parts of the application that rely on Redis started to return errors.

### Why did the connectivity to Redis fail?

We have contacted our vendor for additional details and will update this as we learn more, but we do have some insight based on how the incident was resolved. Our Redis instance ran an older version that is approaching its EOL date, and the new version requires a different method of connectivity: prior versions used stunnel to encrypt traffic between the application and Redis, while newer versions support direct, secure TLS connections. We had a version upgrade scheduled for the upcoming weekend, ahead of the EOL date. It appears that a change on the vendor side caused the existing method of connectivity to start failing before the planned changeover, which triggered the incident.

### Why didn’t an alert go off when the incident began?

Pingboard’s traffic varies widely by time of day. When the incident began, the ratio of failing to succeeding traffic was not high enough to trigger an alert: alerts fire when more than 5% of traffic results in an error, and the vast majority of traffic at that hour was automated API requests, which were succeeding. As more human traffic arrived starting around 6AM Central time, the ratio of failing requests exceeded our threshold and alerts fired.

### How was the problem addressed?

Once the incident was identified as related to Redis connectivity, we decided to proceed with the version upgrade ahead of schedule and roll out the new method of connectivity. This resolved the issue.

### What follow-up actions have been identified?

* Add a health check and alert specifically for Redis connectivity.
* Lower the error threshold that triggers alerts. The threshold may need some adjustment to avoid too many false alarms, but initial research shows that a level of 3% would have triggered alerts promptly for this incident, so we have made that change immediately.
* Analyze ways to limit the impact of any future Redis outages. Just as not all areas of the application were affected, it may be possible to identify other areas where the impact can be minimized, making the application more tolerant of this type of failure.
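The first follow-up item, a dedicated health check for Redis connectivity, could look roughly like the sketch below. It is an illustration, not Pingboard's implementation: it assumes a standard Redis server that answers the RESP `PING` command with `+PONG`, omits authentication, and the function names are hypothetical. The `use_tls` flag mirrors the connectivity change described above: rather than relying on a separate stunnel process, the client wraps the socket in TLS directly using Python's standard `ssl` module. Managed providers may require a custom CA or different certificate-verification settings; the default context used here verifies against the system trust store.

```python
import socket
import ssl

# RESP (Redis serialization protocol) encoding of the command: PING
PING_COMMAND = b"*1\r\n$4\r\nPING\r\n"


def _send_ping(conn: socket.socket) -> bool:
    """Send PING on an open connection and check for the +PONG reply."""
    conn.sendall(PING_COMMAND)
    reply = b""
    # Read until the end of the first reply line (or until the peer closes).
    while not reply.endswith(b"\r\n"):
        chunk = conn.recv(64)
        if not chunk:
            break
        reply += chunk
    return reply == b"+PONG\r\n"


def redis_ping(host: str, port: int, use_tls: bool = True, timeout: float = 2.0) -> bool:
    """Hypothetical health check: True only if Redis at host:port answers PONG.

    Connection refusals, timeouts, TLS handshake failures and unexpected
    replies (e.g. -NOAUTH when a password is required) all count as unhealthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            if use_tls:
                # Direct TLS to Redis (no stunnel sidecar); the default context
                # verifies the server certificate against the system trust store.
                context = ssl.create_default_context()
                with context.wrap_socket(raw, server_hostname=host) as tls:
                    return _send_ping(tls)
            return _send_ping(raw)
    except OSError:  # includes ssl.SSLError and socket timeouts
        return False
```

A scheduler or monitoring agent could call something like `redis_ping("redis.example.internal", 6380)` (hypothetical host and port) on a fixed interval and alert after a few consecutive `False` results. Because it probes Redis directly, such a check would not depend on how much of the overall request mix happens to be failing; the interval and failure tolerance are judgment calls the postmortem does not specify.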
Status: Postmortem
Impact: Major | Started At: May 17, 2023, 7 a.m.
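The alerting gap described in the postmortem above comes down to a blended error ratio. The sketch below uses made-up request counts, not figures from the incident, to show how mostly successful automated API traffic can keep the overall rate under a 5% threshold while nearly all human page views fail, and how a 3% threshold changes the outcome. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrafficWindow:
    """Request counts for one monitoring window (hypothetical numbers)."""
    api_requests: int
    api_errors: int
    web_requests: int
    web_errors: int

    def overall_error_rate(self) -> float:
        # Blended ratio across all traffic, as a single alert rule would see it.
        total = self.api_requests + self.web_requests
        errors = self.api_errors + self.web_errors
        return errors / total if total else 0.0

    def web_error_rate(self) -> float:
        # Error ratio for human (web page) traffic only.
        return self.web_errors / self.web_requests if self.web_requests else 0.0


def should_alert(rate: float, threshold: float) -> bool:
    # Fire only when the error ratio strictly exceeds the threshold (e.g. > 5%).
    return rate > threshold


# Illustrative windows: API traffic succeeds, most web page views fail.
overnight = TrafficWindow(api_requests=9_600, api_errors=0, web_requests=400, web_errors=380)
morning = TrafficWindow(api_requests=9_000, api_errors=0, web_requests=2_000, web_errors=1_900)

for name, window in [("overnight", overnight), ("morning", morning)]:
    rate = window.overall_error_rate()
    print(f"{name}: overall {rate:.1%}, web {window.web_error_rate():.0%}, "
          f"alert@5%={should_alert(rate, 0.05)}, alert@3%={should_alert(rate, 0.03)}")
# overnight: overall 3.8%, web 95%, alert@5%=False, alert@3%=True
# morning: overall 17.3%, web 95%, alert@5%=True, alert@3%=True
```

With these illustrative counts, 95% of web requests fail in both windows, yet the overnight overall rate is only 380 / 10,000 = 3.8% because API volume dominates the denominator. That is the same pattern the postmortem describes: the alert fired only once enough human traffic arrived to push the blended ratio past the threshold.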
Description: Our image provider has resolved the issue.
Status: Resolved
Impact: None | Started At: Jan. 20, 2023, 4:55 p.m.