Last checked: 8 minutes ago
Get notified about any outages, downtime or incidents for Fasterize and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Fasterize.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage. It's completely free and takes less than 2 minutes!
OutLogger tracks the status of these components for Fasterize:
Component | Status |
---|---|
Acceleration | Active |
API | Active |
CDN | Active |
Collect | Active |
Dashboard | Active |
Logs delivery | Active |
Website | Active |
View the latest incidents for Fasterize and check for official updates:
Description:
# Description

Between 8:53 and 9:52 a.m. (UTC+2), the front layer was overloaded following the faulty automatic restart of some machines. During this period, only a limited number of machines carried the traffic. Traffic was rerouted for only a few minutes, despite the probes reporting the unavailability of the platform.

# Facts and timeline

* 6:30 am: automatic renewal of the Let's Encrypt certificates
* 6:30 am: start-up of the machines for the day
* 6:35 am: the fronts return an error on start-up and remain unavailable on the load-balancer
* 8:27 am: first bandwidth alert, automatically resolved at 8:32 am
* 8:45 am: second bandwidth alert, automatically resolved at 8:51 am; the team tries to start new machines
* 8:51 am: third bandwidth alert, automatically resolved at 9:00 am
* 8:53 am: the front layer starts to be overloaded at the network level
* 9:01 am: first ticket to the support desk
* 9:19 am: global availability alert => global disabling of Fasterize
* 9:20 am: identification of the issue on the defective machines
* 9:25 am: first communication on [status.fasterize.com](http://status.fasterize.com)
* 9:35 am: first attempt to deploy the fix fails
* 9:45 am: fix deployment restarted
* 9:52 am: end of deployment and restart of the machines

# Analysis

Every day, the Fasterize infrastructure is adjusted to the expected traffic, and machines are switched off or started up accordingly. At 6:30 am, the front machines were started normally, but the HTTP service did not start correctly.

The HTTP service could not start because of a configuration problem related to the automatic renewal of the Let's Encrypt certificates: the HTTP service no longer had access to the private keys of the certificates renewed during the night and refused to start. This access was lost because the new certificate renewal mechanism applies different permissions to certificates and private keys, and the automatic renewal ran at 6:30 am.

The load-balancer did see the machines come up, but marked them as unhealthy. The remaining machines therefore took all the traffic and became overloaded once they reached their maximum capacity. The CDN layer then had trouble reaching the origin and returned 50x errors.

The availability of the optimization layer, as measured by the external probes, shows unavailability from 8:52 am, while the global availability probes show unavailability for only 3 minutes starting at 9:18 am. Customer origins are tied to the global probes (covering the CDN and the optimization layer) and were therefore not rerouted from the beginning of the incident. The global availability probes were configured with a lower sensitivity, and since part of the traffic kept flowing, they did not report the same unavailability.

The alerts raised to the on-call engineer concerned the excessive network traffic but not the 504 errors, because the average 504 error rate did not exceed the usual thresholds, even at the peak of the incident around 9:17 am. No alert was raised on the unavailability of the HTTP service on the fronts.

# Metrics

* Incident severity level:
  * Severity 2: site degradation, performance problem and/or broken feature that is difficult to work around and impacts a significant number of users
* Time to detect: 2h (from the start-up of the fronts)
* Time to resolve: 3h
* Duration of the incident: 60 minutes

# Impacts

Over the duration of the incident, 50x errors accounted for 10.77% of the HTML page traffic, 3.52% of the non-cached traffic and 1.15% of the total traffic. At the peak of the incident (9:17 am), these rates rose to 38.7%, 16.3% and 5.5% respectively.

Eleven customers reported errors via support.

# Action plan

[ ] planned, [-] in progress, [x] done

**Short term:**

* [-] Change the synchronization mechanism for the Let's Encrypt certificates
* [ ] Improve the feedback (logs and alerts) when renewal and/or synchronization fails (see the sketch below)
* [x] Fix the sensitivity of the global availability probe
* [x] Organization: systematically disconnect the platform for any incident impacting all customers
* [-] Test the manual disconnection on a staging environment
* [x] Review the alert thresholds on the 504 errors seen by CloudFront
* [-] Add an alert on the availability of the HTTP service on the fronts
* [ ] Organization: improve the reaction time before an incident is published

**Medium term:**

* Improve the resilience of the fronts against an invalid or missing SSL certificate.

**Long term:**

* Review the SSL certificate management system.
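The two short-term items about renewal feedback and HTTP service availability can be approximated with a small permission check run right after certificate renewal and before the HTTP service is (re)started. The sketch below is purely illustrative, not Fasterize's actual tooling: the certificate layout (`/etc/letsencrypt/live/*/privkey.pem`), the `www-data` service account and the alerting hook are all assumptions.

```python
#!/usr/bin/env python3
"""Hypothetical pre-start check: verify that the HTTP service account can
read the private keys of freshly renewed certificates (illustrative only)."""

import glob
import os
import pwd
import sys

CERT_KEY_GLOB = "/etc/letsencrypt/live/*/privkey.pem"  # assumed layout
HTTP_SERVICE_USER = "www-data"                          # assumed service account


def key_readable_by(path: str, username: str) -> bool:
    """Return True if `username` can read `path` via owner/group/other bits."""
    st = os.stat(path)
    user = pwd.getpwnam(username)
    if st.st_uid == user.pw_uid:
        return bool(st.st_mode & 0o400)
    if st.st_gid in os.getgrouplist(username, user.pw_gid):
        return bool(st.st_mode & 0o040)
    return bool(st.st_mode & 0o004)


def main() -> int:
    unreadable = [p for p in glob.glob(CERT_KEY_GLOB)
                  if not key_readable_by(p, HTTP_SERVICE_USER)]
    if unreadable:
        # In a real setup this would page on-call instead of just printing.
        print("ALERT: HTTP service cannot read private keys:", unreadable,
              file=sys.stderr)
        return 1
    print("All renewed private keys are readable by", HTTP_SERVICE_USER)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Run as a post-renewal hook, a non-zero exit status would have surfaced the permission problem at 6:35 am instead of letting the unhealthy fronts go unnoticed until the traffic peak.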
Status: Postmortem
Impact: Minor | Started At: June 18, 2020, 7:25 a.m.
Description: Optimizations are now enabled.
Status: Resolved
Impact: Minor | Started At: June 4, 2020, 1:42 p.m.
Description:
# Description

Between 10:18 am and 11:20 am (UTC+2), the static resources of some clients responded with 503 errors. Internet users did not necessarily see these errors, but some sites may have displayed broken pages because of these missing objects, especially for users who did not already have them in their browser cache.

# Facts and timeline

* 10:18: manual update of one of our components
* 10:28: first alert
* 10:36: start of the CDN layer bypass for the impacted domains
* 10:52: all impacted domains bypass the CDN layer; due to DNS propagation delays, errors occur until 11:20
* 13:42: start of the reconnection of the impacted domains to the CDN
* 14:04: impacted domains are reconnected to the CDN

# Analysis

The incident was caused by an update of one of our components that was not supposed to affect the production stack. As a side effect of this update, an execution role needed by the edge processes on the CDN layer was removed.

# Metrics

* Severity: level 2 (site degradation, performance problem and/or broken feature that is difficult to work around and impacts a significant number of users)
* Time to detect: 10 min
* Time to resolve: 60 min

# Impacts

Only a few sites were impacted (<10).

# Countermeasures

* Short term:
  * adjust alerting on edge processes to improve diagnosis
  * adjust the alert level on 5xx errors seen from the CDN layer (see the sketch below)
* Mid term:
  * secure the execution role of the edge processes
  * make it easier to unplug the CDN layer for a specific customer
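The countermeasure about the 5xx alert level boils down to tracking the ratio of 5xx responses over a sliding window of CDN traffic and firing when it crosses a threshold. The sketch below is a minimal, self-contained illustration; the window size, the 5% threshold and the `FiveXXAlarm` name are assumptions and do not describe Fasterize's actual monitoring.

```python
"""Hypothetical 5xx-rate alarm for CDN access logs (illustrative only)."""

from collections import deque
from dataclasses import dataclass


@dataclass
class FiveXXAlarm:
    window_size: int = 1000      # number of most recent requests considered
    threshold: float = 0.05      # alert when more than 5% of them are 5xx

    def __post_init__(self) -> None:
        self._statuses = deque(maxlen=self.window_size)

    def observe(self, status_code: int) -> bool:
        """Record one response; return True when the alarm should fire."""
        self._statuses.append(status_code)
        if len(self._statuses) < self.window_size:
            return False  # not enough data yet to compute a stable rate
        errors = sum(1 for s in self._statuses if 500 <= s <= 599)
        return errors / len(self._statuses) > self.threshold


# Example: feed status codes parsed from CDN logs into the alarm.
if __name__ == "__main__":
    alarm = FiveXXAlarm(window_size=100, threshold=0.05)
    sample = [200] * 90 + [503] * 10  # 10% errors, above the 5% threshold
    fired = any(alarm.observe(code) for code in sample)
    print("alert raised:", fired)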
Status: Postmortem
Impact: Minor | Started At: May 22, 2020, 9:01 a.m.
Description: The fix has been deployed and image inlining is enabled again. We are sorry for the inconvenience.
Status: Resolved
Impact: Major | Started At: May 7, 2020, 8:16 a.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Critical | Started At: April 30, 2020, 6:20 a.m.
Join OutLogger to be notified when any of your vendors or the components you use experience an outage or downtime. Join for free - no credit card required.