Get notified about any outages, downtime or incidents for Rippling and 1800+ other cloud vendors. Monitor 10 companies, for free.
Outage and incident data over the last 30 days for Rippling.
OutLogger tracks the status of these components for Rippling:
Component | Status |
---|---|
Rippling App | Active |
Device & Inventory Management | Active |
Devices app | Active |
Identity & App Management | Active |
Authentication | Active |
Platform API | Active |
RADIUS | Active |
RPass | Active |
Single Sign-on (SSO) | Active |
Third-party integrations | Active |
VLDAP | Active |
Unity | Active |
Workflow Automator | Active |
View the latest incidents for Rippling and check for official updates:
Description: This issue has been resolved.
Status: Resolved
Impact: Major | Started At: Oct. 1, 2024, 2 p.m.
Description: The incident has been fully resolved.
Status: Resolved
Impact: Minor | Started At: Sept. 27, 2024, 7:39 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Major | Started At: Sept. 9, 2024, 5:38 p.m.
Description:

## Overview

Access to third-party apps was impaired or deprovisioned for a subset of customers and their employees from June 29 2024 12:30 AM PDT to July 1 2024 12:22 PM PDT. The problem resulted from a code change that caused an unintentional update to third-party app groups, emptying their membership lists during an automated nightly sync. These changes had a significant impact on third-party apps, leading to a loss of access in many instances. When the affected group belonged to a large access group in a third-party identity provider (IdP), many employees were deprovisioned from their assigned applications.

No unauthorized data access occurred during or as a result of this incident.

We sincerely apologize for any inconvenience this incident may have caused to our customers. At Rippling, we take our responsibility for the stability and reliability of our systems very seriously. Our team is fully committed to conducting a thorough and comprehensive analysis of this incident to understand its root causes and to implement improvements to our systems and processes. We are dedicated to learning from this experience and taking all necessary steps to prevent this type of incident from happening again in the future.

## Timeline

| **Timestamp** | **Event** | **Elapsed time (hours)** |
| --- | --- | --- |
| June 29 2024 12:30 AM PDT | Incident began (nightly sync starts) | 0 |
| June 29 2024 05:07 AM PDT | Detection of customer impact | 4.5 |
| June 29 2024 06:12 AM PDT | Code change reverted | 5.75 |
| June 29 2024 07:01 AM PDT | Mitigation; no more groups impacted | 6.5 |
| June 29 2024 07:44 AM PDT | Statuspage incident created | 7.25 |
| June 29 2024 09:01 AM PDT | Remediation start for affected customers | 8.5 |
| June 29 2024 09:59 AM PDT | In-app banner created notifying all users | 9.5 |
| June 30 2024 05:30 AM PDT | 90% of affected groups remediated | 29 |
| June 30 2024 03:56 PM PDT | Impacted customer email sent | 38.5 |
| June 30 2024 06:30 PM PDT | 95% of affected groups remediated | 42 |
| July 1 2024 12:22 PM PDT | 99% of affected groups remediated | 60 |
| July 3 2024 09:01 AM PDT | Incident considered resolved | 104.5 |

## Root Cause

### Background

Rippling enables users to synchronize third-party groups to groups in Rippling using Supergroups. Groups contain rules, which are continuously evaluated to a membership list. An example of a rule is “All - everyone,” which would evaluate to all active employees. A group can also include a static list of individual users independently or in combination with other rules.

When membership of a group connected to a third-party group is changed, Rippling calculates the delta (the difference in membership) and updates the third-party group to match.

A nightly sync detects new groups in the third-party and creates them in Rippling to ensure Rippling administrators can automatically change third-party app access during hiring, transition or termination workflows.

### Root cause

The problem was caused by a code change to our nightly sync module on Friday June 28 2024 at 8:07 PM PDT meant to prepare for an upcoming feature.
This change was meant to run only against newly created groups that Rippling imported from third-party apps in our nightly sync; instead it ran against all groups, including existing ones that had defined rules. Any existing rule in the group was removed, and the result was a group with 0 members being propagated to the third-party app group.

When the nightly sync started (at 12:30 AM PDT, “incident start” in the timeline), the problematic code change removed any existing rule in the group, turning a group like “All Employees” that had 500 members into one that had 0 members. The system then detected the change in membership and immediately removed 500 members from the third party via an API call. This happened to every group that was processed by the nightly sync.

#### Example sequence of events

| **Event description** | **Group’s rule** | **Members (Rippling)** | **Members (App)** |
| --- | --- | --- | --- |
| Prior to nightly sync | `All - everyone` | 500 | 500 |
| Nightly sync starts | `All - everyone` | 500 | 500 |
| Group reaches problematic code and deletes the group’s rule | – | 0 | 500 |
| New group membership detected; updates third-party app via API | – | 0 | 0 |

### Impact

During our nightly sync, which ran from June 29 2024 at 12:30 AM PDT to June 29 2024 at 7:01 AM PDT when it was manually stopped, Rippling’s systems made API calls to third-party apps to remove users from groups in the third-party.

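The example sequence of events above can be sketched in a few lines of Python. This is a minimal illustration, not Rippling's code: the `Group` class and the `evaluate` and `sync` functions are hypothetical names, and only the `All - everyone` rule is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical stand-in for a Rippling group linked to a third-party group."""
    name: str
    rules: list[str]
    app_members: set[str] = field(default_factory=set)


def evaluate(group: Group, employees: set[str]) -> set[str]:
    """Evaluate the group's rules to a membership list (one rule modeled)."""
    return set(employees) if "All - everyone" in group.rules else set()


def sync(group: Group, employees: set[str]) -> tuple[set[str], set[str]]:
    """Compute the membership delta and apply it to the third-party side.

    Returns the (added, removed) sets that would be sent as API calls.
    """
    desired = evaluate(group, employees)
    added = desired - group.app_members
    removed = group.app_members - desired
    group.app_members = desired
    return added, removed


employees = {f"employee-{i}" for i in range(500)}
group = Group("All Employees", ["All - everyone"], set(employees))

# Healthy group: Rippling and the app agree, so the delta is empty.
added, removed = sync(group, employees)
print(len(added), len(removed))

# Simulate the faulty step deleting the rule, then the delta being pushed.
group.rules.clear()
added, removed = sync(group, employees)
print(len(removed), len(group.app_members))
```

Nothing in the delta logic is wrong here; it faithfully propagates an empty membership list, which is why a deleted rule turns into a mass removal.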
_This led to some employees and customers losing access to critical third-party apps and services._

* When affecting a group tied to admin permissions in the third-party app, some customers were locked out of their account because there were no more admins
* When affecting a group in an identity provider, deprovisioning or other changes occurred due to changes cascading through its apps
* When affecting a group tied to product access such as Atlassian Trello, the third-party app deleted the user’s application data

## Resolution

The problematic pull request was reverted on June 29 2024 at 6:12 AM PDT and the incident team started to terminate all nightly syncs to prevent further impact, stopping them all by June 29 2024 7:01 AM PDT.

* An incident team worked to detect the affected groups and restore the rule that was deleted using a database snapshot created prior to incident start
* A separate incident team created an auditing system that compared the membership of a group in Rippling to the group in the third-party

Each affected group was restored from the database snapshot, and the system subsequently updated the third-party app’s group. Then, the auditing tool compared the group membership between Rippling and the third-party app.

All impacted customers were emailed the specific apps and groups impacted by the incident by June 30 2024 03:56 PM PDT. More than 90% of groups had been remediated at that point. 99% of groups were restored by July 1 2024 12:22 PM PDT. 0.06% of groups could not be restored due to an authorization issue; in those cases, Rippling customer support worked with the customer to resolve it.

## Action Items

We are committing to the following immediate action items:

* The code for the nightly job module has been frozen from further modification.
We have implemented a change management process that will restrict code changes to this part of our system until we've completed our detailed root cause analysis and implemented system safeguards to prevent a repeat of this class of incident.
* Updates to the nightly sync module will be forced to go through a gradual (phased) rollout
* Infrastructure to detect and prevent updates to third-party apps that cause deprovisioning or removal from groups across multiple companies based on defined thresholds. This acts as a global circuit breaker to ensure impact is limited, but a single company can still be impacted.

Rippling will be publishing a long-form root cause analysis to be shared with customers by July 12, 2024 and a public blog post reviewing this incident and detailing progress on these action items and longer term improvements.

| July 12, 2024 Addendum |
| --- |

This is an addendum to the postmortem published on [Rippling’s status page](https://status.rippling.com/incidents/hm6vf6ct17p6). It outlines the events that caused this incident, Rippling’s response, what worked or failed, and actions we’ve taken since the incident and ones we’re committing to in the coming weeks and months.

## Intended behavior: Supergroups for third-party apps

Group management using [Supergroups](https://www.rippling.com/policies) is an essential identity and access management feature. It allows automation of access and permissions in third-party apps like Google Workspace and GitHub, where you can define that all employees in the Infrastructure Engineering department are supposed to be in the `[email protected]` Google Group and the `infra` GitHub team. When Rippling detects changes to these rules, our system automatically calculates which API calls should be made to those apps and executes them within seconds of the original change.

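As a rough illustration of how a rule change could be turned into API calls, the sketch below evaluates a department rule and plans the add and remove operations for the `infra` GitHub team. The `Call` type, the `plan_calls` function, and the employee data are all hypothetical; the actual Rippling implementation is not described in this postmortem.

```python
from typing import NamedTuple


class Call(NamedTuple):
    """One planned third-party API operation (hypothetical shape)."""
    action: str  # "add" or "remove"
    target: str  # third-party group or team name
    user: str


def plan_calls(target: str, desired: set[str], current: set[str]) -> list[Call]:
    """Plan the calls needed to make the third-party group match the rule."""
    adds = [Call("add", target, user) for user in sorted(desired - current)]
    removes = [Call("remove", target, user) for user in sorted(current - desired)]
    return adds + removes


# Made-up employees and departments, for illustration only.
departments = {
    "ana": "Infrastructure Engineering",
    "bo": "Sales",
    "cy": "Infrastructure Engineering",
}

# Rule: everyone in Infrastructure Engineering belongs in the team.
desired = {name for name, dept in departments.items()
           if dept == "Infrastructure Engineering"}
current = {"ana", "bo"}  # what the third-party team currently contains

for call in plan_calls("infra", desired, current):
    print(call)
```

Here `cy` is added and `bo` is removed, while `ana` is untouched because she is already a member.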
The notable gap in this feature is that, while you can control third-party app _group membership_ with Supergroups, you cannot yet control _access_ to the app itself.

* This allows for significant automation without _any custom integration_, such as a complex contractor access rule like “only provide access to Google Workspace if the contractor has passed their Checkr background check, completed a security assessment in Rippling LMS, and has been confirmed as an active employee by their manager”

While working on this feature, Rippling unintentionally triggered this incident.

## The change that caused the incident

Prior to the start of the incident, Rippling had a code freeze. This is a common practice for SaaS companies due to significant activity occurring in sales at the end of the month. This means that no code was deployed to production until June 28 2024 7:43 PM PDT. The code change that caused the incident happened to be part of this deployment, but was actually merged a full day prior.

The problematic code change itself was 4 lines of code, but the development process was complicated by the fact that there were multiple people working on the change.

1. A developer put up a _working_ code change that was rejected by a test that was no longer relevant.
   1. Prior to this, there was a version of the code that was problematic (what would eventually start the incident).
2. In a pair programming session, another developer pulled the problematic version of the code, but since the session was reviewing the unrelated test, the problematic part of the code was not reviewed.
   1. The problematic code was pulled because the local code was not up to date
3. After resolving the issue with the test that was no longer relevant, the problematic code was accidentally included in the code change, which enabled the start of the incident

While the change with the problematic code included a set of tests, they were not sufficient.

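The 4-line change itself is not shown in this postmortem, so the following is only a sketch of the kind of regression test that would have covered the described behavior: a step meant for newly imported groups must leave the rules of existing groups untouched. The function `clear_rules_on_new_imports`, the `newly_imported` flag, and the dictionary layout are assumptions made for illustration.

```python
def clear_rules_on_new_imports(groups: list[dict]) -> None:
    """Hypothetical sync step: reset rules only on groups imported in this run."""
    for group in groups:
        if group["newly_imported"]:
            group["rules"] = []


def test_existing_groups_keep_their_rules() -> None:
    existing = {
        "name": "All Employees",
        "newly_imported": False,
        "rules": ["All - everyone"],
    }
    imported = {
        "name": "new-team",
        "newly_imported": True,
        "rules": ["placeholder"],
    }

    clear_rules_on_new_imports([existing, imported])

    # The existing group's rule must survive the nightly sync step.
    assert existing["rules"] == ["All - everyone"]
    # Only the newly imported group is modified.
    assert imported["rules"] == []


test_existing_groups_keep_their_rules()
print("ok")
```

A version of the step that dropped the `newly_imported` check would fail the first assertion, which is the failure mode described in the root cause.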
## Rippling’s response

### Detection

Rippling engineering determined that this was a customer-facing incident on Saturday, June 29 at 5:07 AM PDT, 4.5 hours after the incident began, when the engineering team responsible for the feature under development was paged. Our observability and monitoring systems account for the _lack of API calls being made_ (as it would indicate an outage of our product) but did not test for the _aggregate behavior_ across those API calls.

### Mitigation

Mitigation of impact to additional customers occurred two hours after detection. During this time, our focus was to limit the reach of the impacted code. Rippling was also impacted by this incident; the incident response team lost access to the tools needed to stop our systems from making additional API calls. The incident response team had to trigger a “break-glass” policy, but access to those instructions was also revoked.

Once the team confirmed that further impact was halted, multiple incident response teams were created to handle mitigation for impacted customers:

* **Recall & restore:** A team to (1) _find_ all impacted groups and (2) _restore_ them to their previous state
* **Push & audit:** A team to (3) _push_ and (4) _audit_ restored groups between Rippling and the third-party to confirm matching membership

| **Step 1** | **Step 2** | **Step 3** | **Step 4** |
| --- | --- | --- | --- |
| Find | Restore | Push | Audit |

Only groups that went through each of these steps successfully were considered mitigated.

#### 90% Mitigation

It took about 24 hours for Rippling to mitigate 90% of affected groups. In the beginning, the recall and restore team was determining which groups were affected by examining the logs of the jobs triggered by the nightly sync. This was determined to be inaccurate once false negatives were discovered through various inbound support channels.
The team then pivoted to work through historical records of each group in the database. While this process would guarantee a better recall rate, it was slow, and it became clear that restoration of the data needed to occur from a database snapshot.

As Saturday went on, the total number of affected groups slowly increased, creating a bottleneck and making it difficult to communicate current status and affected customers. Another bottleneck occurred with the push & audit team when some restored third-party apps and their groups stopped accepting updates for various reasons.

Once there was confirmation that all impacted groups were recalled, Rippling sent a communication (on Sunday, June 30 2024 03:56 PM PDT) to affected customers with:

1. A list of groups that were restored and audited
2. A list of groups that would need to be manually pushed at that point in time

While the incident would be fully resolved and all groups would be audited by Rippling (including those in the second list above), we wanted to ensure administrators knew which third-party app groups were still unresolved prior to the Monday working day.

### Resolution

Resolution of the incident was approximately 3 days after 90% of groups were mitigated. While 99% of groups were mitigated 2 days prior to resolution, Rippling considered the incident resolved once we determined _all_ impacted third-party groups accessible via API were mitigated. 0.06% of impacted groups were not able to be remediated.

## Addressable root causes

The following root causes, when addressed, ensure minimal recurrence risk and limited impact.

1. Lack of a test that covered the behavior produced by the code change
2. Lack of a mechanism to “gatekeep” irreversible third-party API calls
3. Limited observability into successful but unintended API behavior across companies
4. Incident response teams were unable to reference break-glass procedures

## Actions since resolution

* We’ve prevented changes to code that can impact third-party app access at large
* Created a set of monitors for mass access changes across companies
* Instituted an SOP for a “circuit-breaker”, used in case a true positive is detected

### Other committed action items, in progress

* Soak testing pipeline for the nightly sync module
* CI/CD checks for force pushed commits from multiple authors
* Mass access change approvals, coming as a part of a new integration management app
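
The threshold-based circuit breaker and the monitors for mass access changes across companies could look roughly like the sketch below. The function name `should_trip`, the threshold values, and the `(company_id, user_id)` pair format are assumptions for illustration; the postmortem does not state the thresholds Rippling uses.

```python
from collections import Counter


def should_trip(
    pending_removals: list[tuple[str, str]],
    max_companies: int = 3,
    max_removals_per_company: int = 50,
) -> bool:
    """Decide whether to halt outgoing removal calls.

    pending_removals holds (company_id, user_id) pairs queued for removal.
    The breaker trips only when several companies each see a mass removal,
    so a large change inside a single company still goes through.
    """
    per_company = Counter(company for company, _ in pending_removals)
    heavy = [
        company
        for company, count in per_company.items()
        if count >= max_removals_per_company
    ]
    return len(heavy) >= max_companies


# Three companies each losing 50 members in one batch: halt and page a human.
batch = [(f"co-{c}", f"user-{i}") for c in range(3) for i in range(50)]
print(should_trip(batch))

# One company removing 500 members (for example, a real reorganization).
single = [("co-0", f"user-{i}") for i in range(500)]
print(should_trip(single))
```

Counting affected companies rather than total removals reflects the trade-off stated in the action items: the breaker limits global impact, but a single company can still be affected.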
Status: Postmortem
Impact: Critical | Started At: June 29, 2024, 2:44 p.m.
Description: This incident has been resolved.
Status: Resolved
Impact: Minor | Started At: June 7, 2024, 6:21 p.m.