
Cloudflare’s network experienced significant service disruptions on November 18, 2025, lasting over two hours, and again on December 5, 2025, when 28% of applications were affected for about 25 minutes. Detailed post-mortem reports were published for both incidents, and Cloudflare is now implementing a new plan to prevent similar outages in the future.
This initiative is named “Code Orange: Fail Small,” emphasizing the goal of enhancing network resilience against errors that could cause widespread outages. A “Code Orange” designation signifies the highest priority for this project, allowing teams to collaborate across functions and temporarily halt other work to achieve its objectives. A similar “Code Orange” was previously declared by Cloudflare after another major incident, highlighting the critical importance of the current effort.
The “Code Orange” plan focuses on three primary areas:
- Implementing controlled rollouts for all network configuration changes, mirroring the existing process for software binary releases.
- Thoroughly reviewing, enhancing, and testing the failure modes of all systems managing network traffic to ensure predictable behavior, even during unexpected error states.
- Modifying internal “break glass”* procedures and eliminating circular dependencies to enable rapid access to all necessary systems during an incident.
These initiatives will introduce continuous improvements, with each update contributing to increased network resilience. The aim is to significantly enhance Cloudflare’s network stability, particularly against the types of issues that caused recent global incidents.
The incidents have caused significant disruption for users and the Internet, making this work a top priority for Cloudflare.
* “Break glass” procedures at Cloudflare permit specific individuals to temporarily elevate their privileges to perform urgent actions during high-severity scenarios.
What Went Wrong?
During the first incident, users encountered error pages when trying to access Cloudflare-protected sites. In the second, blank pages were displayed.
Both outages shared a common trigger: an instantaneous configuration change deployed across Cloudflare’s global data centers.
The November incident stemmed from an automatic update to the Bot Management classifier. This system uses AI models to detect bots by analyzing network traffic, with constant updates to counter evolving threats.
The December incident occurred during efforts to protect users from a React framework vulnerability. A change to a security tool, intended to improve signatures, was deployed with urgency to preempt attackers, initiating the outage.
This recurring pattern highlighted a critical difference in how Cloudflare handles configuration changes compared to software updates. Software releases follow a controlled, monitored process, with deployments progressing through multiple stages and user groups (employees, then increasing percentages of customers) before global rollout. Anomalies trigger automatic rollbacks.
However, this rigorous methodology was not applied to configuration changes. Unlike core software releases, configuration adjustments modify software behavior instantly. This rapid propagation, also available to customers for their settings, carries significant risks. The recent incidents underscore the necessity of treating all network traffic-serving changes with the same level of caution and testing as software updates.
Revising Configuration Update Deployment
The rapid, global deployment of configuration changes was a key factor in both incidents: an incorrect setting was able to disrupt the network within seconds of being deployed.
A critical component of the “Code Orange” plan involves implementing controlled rollouts for configurations, mirroring the established process for software releases.
Cloudflare’s “Quicksilver” software component enables configuration changes, such as new DNS records or security rules, to propagate to 90% of network servers within seconds. This speed, while beneficial for quick network adjustments, allowed breaking changes to spread globally without prior testing in the recent incidents.
Although instant deployment is sometimes useful, it is not always essential. Efforts are underway to apply the same rigorous deployment controls to configurations as are used for code, integrating these controlled deployments within Quicksilver.
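To illustrate what stage-gated configuration distribution could look like, the sketch below models a configuration value that carries the rollout stage it has been promoted to, so a consuming server only applies it once its own data center is included in the rollout. The types, stage names, and the `bot_model_features` key are assumptions made for the example; they are not Cloudflare’s actual Quicksilver interfaces.

```rust
// Hypothetical sketch: a configuration value distributed through a
// Quicksilver-like key-value store, tagged with the rollout stage it has
// been promoted to, so that servers outside that stage keep the previous
// value. These types are illustrative, not Cloudflare's real interfaces.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, PartialOrd)]
enum RolloutStage {
    Canary,        // a handful of test data centers
    EarlyAdopters, // a small percentage of customer traffic
    Global,        // every data center
}

struct ConfigUpdate {
    key: String,
    value: Vec<u8>,
    promoted_to: RolloutStage, // how far this version has been rolled out
}

struct Server {
    stage: RolloutStage, // the stage this data center belongs to
    applied: HashMap<String, Vec<u8>>,
}

impl Server {
    /// Apply an update only if this server's stage is already covered by the
    /// update's promotion level; otherwise the last known value stays live,
    /// so a bad change is visible only to the stages it has reached.
    fn maybe_apply(&mut self, update: &ConfigUpdate) {
        if self.stage <= update.promoted_to {
            self.applied.insert(update.key.clone(), update.value.clone());
        }
    }
}

fn main() {
    let mut canary = Server { stage: RolloutStage::Canary, applied: HashMap::new() };
    let mut early = Server { stage: RolloutStage::EarlyAdopters, applied: HashMap::new() };
    let mut global = Server { stage: RolloutStage::Global, applied: HashMap::new() };

    // A change that has only been promoted as far as the early-adopter stage.
    let update = ConfigUpdate {
        key: "bot_model_features".into(),
        value: b"v2".to_vec(),
        promoted_to: RolloutStage::EarlyAdopters,
    };

    canary.maybe_apply(&update);
    early.maybe_apply(&update);
    global.maybe_apply(&update);

    assert!(canary.applied.contains_key("bot_model_features"));
    assert!(early.applied.contains_key("bot_model_features"));
    assert!(!global.applied.contains_key("bot_model_features"));
}
```

The point of the gate is that a bad configuration version is only ever visible to the stage it has reached, rather than to every server at once.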
Cloudflare utilizes a Health Mediated Deployment (HMD) system for daily software updates. This framework requires each service-owning team to define success/failure metrics, a rollout plan, and rollback procedures. The HMD toolkit then carefully executes the plan, monitoring each step and automatically initiating rollbacks if failures occur.
Upon completion of “Code Orange,” configuration updates will adopt this HMD process. This change is expected to identify and resolve issues similar to those in the recent incidents much earlier, preventing widespread impact.
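To make the health-mediated flow described above concrete, here is a minimal sketch of the promotion loop: apply the change one stage at a time, evaluate a success metric after each stage, and roll back automatically if the metric degrades. The stage names, the error-rate metric, and the callback signatures are assumptions for the sake of the example, not Cloudflare’s actual HMD tooling.

```rust
// Minimal sketch of a health-mediated deployment loop. The stage list,
// health signal, and rollback hook are stand-ins for whatever a real
// service team would define; they are not Cloudflare's HMD interfaces.

struct Stage {
    name: &'static str,
    // Maximum acceptable error rate (e.g. 5xx fraction) observed after
    // the change reaches this stage.
    max_error_rate: f64,
}

fn deploy_with_health_checks(
    stages: &[Stage],
    apply_to_stage: impl Fn(&str),
    observed_error_rate: impl Fn(&str) -> f64,
    roll_back: impl Fn(&str),
) -> Result<(), String> {
    let mut completed: Vec<&str> = Vec::new();
    for stage in stages {
        apply_to_stage(stage.name);
        completed.push(stage.name);

        let rate = observed_error_rate(stage.name);
        if rate > stage.max_error_rate {
            // Health regression: unwind every stage touched so far and stop.
            for done in completed.iter().rev() {
                roll_back(done);
            }
            return Err(format!(
                "rolled back: error rate {:.4} exceeded {:.4} at stage {}",
                rate, stage.max_error_rate, stage.name
            ));
        }
    }
    Ok(())
}

fn main() {
    let stages = [
        Stage { name: "canary", max_error_rate: 0.01 },
        Stage { name: "5pct", max_error_rate: 0.005 },
        Stage { name: "global", max_error_rate: 0.005 },
    ];
    let result = deploy_with_health_checks(
        &stages,
        |s| println!("applying config to {s}"),
        |_| 0.002, // pretend the fleet stays healthy after each step
        |s| println!("rolling back {s}"),
    );
    println!("deployment result: {result:?}");
}
```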
Addressing Failure Modes Between Services
While improved configuration control is expected to prevent many incidents, errors are still anticipated. In both recent outages, localized network errors escalated to impact most of the technology stack, including the control plane used by customers.
Graduated rollouts must extend beyond geographic and user-group progression to include service progression, preventing failures from spreading between unrelated products, such as from Bot Management to the customer dashboard.
Cloudflare is reviewing interface contracts for all critical network products and services. The goal is to anticipate failures between interfaces and implement the most reasonable handling mechanisms.
Considering the Bot Management service failure, two key interfaces could have been designed to handle failure gracefully, potentially preventing customer impact. First, the interface reading the corrupted configuration file should have defaulted to a validated, stable state, allowing traffic to pass even if real-time bot detection fine-tuning was temporarily lost. Second, the interface between the core network software and the Bot Management module should not have defaulted to dropping traffic upon module failure. Instead, a default allowing traffic to pass with a basic classification would have been a more resilient approach.
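As an illustration of the “fail open with a safe default” idea described above, the sketch below keeps the last configuration that validated successfully and, if the module cannot produce a score at request time, returns a neutral classification instead of failing the request. The structure, validation rule, and scoring math are hypothetical simplifications, not the actual Bot Management code path.

```rust
// Hypothetical sketch of graceful degradation around a bot-scoring module.
// If a new configuration file is corrupt, keep serving with the last
// validated one; if scoring fails entirely, let traffic through with a
// neutral score rather than returning an error page.

#[derive(Clone)]
struct ClassifierConfig {
    feature_names: Vec<String>,
}

struct BotModule {
    // Last configuration that passed validation; never replaced by a
    // config that fails to parse or validate.
    last_good: ClassifierConfig,
}

#[derive(Debug)]
struct Classification {
    score: u8,      // 1 = almost certainly a bot, 99 = almost certainly human
    degraded: bool, // true when the neutral fallback was used
}

impl BotModule {
    /// Try to adopt a new configuration; on any error, keep the old one.
    fn reload(&mut self, raw: &str) {
        match Self::parse_and_validate(raw) {
            Ok(cfg) => self.last_good = cfg,
            Err(e) => eprintln!("keeping previous config, new one rejected: {e}"),
        }
    }

    fn parse_and_validate(raw: &str) -> Result<ClassifierConfig, String> {
        let features: Vec<String> = raw
            .lines()
            .map(|l| l.trim().to_string())
            .filter(|l| !l.is_empty())
            .collect();
        // Example validation: an unexpectedly large (or empty) feature list is
        // treated as corrupt rather than being propagated to the hot path.
        if features.is_empty() || features.len() > 200 {
            return Err(format!("unexpected feature count: {}", features.len()));
        }
        Ok(ClassifierConfig { feature_names: features })
    }

    /// Score a request; if scoring fails, pass traffic with a neutral score.
    fn classify(&self, request_features: &[f64]) -> Classification {
        match self.try_score(request_features) {
            Ok(score) => Classification { score, degraded: false },
            Err(_) => Classification { score: 50, degraded: true }, // fail open
        }
    }

    fn try_score(&self, request_features: &[f64]) -> Result<u8, String> {
        if request_features.len() != self.last_good.feature_names.len() {
            return Err("feature count mismatch".to_string());
        }
        let sum: f64 = request_features.iter().sum();
        Ok((sum / request_features.len() as f64 * 99.0).clamp(1.0, 99.0) as u8)
    }
}

fn main() {
    let mut module = BotModule {
        last_good: ClassifierConfig {
            feature_names: vec!["ja3".into(), "ua_entropy".into()],
        },
    };
    module.reload("");                              // corrupt update: rejected, old config kept
    println!("{:?}", module.classify(&[0.2, 0.9])); // normal scoring
    println!("{:?}", module.classify(&[0.2]));      // module error: neutral score, degraded
}
```

The `degraded` flag matters: failing open should still be observable, so operators can see when real-time bot detection is running on defaults rather than on fresh signals.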
Expediting Emergency Resolution
Incident resolution was prolonged because security systems restricted access to necessary tools and circular dependencies made internal systems unavailable.
Cloudflare’s tools are protected by authentication layers and granular access controls to safeguard customer data and prevent unauthorized access. While essential for security, these measures inadvertently hindered rapid response during critical incidents.
Circular dependencies also impacted user experience. For instance, during the November 18 incident, Turnstile, Cloudflare’s CAPTCHA-free bot solution, became inaccessible. Since Turnstile is integrated into the Cloudflare dashboard login process, users without active sessions or API service tokens were unable to log in and make critical changes during the outage.
Cloudflare teams will review and enhance all “break glass” procedures and associated technology. The aim is to ensure swift access to critical tools during emergencies while upholding security standards. This involves identifying and eliminating circular dependencies or establishing quick bypass mechanisms for incidents. Training exercises will also be increased to ensure all teams are proficient in these processes before future disaster scenarios.
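One way to break the kind of circular dependency described above is to give the login flow an explicit, audited fallback when the challenge service itself is unhealthy. The sketch below is a hypothetical illustration of that pattern: the health check, fallback second factor, and audit hook are invented for the example and are not Cloudflare’s actual dashboard login code.

```rust
// Hypothetical sketch of breaking a circular dependency in a login flow:
// if the challenge service (e.g. a Turnstile-like check) is unavailable,
// fall back to a secondary verification step and record the bypass for
// audit, instead of locking everyone out during an incident.

#[derive(Debug)]
enum LoginDecision {
    ChallengeVerified,
    FallbackVerified, // second factor accepted while the challenge service was down
    Denied,
}

fn attempt_login(
    challenge_service_healthy: bool,
    challenge_passed: impl Fn() -> bool,
    second_factor_valid: impl Fn() -> bool,
    audit_log: impl Fn(&str),
) -> LoginDecision {
    if challenge_service_healthy {
        if challenge_passed() {
            return LoginDecision::ChallengeVerified;
        }
        return LoginDecision::Denied;
    }

    // Challenge service is down: do not depend on it to let operators in.
    // Require a second factor instead and leave an audit trail.
    if second_factor_valid() {
        audit_log("login allowed via fallback path: challenge service unhealthy");
        return LoginDecision::FallbackVerified;
    }
    LoginDecision::Denied
}

fn main() {
    // Simulate an incident where the challenge service is unreachable but
    // the operator presents a valid hardware-key second factor.
    let decision = attempt_login(
        false,
        || false,
        || true,
        |msg| eprintln!("AUDIT: {msg}"),
    );
    println!("decision: {decision:?}");
}
```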
Timeline for Completion
The workstreams detailed in this post represent the highest priorities for Cloudflare’s product and engineering teams, with each mapping to a comprehensive plan.
Cloudflare is targeting the following objectives by the end of Q1, and in many cases sooner:
- All production systems will be covered by Health Mediated Deployments (HMD) for configuration management.
- Systems will be updated to properly handle failure modes for each product set.
- Processes will be established to ensure appropriate personnel have the necessary access for emergency remediation.
Some of these goals are ongoing, requiring continuous adaptation for new software launches and evolving security technology. Cloudflare acknowledges the impact of recent incidents on users and the Internet and is committed to making improvements, with updates to be shared as progress is made.

