
On December 5, 2025, at 08:47 UTC, a segment of Cloudflare’s network encountered substantial failures. The issue was resolved by 09:12 UTC, restoring all services after approximately 25 minutes of impact.
About 28% of Cloudflare’s total HTTP traffic was affected, impacting a specific subset of customers. Several conditions had to be met for a customer to experience the issues described below.
The problem did not stem from a cyber attack or any malicious activity. Instead, it was initiated by modifications to body parsing logic, implemented to detect and mitigate an industry-wide vulnerability recently disclosed in React Server Components.
Any service outage is unacceptable. Following the incident on November 18, details of the work underway to prevent similar occurrences will be published next week.
What happened
The graph below illustrates HTTP 500 errors served by the network during the incident (red line), in comparison to total unaffected Cloudflare traffic (green line).
Cloudflare’s Web Application Firewall (WAF) protects customers from malicious payloads by detecting and blocking them. To achieve this, Cloudflare’s proxy buffers HTTP request body content in memory for analysis. Previously, the buffer size was set to 128KB.
As part of ongoing efforts to protect React users from a critical vulnerability, CVE-2025-55182, a change began rolling out to increase the buffer size to 1MB, the default limit for Next.js applications, to ensure maximum customer protection.
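As an illustration only, the sketch below shows how a proxy might buffer a request body in memory up to a configurable cap before WAF analysis. This is hypothetical Lua, not Cloudflare’s proxy code; the function and variable names are invented, and only the 128KB-to-1MB figures come from the description above.

-- Hypothetical sketch: buffer a request body in memory up to a configurable
-- cap before WAF analysis. Names and structure are illustrative only.
local BODY_BUFFER_LIMIT = 1024 * 1024  -- raised from 128 * 1024 in the change described above

-- read_chunk stands in for however the proxy streams the request body
local function buffer_request_body(read_chunk)
    local parts, total = {}, 0
    while true do
        local chunk = read_chunk()
        if not chunk then break end
        total = total + #chunk
        if total > BODY_BUFFER_LIMIT then
            -- only the first BODY_BUFFER_LIMIT bytes are kept for inspection
            local keep = #chunk - (total - BODY_BUFFER_LIMIT)
            table.insert(parts, chunk:sub(1, keep))
            break
        end
        table.insert(parts, chunk)
    end
    return table.concat(parts)
end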
This initial modification was deployed using a gradual rollout system. During this process, it was observed that an internal WAF testing tool did not support the increased buffer size. Since this internal tool was not critical at the time and did not affect customer traffic, a second change was made to disable it.
The second change, disabling the WAF testing tool, was implemented via a global configuration system. This system propagates changes across the entire server fleet within seconds, rather than gradually. This system is currently under review following an outage on November 18.
Unfortunately, in the FL1 version of the proxy, under specific conditions, disabling the WAF rule testing tool led to an error state, resulting in HTTP 500 error codes being served from the network.
As the change propagated across the network, the FL1 proxy hit a bug in its rules module, triggering the following Lua exception:
[lua] Failed to run module rulesets callback late_routing: /usr/local/nginx-fl/lua/modules/init.lua:314: attempt to index field 'execute' (a nil value)
This resulted in HTTP 500 errors.
The issue was identified shortly after the change was applied; the change was reverted and the revert fully propagated by 09:12 UTC, after which all traffic was correctly served.
Customers whose web assets were served by the older FL1 proxy AND had the Cloudflare Managed Ruleset deployed were impacted. All requests for websites in this configuration returned an HTTP 500 error, with minor exceptions for certain test endpoints like /cdn-cgi/trace.
Customers without the specified configuration were not affected. Traffic served by the China network also remained unaffected.
The runtime error
Cloudflare’s ruleset system comprises sets of rules evaluated for each incoming request. A rule includes a filter to select traffic and an action to apply an effect, such as “block,” “log,” or “skip.” An “execute” action triggers the evaluation of another ruleset.
An internal logging system leverages this feature to evaluate new rules before public release. A top-level ruleset executes another ruleset containing test rules, which were the target of the disablement attempt.
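To make the terminology concrete, the following hypothetical Lua tables show a top-level ruleset whose final rule uses an “execute” action to trigger evaluation of a second ruleset of test rules. The field names and filter expressions are illustrative and do not reflect Cloudflare’s internal schema.

-- Hypothetical shape of two rulesets: a top-level ruleset whose last rule
-- "execute"s a second ruleset containing test rules. Names are illustrative.
local test_ruleset = {
    id = "internal-test-rules",
    rules = {
        { id = "test-rule-1", filter = 'http.request.uri.path contains "/login"', action = "log" },
    },
}

local managed_ruleset = {
    id = "cloudflare-managed",
    rules = {
        { id = "block-bad-payloads", filter = "waf.score lt 20", action = "block" },
        -- the "execute" action triggers evaluation of another ruleset
        { id = "run-test-rules", filter = "true", action = "execute", execute = { ruleset = test_ruleset } },
    },
}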
A killswitch subsystem, integrated into the ruleset system, allows for rapid disabling of misbehaving rules. This killswitch system receives data from the global configuration system previously mentioned. It has been utilized in past incidents to mitigate issues, following a defined Standard Operating Procedure, which was adhered to in this incident.
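The interaction between the killswitch and rule evaluation can be sketched as follows. This is a simplified, hypothetical reconstruction rather than FL1 source code, but it shows the key property: a killswitched rule is skipped, so its result never receives an execute payload.

-- Hypothetical, simplified evaluation loop. A killswitched rule is skipped,
-- so its result carries the action name but no populated `execute` table.
local killswitched = { ["run-test-rules"] = true }  -- fed by the global configuration system

local function evaluate_ruleset(ruleset)
    local results = {}
    for _, rule in ipairs(ruleset.rules) do
        local result = { id = rule.id, action = rule.action }
        if killswitched[rule.id] then
            result.skipped = true  -- rule disabled: sub-ruleset is never evaluated, result.execute stays nil
        elseif rule.action == "execute" then
            result.execute = { results = evaluate_ruleset(rule.execute.ruleset) }
        end
        table.insert(results, result)
    end
    return results
end

Note that for a skipped rule, result.execute is never populated; that detail matters for the failure described next.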
However, a killswitch had not previously been applied to a rule with an “execute” action. When the killswitch was activated, the code correctly bypassed the evaluation of the execute action and did not evaluate the associated sub-ruleset. Nevertheless, an error occurred during the processing of the overall ruleset evaluation results:
-- assumes rule_result.execute is populated; it is nil when the rule was skipped by a killswitch
if rule_result.action == "execute" then
    rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
end
The code assumes that if a rule result has action="execute", the rule_result.execute object will exist. Because the rule was skipped by the killswitch, rule_result.execute was nil, and attempting to index into it raised the Lua error shown above.
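One way to guard against this class of error, shown purely as an illustration of the missing nil check rather than the actual fix that was deployed, is to verify that rule_result.execute exists before indexing into it:

-- Illustrative guard (not the deployed fix): only index into rule_result.execute
-- when it was actually populated, i.e. the rule was not skipped.
if rule_result.action == "execute" and rule_result.execute ~= nil then
    rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
end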
This is a simple code error that remained undetected for many years. Such errors are typically prevented by languages with strong type systems. In the FL2 proxy, which is a Rust-based replacement for this code, this error did not manifest.
What about the changes being made after the incident on November 18, 2025?
An unrelated change caused a similar, longer availability incident two weeks prior, on November 18, 2025. In both instances, a deployment intended to mitigate a security issue for customers propagated across the entire network, leading to errors for most of the customer base.
Following that incident, discussions were held with numerous customers, and plans were shared for modifications to prevent single updates from causing such widespread impact. Those changes would likely have prevented the impact of this incident, but their deployment is not yet complete.
It is disappointing that this work is not yet complete; it remains a top organizational priority. Specifically, the following projects are expected to help contain the impact of similar changes:
- Enhanced Rollouts & Versioning: Similar to the slow deployment of software with strict health validation, data used for rapid threat response and general configuration requires comparable safety and blast-radius mitigation features. This encompasses health validation and quick rollback capabilities, among others.
- Streamlined break glass capabilities: Critical operations must remain achievable even when facing additional types of failures. This applies to internal services and all standard methods customers use to interact with the Cloudflare control plane.
- "Fail-Open" Error Handling: As part of resilience efforts, hard-fail logic that was incorrectly applied across critical Cloudflare data-plane components is being replaced. If a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests. Some services may offer customers the option to fail open or closed in certain scenarios, including drift-prevention capabilities to ensure continuous enforcement. A minimal sketch of this pattern follows this list.
A detailed breakdown of all ongoing resiliency projects, including those listed above, will be published before the end of next week. While this work is in progress, all network changes are being locked down to ensure improved mitigation and rollback systems are in place before further deployments.
The frequency and proximity of these incidents are deemed unacceptable for a network of this scale. Cloudflare extends an apology for the impact and distress this has caused customers and the Internet as a whole.
Timeline
Time (UTC)   Status              Description
08:47        INCIDENT start      Configuration change deployed and propagated to the network
08:48        Full impact         Change fully propagated
08:50        INCIDENT declared   Automated alerts
09:11        Change reverted     Configuration change reverted and propagation started
09:12        INCIDENT end        Revert fully propagated, all traffic restored

