
On November 18, 2025, at 11:20 UTC, Cloudflare’s network began experiencing significant failures in delivering core network traffic. Internet users attempting to access sites protected by Cloudflare observed error pages indicating a failure within the network.
The problem was not a result of a cyberattack or malicious activity. Instead, it was initiated by a modification to a database system’s permissions. This change caused the database to generate duplicate entries in a “feature file” utilized by the Bot Management system. Consequently, the feature file’s size doubled, and this oversized file was then distributed across the entire network.
The software responsible for routing network traffic relies on this feature file to maintain the Bot Management system’s effectiveness against evolving threats. However, the software had a predefined limit for the feature file’s size, which was exceeded by the doubled file. This led to the software’s failure.
Initially, the symptoms observed led to a suspicion of a hyper-scale DDoS attack. However, the core issue was correctly identified, allowing for the cessation of the oversized feature file’s propagation and its replacement with a previous, valid version. Core traffic largely returned to normal by 14:30 UTC. Efforts continued over the subsequent hours to manage increased load across different parts of the network as traffic resumed. By 17:06 UTC, all Cloudflare systems were operating normally.
The outage significantly impacted customers and the broader Internet. Given Cloudflare’s critical role, any system outage is considered unacceptable. The inability of the network to route traffic for a period was a serious concern. This post provides a detailed account of the incident, including system and process failures, and outlines initial steps to prevent similar outages in the future.
The Outage
The volume of 5xx error HTTP status codes served by the Cloudflare network is typically very low, a state that persisted until the outage began.
Before 11:20 UTC, the observed 5xx error volume represented the expected baseline. The subsequent spike and fluctuations indicated system failure caused by loading an incorrect feature file. A notable aspect was the system’s intermittent recovery, which was atypical for an internal error.
The issue stemmed from the feature file being generated every five minutes by a query on a ClickHouse database cluster, which was undergoing a gradual update for permissions management. Incorrect data was produced only when the query executed on an updated portion of the cluster. This led to a situation where, every five minutes, either a correct or an erroneous set of configuration files could be generated and quickly distributed across the network.
The fluctuating nature of the problem, with the system recovering and failing repeatedly as both valid and invalid configuration files were distributed, initially obscured the root cause. This led to an initial suspicion that an attack might be underway. Ultimately, all ClickHouse nodes began generating the faulty configuration file, and the system stabilized in a persistent failure state.
Errors persisted until the underlying issue was identified and resolved, beginning at 14:30 UTC. The resolution involved halting the generation and distribution of the faulty feature file, manually injecting a known good file into the distribution queue, and then initiating a restart of the core proxy.
The extended tail of errors after the initial recovery at 14:30 reflects the restart of remaining services that had entered an erroneous state, with 5xx error code volume returning to normal by 17:06 UTC.
The following services experienced impact:
- Core CDN and security services: HTTP 5xx status codes were observed, with end users receiving typical error pages.
- Turnstile: Turnstile failed to load.
- Workers KV: A significantly elevated level of HTTP 5xx errors occurred as requests to KV’s front-end gateway failed due to the core proxy’s malfunction.
- Dashboard: Although largely operational, most users could not log in because Turnstile was unavailable on the login page.
- Email Security: Email processing and delivery remained unaffected. However, a temporary loss of access to an IP reputation source reduced spam-detection accuracy and prevented some new-domain-age detections from triggering, with no critical customer impact observed. Failures were also noted in some Auto Move actions; all affected messages have since been reviewed and remediated.
- Access: Widespread authentication failures affected most users from the incident’s start until a rollback was initiated at 13:05 UTC. Existing Access sessions remained unaffected. All failed authentication attempts resulted in an error page, preventing users from reaching target applications during the authentication failure. Successful logins during this period were correctly logged. Any Access configuration updates attempted at that time either failed or propagated very slowly, but all configuration updates are now recovered.
In addition to HTTP 5xx errors, significant increases in CDN response latency were observed during the impact period. This was attributed to high CPU consumption by debugging and observability systems, which automatically augment uncaught errors with extra debugging information.
How Cloudflare Processes Requests, and the Cause of the Outage
Every request directed to Cloudflare follows a defined path through its network. Whether originating from a browser, a mobile application API call, or automated service traffic, these requests first terminate at the HTTP and TLS layer. They then proceed into the core proxy system, known internally as FL (Frontline), and subsequently through Pingora, which handles cache lookups or data retrieval from the origin as necessary.
Further details on how the core proxy operates have been shared in an earlier blog post.
As a request moves through the core proxy, various security and performance products within the network are applied. The proxy implements each customer’s specific configuration and settings, encompassing tasks from enforcing WAF rules and DDoS protection to directing traffic to the Developer Platform and R2. This is achieved via domain-specific modules that apply configurations and policy rules to the traffic.
The Bot Management module was identified as the source of the outage.
Cloudflare’s Bot Management incorporates a machine learning model designed to generate bot scores for each request traversing the network. These bot scores enable customers to manage which automated traffic is permitted to access their sites.
The model processes a “feature” configuration file as input. Within this context, a feature refers to an individual characteristic utilized by the machine learning model to predict whether a request is automated. The feature configuration file compiles these individual features.
This feature file is updated every few minutes and distributed across the entire network, enabling rapid responses to changes in Internet traffic flows, including new bot types and attacks. Frequent and swift deployment is crucial due to the rapid evolution of malicious tactics.
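To make the shape of this input concrete, the following is a minimal, hypothetical sketch of how such a feature configuration could be represented and parsed. The field names and the use of JSON are assumptions for illustration; Cloudflare’s actual schema is internal and not described in this post.

```rust
// Hypothetical representation of a Bot Management feature configuration file.
// Field names and the JSON format are illustrative assumptions.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Feature {
    name: String,         // an individual request characteristic used by the model
    feature_type: String, // e.g. numeric or categorical
}

#[derive(Debug, Deserialize)]
struct FeatureFile {
    features: Vec<Feature>, // one entry per machine learning feature
}

fn parse_feature_file(raw: &str) -> Result<FeatureFile, serde_json::Error> {
    // The file is regenerated every few minutes and pushed to every server,
    // so parsing must succeed quickly and predictably on each refresh.
    serde_json::from_str(raw)
}
```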
A change in the behavior of the underlying ClickHouse query that generates this file (detailed further below) resulted in a large number of duplicate “feature” rows. This changed the size of the feature configuration file, which had previously been a fixed size, and caused the bots module to trigger an error.
Consequently, the core proxy system, which processes customer traffic, returned HTTP 5xx error codes for any traffic dependent on the bots module. This impact extended to Workers KV and Access, both of which rely on the core proxy.
Independently of this incident, customer traffic was being migrated to a new version of the proxy service, internally designated FL2. Both proxy versions were affected by the issue, though the observed impact varied.
Customers utilizing the new FL2 proxy engine experienced HTTP 5xx errors. Conversely, customers on the older FL proxy engine did not encounter errors, but bot scores were incorrectly generated, leading to all traffic receiving a bot score of zero. This meant customers with rules configured to block bots likely observed numerous false positives. Customers not using bot scores in their rules experienced no impact.
A misleading symptom that initially suggested an attack was the unavailability of Cloudflare’s status page. This page is hosted entirely external to Cloudflare’s infrastructure, with no internal dependencies. Although ultimately a coincidence, its failure led some personnel diagnosing the issue to suspect a coordinated attack targeting both Cloudflare’s systems and its status page. Visitors to the status page during this period encountered an error message.
Internal incident communications expressed concern that the event might be a continuation of recent high-volume Aisuru DDoS attacks.
The Query Behavior Change
As previously noted, a change in the underlying query behavior led to the feature file containing numerous duplicate rows. The database system involved utilizes ClickHouse software.
To understand the context, it is useful to review how ClickHouse distributed queries function. A ClickHouse cluster comprises multiple shards. To query data across all shards, distributed tables (powered by the Distributed table engine) exist within a database named ‘default’. The Distributed engine then queries underlying tables in a database named ‘r0’, where data is stored on each shard of the ClickHouse cluster.
Queries directed to distributed tables execute via a shared system account. As part of ongoing efforts to enhance the security and reliability of distributed queries, work is underway to transition these queries to run under initial user accounts instead.
Previously, ClickHouse users could only view tables in the ‘default’ database when querying table metadata from ClickHouse system tables, such as system.tables or system.columns.
Given that users already possessed implicit access to underlying tables in ‘r0’, a change was implemented at 11:05 UTC to make this access explicit, allowing users to view the metadata of these tables. This ensures that all distributed subqueries execute under the initial user, enabling more granular evaluation of query limits and access grants, thereby preventing a single problematic subquery from affecting others.
The aforementioned change provided all users with accurate metadata for tables they could access. However, it had previously been assumed that the column list returned by a query like the following would include only tables in the ‘default’ database:
```sql
SELECT name, type FROM system.columns WHERE table = 'http_requests_features' order by name;
```
Note that the query does not filter on the database name. As the explicit grants were progressively rolled out across the ClickHouse cluster after the 11:05 UTC change, the query began returning duplicate columns, corresponding to the underlying tables stored in the ‘r0’ database.
This query type was, unfortunately, used by the Bot Management feature file generation logic to construct each input “feature” for the file discussed earlier.
Before the change, the query returned a single list of feature columns, corresponding only to the table in the ‘default’ database.
However, with the additional permissions granted to the user, the response subsequently included all metadata from the ‘r0’ schema. This effectively more than doubled the rows in the response, ultimately increasing the number of features in the final file output.
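As a simplified, hypothetical illustration (the column names below are placeholders, not the actual feature names), each feature column now appeared twice in the result: once from the table in the ‘default’ database and once from the matching underlying table in ‘r0’:

```
name        type
feature_a   Float64
feature_a   Float64
feature_b   Float64
feature_b   Float64
feature_c   Float64
feature_c   Float64
```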
Memory Preallocation
Each module operating on the proxy service incorporates limits to prevent unbounded memory consumption and to preallocate memory for performance optimization. Specifically, the Bot Management system has a runtime limit on the number of machine learning features, currently set at 200, significantly higher than the typical usage of approximately 60 features. This limit is in place to facilitate memory preallocation for performance reasons.
When the faulty file, containing over 200 features, was distributed to the servers, this limit was exceeded and the system panicked: the limit check in the FL2 Rust code returned an error, and the calling code unwrapped that result rather than handling it.
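The pattern looked roughly like the following sketch. This is a hedged, simplified reconstruction for illustration only: the names (MAX_FEATURES, check_feature_count, apply_feature_file) are assumptions, not the actual FL2 source.

```rust
// Illustrative sketch only, not the production FL2 code.

const MAX_FEATURES: usize = 200; // memory for features is preallocated up to this limit

/// Validate the number of features in a freshly distributed feature file.
fn check_feature_count(feature_count: usize) -> Result<(), String> {
    if feature_count > MAX_FEATURES {
        // The duplicated, oversized feature file took this branch.
        return Err(format!("too many features: {feature_count} > {MAX_FEATURES}"));
    }
    Ok(())
}

fn apply_feature_file(feature_count: usize) {
    // Unwrapping the Err value panics the worker thread rather than degrading
    // gracefully, which surfaced to end users as HTTP 5xx errors.
    check_feature_count(feature_count).unwrap();
}
```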
This led to the following panic, which subsequently caused a 5xx error:
```
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
```
Other Impact During the Incident
Other systems dependent on the core proxy, including Workers KV and Cloudflare Access, were affected during the incident. Impact to these systems was mitigated at 13:04 UTC when a patch was implemented for Workers KV to bypass the core proxy. Consequently, all downstream systems relying on Workers KV (such as Access) experienced a reduced error rate.
The Cloudflare Dashboard also experienced impact, attributed to both the internal use of Workers KV and the deployment of Cloudflare Turnstile within its login process.
Turnstile was affected by the outage, preventing customers without an active dashboard session from logging in. This manifested as reduced availability during two distinct periods: from 11:30 to 13:10 UTC, and again between 14:40 and 15:30 UTC.
The initial period of reduced availability, from 11:30 to 13:10 UTC, resulted from the impact on Workers KV, a dependency for certain control plane and dashboard functions. Restoration occurred at 13:10 UTC when Workers KV bypassed the core proxy system. The second period of dashboard impact followed the restoration of feature configuration data. A backlog of login attempts subsequently overwhelmed the dashboard. This backlog, combined with retry attempts, led to elevated latency and diminished dashboard availability. Scaling control plane concurrency successfully restored availability by approximately 15:30 UTC.
Remediation and Follow-Up Steps
With systems now online and operating normally, efforts have commenced to enhance their resilience against future failures of this nature. Specific actions include:
- Hardening the ingestion process for Cloudflare-generated configuration files, treating them with the same rigor as user-generated input (a sketch of this idea follows the list).
- Implementing additional global kill switches for features.
- Preventing core dumps or other error reports from overwhelming system resources.
- Conducting a comprehensive review of failure modes for error conditions across all core proxy modules.
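As a hedged illustration of the first item above, ingestion could validate a newly generated configuration file and keep serving the last known-good version when validation fails, instead of unwrapping the error. This is a sketch under assumed names, not a description of Cloudflare’s planned implementation.

```rust
// Hypothetical "fail safe" ingestion: validate the candidate file and fall back
// to the last known-good configuration on failure. All names are illustrative.

const MAX_FEATURES: usize = 200;

struct FeatureFile {
    features: Vec<String>,
}

fn validate(file: &FeatureFile) -> Result<(), String> {
    if file.features.len() > MAX_FEATURES {
        return Err(format!(
            "feature file rejected: {} features exceeds the limit of {}",
            file.features.len(),
            MAX_FEATURES
        ));
    }
    Ok(())
}

/// Apply a newly distributed file, keeping the current configuration if it is invalid.
fn apply_or_keep_current(current: FeatureFile, candidate: FeatureFile) -> FeatureFile {
    match validate(&candidate) {
        Ok(()) => candidate,
        Err(reason) => {
            // Treat the internally generated file like untrusted input: log the
            // rejection and keep the last known-good configuration.
            eprintln!("{reason}; keeping previous feature file");
            current
        }
    }
}
```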
This incident marked Cloudflare’s most significant outage since 2019. While previous outages have rendered the dashboard unavailable or temporarily impacted newer features, no other outage in over six years has halted the majority of core network traffic.
An outage of this magnitude is considered unacceptable. Systems are architected for high resilience to ensure continuous traffic flow. Past outages have consistently prompted the development of new, more resilient systems.
Incident Timeline (UTC)
- 11:05 – Database access control change deployed.
- 11:28 – Impact starts. Deployment reached customer environments, with the first errors observed on customer HTTP traffic.
- 11:32–13:05 – Investigation of Workers KV service issues. Elevated traffic levels and errors to the Workers KV service were investigated. The initial symptom appeared to be a degraded Workers KV response rate, causing downstream impact on other Cloudflare services. Mitigations, including traffic manipulation and account limiting, were attempted to restore the Workers KV service to normal operating levels. The issue was first detected by an automated test at 11:31, with manual investigation beginning at 11:32. The incident call was initiated at 11:35.
- 13:05 – Workers KV and Cloudflare Access bypass implemented; impact reduced. During the investigation, internal system bypasses were utilized for Workers KV and Cloudflare Access, reverting them to a prior version of the core proxy. Although the issue was also present in earlier proxy versions, the impact was less severe.
- 13:37 – Work focused on rollback of the Bot Management configuration file. Confidence grew that the Bot Management configuration file had triggered the incident. Teams pursued multiple workstreams to repair the service, with the fastest being the restoration of a previous version of the file.
- 14:24 – Creation and propagation of new Bot Management configuration files stopped. The Bot Management module was identified as the source of the 500 errors, caused by a faulty configuration file. Automatic deployment of new Bot Management configuration files was halted.
- 14:24 – Test of new file complete. Successful recovery was observed using the old version of the configuration file, prompting efforts to accelerate the global fix.
- 14:30 – Main impact resolved. Downstream impacted services began observing reduced errors. A correct Bot Management configuration file was deployed globally, and most services started operating correctly.
- 17:06 – All services resolved. Impact ends. All downstream services were restarted, and all operations were fully restored.

