    How Workers Powers Cloudflare’s Internal Maintenance Scheduling Pipeline

    By Samuel Alejandro, February 5, 2026

    Cloudflare maintains data centers in over 330 cities worldwide. While this extensive network might suggest that minor disruptions could go unnoticed during data center operations, disruptive maintenance demands precise planning. As Cloudflare’s infrastructure expanded, manually coordinating these complex operations among infrastructure and network specialists became increasingly challenging.

    Tracking every overlapping maintenance request or considering all customer-specific routing rules in real time is no longer feasible for human operators. Manual oversight alone could not ensure that a standard hardware update in one region would not unintentionally interfere with a critical service path elsewhere.

    A centralized, automated system was necessary to safeguard the network, providing a comprehensive view of its entire state. Developing this scheduler on Cloudflare Workers enabled programmatic enforcement of safety constraints, ensuring that operational speed does not compromise the reliability of services customers rely on.

    This article details the system’s construction and its current performance.

    Building a System to De-risk Critical Maintenance Operations

    Consider an edge router, part of a small, redundant group of gateways linking the public Internet to numerous Cloudflare data centers within a metropolitan area. In a densely populated city, it is crucial to prevent multiple data centers behind this router cluster from being isolated due to simultaneous router outages.

    A further maintenance challenge arises with the Zero Trust product, Dedicated CDN Egress IPs (referred to as “Aegis” for brevity, its former name). This product enables customers to select specific data centers for their user traffic to exit Cloudflare, routing to geographically proximate origin servers for minimal latency. If all chosen data centers for a customer were simultaneously offline, it would result in increased latency and potential 5xx errors, which must be prevented.

    The maintenance scheduler addresses such issues by ensuring at least one edge router remains active in a given area. During maintenance scheduling, it can detect if multiple planned events would simultaneously take all data centers within a customer’s Aegis pool offline.

    Prior to the scheduler’s implementation, such concurrent disruptive events could lead to customer downtime. Now, the scheduler alerts internal operators to potential conflicts, facilitating the proposal of alternative times to prevent overlaps with other related data center maintenance.

    These operational scenarios, including edge router availability and customer-specific rules, are defined as maintenance constraints, enabling more predictable and secure maintenance planning.

    Maintenance Constraints

    Each constraint begins with a set of proposed maintenance items, such as a network router or a list of servers. The system then identifies all calendar maintenance events that overlap with the proposed maintenance time window.
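
    As a minimal sketch of that overlap check, with hypothetical types (the scheduler's actual calendar model is not shown here):

    interface MaintenanceWindow {
      start: Date
      end: Date
    }

    // Two windows overlap when each one starts before the other ends.
    function overlaps(a: MaintenanceWindow, b: MaintenanceWindow): boolean {
      return a.start.getTime() < b.end.getTime() && b.start.getTime() < a.end.getTime()
    }

    // Collect every calendar event that collides with a proposed window.
    function findOverlapping(proposed: MaintenanceWindow, calendar: MaintenanceWindow[]): MaintenanceWindow[] {
      return calendar.filter((event) => overlaps(proposed, event))
    }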

    Subsequently, product APIs are aggregated, including a list of Aegis customer IP pools. Aegis provides a set of IP ranges where a customer has requested egress from specific data center IDs, as illustrated below.

    [
        {
          "cidr": "104.28.0.32/32",
          "pool_name": "customer-9876",
          "port_slots": [
            {
              "dc_id": 21,
              "other_colos_enabled": true
            },
            {
              "dc_id": 45,
              "other_colos_enabled": true
            }
          ],
          "modified_at": "2023-10-22T13:32:47.213767Z"
        }
    ]

    In this example, data centers 21 and 45 are interdependent, as at least one must remain online for Aegis customer 9876 to receive egress traffic from Cloudflare. Attempting to take both data centers 21 and 45 offline concurrently would trigger an alert from the coordinator, indicating potential unintended consequences for that customer's workload.
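
    A minimal sketch of that check, with hypothetical types and names (the production constraint shown later in this article works against a graph interface rather than raw pool objects):

    interface AegisPool {
      pool_name: string
      // Data center IDs the customer has selected for egress.
      dc_ids: number[]
    }

    // Flag any pool whose every data center appears in the proposed maintenance set.
    function fullyOfflinePools(pools: AegisPool[], dcsUnderMaintenance: Set<number>): string[] {
      return pools
        .filter((pool) => pool.dc_ids.every((id) => dcsUnderMaintenance.has(id)))
        .map((pool) => pool.pool_name)
    }

    // fullyOfflinePools([{ pool_name: "customer-9876", dc_ids: [21, 45] }], new Set([21, 45]))
    // returns ["customer-9876"], so the coordinator would raise an alert.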

    An initial, simpler approach involved loading all data—including server relationships, product configurations, and metrics for product and infrastructure health—into a single Worker to compute constraints. However, this method encountered "out of memory" errors even during the proof-of-concept stage.

    It became necessary to consider Workers’ platform limits more carefully. This meant loading only the essential data required for processing a constraint's business logic. For instance, a maintenance request for a router in Frankfurt, Germany, would not typically require data from Australia due to a lack of regional overlap. Therefore, data loading should be restricted to neighboring data centers in Germany, necessitating a more efficient method for processing dataset relationships.

    Graph Processing on Workers

    Analyzing the constraints revealed a recurring pattern: each constraint could be reduced to two fundamental concepts—objects and associations. In graph theory, these correspond to vertices and edges. An object might be a network router, while an association could represent the Aegis pools within a data center that depend on that router's online status. Drawing inspiration from Facebook’s TAO research paper, a graph interface was developed for the product and infrastructure data. The API is structured as follows:

    type ObjectID = string

    interface MainTAOInterface<TObject, TAssoc, TAssocType> {
      // Fetch a single object (for example, a router or a data center) by ID.
      object_get(id: ObjectID): Promise<TObject | undefined>

      // Stream all associations of a given type that originate at id1.
      assoc_get(id1: ObjectID, atype: TAssocType): AsyncIterable<TAssoc>

      // Count associations of a given type without materializing them
      // (used by the constraint below via assoc_count).
      assoc_count(id1: ObjectID, atype: TAssocType): Promise<number>
    }

    A key realization was that associations are typed. For instance, a constraint would invoke the graph interface to retrieve specific Aegis product data.

    async function constraint(c: AppContext, aegis: TAOAegisClient, datacenters: string[]): Promise<Record<string, PoolAnalysis>> {
      // For each data center in the proposed maintenance, collect the Aegis pools it belongs to.
      const datacenterEntries = await Promise.all(
        datacenters.map(async (dcID) => {
          const iter = aegis.assoc_get(c, dcID, AegisAssocType.DATACENTER_INSIDE_AEGIS_POOL)
          const pools: string[] = []
          for await (const assoc of iter) {
            pools.push(assoc.id2)
          }
          return [dcID, pools] as const
        }),
      )

      const datacenterToPools = new Map<string, string[]>(datacenterEntries)
      const uniquePools = new Set<string>()
      for (const pools of datacenterToPools.values()) {
        for (const pool of pools) uniquePools.add(pool)
      }

      // Count how many data centers each affected pool contains in total.
      const poolTotalsEntries = await Promise.all(
        [...uniquePools].map(async (pool) => {
          const total = await aegis.assoc_count(c, pool, AegisAssocType.AEGIS_POOL_CONTAINS_DATACENTER)
          return [pool, total] as const
        }),
      )

      const poolTotals = new Map<string, number>(poolTotalsEntries)
      const poolAnalysis: Record<string, PoolAnalysis> = {}
      for (const [dcID, pools] of datacenterToPools.entries()) {
        for (const pool of pools) {
          // Accumulate affected data centers per pool rather than overwriting
          // when more than one data center in the proposal belongs to the same pool.
          const existing = poolAnalysis[pool]
          if (existing) {
            existing.affectedDatacenters.add(dcID)
          } else {
            poolAnalysis[pool] = {
              affectedDatacenters: new Set([dcID]),
              totalDatacenters: poolTotals.get(pool) ?? 0,
            }
          }
        }
      }

      return poolAnalysis
    }

    The code above utilizes two association types:

    1. DATACENTER_INSIDE_AEGIS_POOL, which identifies the Aegis customer pools a data center belongs to.

    2. AEGIS_POOL_CONTAINS_DATACENTER, which identifies the data centers required by an Aegis pool to serve traffic.

    These associations function as inverted indices. While the access pattern remains consistent, the graph implementation now offers greater control over data querying. Previously, all Aegis pools had to be loaded into memory and filtered within the constraint's business logic. Now, only the data relevant to the application is directly fetched.
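
    For illustration, the two types can be modeled as inverse edges in the graph (the enum below mirrors the association names used in the constraint; its exact representation is an assumption):

    // The two association types are inverses of one another: following
    // DATACENTER_INSIDE_AEGIS_POOL from a data center yields pools, while
    // AEGIS_POOL_CONTAINS_DATACENTER walks the same edge in the other direction.
    enum AegisAssocType {
      DATACENTER_INSIDE_AEGIS_POOL = "DATACENTER_INSIDE_AEGIS_POOL",
      AEGIS_POOL_CONTAINS_DATACENTER = "AEGIS_POOL_CONTAINS_DATACENTER",
    }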

    This interface is powerful because the graph implementation can enhance performance transparently, without adding complexity to the business logic. This approach leverages the scalability of Workers and Cloudflare’s CDN to rapidly retrieve data from internal systems.

    Fetch Pipeline

    Adopting the new graph implementation led to more targeted API requests. This change dramatically reduced response sizes by a factor of 100, shifting from a few large requests to numerous smaller ones.

    Although this resolved the memory overload issue, it introduced a new challenge: a subrequest problem. Instead of a few large HTTP requests, the system now generated an order of magnitude more small requests, consistently exceeding subrequest limits.

    To address this, a sophisticated middleware layer was developed between the graph implementation and the fetch API.

    export const fetchPipeline = new FetchPipeline()
      // Collapse concurrent identical requests into a single in-flight fetch.
      .use(requestDeduplicator())
      // Keep the most recently used responses in Worker memory.
      .use(lruCacher({
        maxItems: 100,
      }))
      // Cache GET responses at the edge via the Workers Cache API.
      .use(cdnCacher())
      // Retry transient failures with exponential backoff and jitter.
      .use(backoffRetryer({
        retries: 3,
        baseMs: 100,
        jitter: true,
      }))
      .handler(terminalFetch);

    Inspired by Go’s singleflight package, the initial middleware component in the fetch pipeline deduplicates in-flight HTTP requests. This ensures all requests for the same data await a single Promise, preventing duplicate requests within the same Worker. Following this, a lightweight Least Recently Used (LRU) cache is employed to store previously seen requests internally.
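
    A minimal sketch of that first step, keyed on URL for simplicity (the real middleware's keying, typing, and error handling are not shown here):

    // Callers requesting the same URL while a fetch is in flight share one Promise.
    const inflight = new Map<string, Promise<Response>>()

    async function dedupedFetch(url: string): Promise<Response> {
      const existing = inflight.get(url)
      if (existing) {
        // Clone so every caller can consume the response body independently.
        return (await existing).clone()
      }
      const promise = fetch(url).finally(() => inflight.delete(url))
      inflight.set(url, promise)
      return (await promise).clone()
    }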

    After these steps, the Workers Cache API (caches.default) is used to cache GET requests in the Cloudflare data center where the Worker runs. Because the data sources have very different freshness requirements, Time-To-Live (TTL) values are selected per source. For instance, real-time data is cached for only one minute, relatively static infrastructure data might be cached for 1–24 hours, and infrequently updated power management data can be cached at the edge for even longer.
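
    As an illustrative sketch of that layer using the Workers Cache API (the helper name and TTL default are assumptions; types come from @cloudflare/workers-types):

    // Serve GET requests from the local data center cache when possible.
    async function cdnCachedFetch(request: Request, ctx: ExecutionContext, ttlSeconds = 3600): Promise<Response> {
      const cache = caches.default

      const hit = await cache.match(request)
      if (hit) return hit

      const response = await fetch(request)
      if (request.method === "GET" && response.ok) {
        // Re-wrap the response so its headers are mutable, then set the edge TTL.
        const cacheable = new Response(response.body, response)
        cacheable.headers.set("Cache-Control", `max-age=${ttlSeconds}`)
        // Write to the cache without blocking the caller.
        ctx.waitUntil(cache.put(request, cacheable.clone()))
        return cacheable
      }
      return response
    }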

    Beyond these layers, standard exponential backoff, retries, and jitter are implemented. This minimizes wasted fetch calls when a downstream resource is temporarily unavailable. A slight backoff increases the likelihood of successful subsequent requests. Conversely, continuous requests without backoff would quickly exceed subrequest limits if the origin began returning 5xx errors.
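
    A minimal sketch of that retry behavior (the parameter names echo the pipeline configuration above; the wrapper itself is illustrative and only retries 5xx responses):

    // Retry 5xx responses with exponential backoff and full jitter.
    async function fetchWithBackoff(url: string, retries = 3, baseMs = 100): Promise<Response> {
      for (let attempt = 0; ; attempt++) {
        const response = await fetch(url)
        if (response.status < 500 || attempt >= retries) return response

        // Sleep a random duration of up to baseMs * 2^attempt before retrying.
        const delay = Math.random() * baseMs * 2 ** attempt
        await new Promise<void>((resolve) => setTimeout(resolve, delay))
      }
    }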

    Collectively, these optimizations achieved approximately a 99% cache hit rate. The cache hit rate represents the percentage of HTTP requests served from Cloudflare’s fast cache memory ("hit") compared to slower requests to data sources in the control plane ("miss"), calculated as (hits / (hits + misses)). A high rate signifies improved HTTP request performance and reduced costs, as querying data from the Worker's cache is significantly faster than fetching from an origin server in another region. After fine-tuning in-memory and CDN cache settings, hit rates substantially increased. However, a 100% hit rate is unattainable due to the real-time nature of much of the workload, requiring fresh data requests at least once per minute.

    While the fetching layer improvements have been discussed, the acceleration of origin HTTP requests also played a crucial role. The maintenance coordinator must respond in real-time to network degradation and machine failures in data centers. Cloudflare's distributed Prometheus query engine, Thanos, is utilized to deliver high-performance metrics from the edge to the coordinator.

    Thanos in Real-Time

    To illustrate the impact of the graph processing interface on real-time queries, consider an example. Analyzing the health of edge routers might initially involve the following query:

    sum by (instance) (network_snmp_interface_admin_status{instance=~"edge.*"})

    Initially, the Thanos service, responsible for storing Prometheus metrics, was queried for a list of each edge router’s current health status, and the routers relevant to a given maintenance were then filtered within the Worker. This approach proved suboptimal for several reasons. Thanos returned multi-megabyte responses that had to be decoded and re-encoded, and the Worker had to cache and parse these large HTTP responses only to discard most of the data when processing a specific maintenance request. Because the Worker's JavaScript runtime is single-threaded and JSON parsing is CPU-bound, issuing two large HTTP requests meant one was blocked waiting for the other's response to finish parsing.

    Instead, the graph is used to identify targeted relationships, such as the interface links between edge and spine routers, designated as EDGE_ROUTER_NETWORK_CONNECTS_TO_SPINE.

    sum by (lldp_name) (network_snmp_interface_admin_status{instance=~"edge01.fra03", lldp_name=~"spine.*"})

    This approach yields responses averaging 1 KB, a reduction of approximately 1000x compared to multi-megabyte responses. It also significantly reduces the CPU load within the Worker by offloading most deserialization to Thanos. As previously noted, this necessitates a greater number of smaller fetch requests, but load balancers positioned in front of Thanos can distribute these requests evenly, enhancing throughput for this specific use case.
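
    For illustration, a query like the one above could be assembled from the graph's edge-to-spine associations; the client shape and helper below are assumptions, not the coordinator's actual code:

    // Minimal shape assumed for a network graph client (mirrors the TAO-style interface above).
    interface NetworkAssoc { id2: string }
    interface NetworkGraphClient {
      assoc_get(id1: string, atype: string): AsyncIterable<NetworkAssoc>
    }

    // Build a Thanos query for one edge router, scoped to its spine neighbors.
    async function spineStatusQuery(net: NetworkGraphClient, edgeRouter: string): Promise<string> {
      const spines: string[] = []
      for await (const assoc of net.assoc_get(edgeRouter, "EDGE_ROUTER_NETWORK_CONNECTS_TO_SPINE")) {
        spines.push(assoc.id2)
      }
      // Only match interfaces facing this router's spine neighbors.
      return `sum by (lldp_name) (network_snmp_interface_admin_status{instance=~"${edgeRouter}", lldp_name=~"${spines.join("|")}"})`
    }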

    The graph implementation and fetch pipeline effectively managed the 'thundering herd' of numerous small real-time requests. However, historical analysis introduced a distinct I/O challenge. Rather than retrieving small, specific relationships, the requirement was to scan months of data to identify conflicting maintenance windows. Previously, Thanos generated a large volume of random reads to the R2 object store. To mitigate this substantial bandwidth penalty while maintaining performance, a new approach developed internally by the Observability team was adopted this year.

    Historical Data Analysis

    Given the numerous maintenance scenarios, reliance on historical data is crucial to assess the solution's accuracy and scalability with Cloudflare’s network expansion. The goal is to prevent incidents while avoiding unnecessary delays for proposed physical maintenance. To balance these objectives, time series data from maintenance events occurring months or even a year prior can indicate the frequency with which a maintenance event violates constraints, such as edge router availability or Aegis. Previous discussions detailed using Thanos for automatic software release and reversion at the edge.

    Thanos typically distributes queries to Prometheus; however, when Prometheus' retention is insufficient, data must be retrieved from object storage, specifically R2. Prometheus TSDB blocks, originally optimized for local SSDs, utilize random access patterns that become a bottleneck when transferred to object storage. When the scheduler analyzes months of historical maintenance data to pinpoint conflicting constraints, random reads from object storage result in a significant I/O penalty. To overcome this, a conversion layer was implemented to transform these blocks into Apache Parquet files. Parquet, a columnar format designed for big data analytics, organizes data by column rather than row. This, combined with rich statistics, enables fetching only the necessary data.

    Moreover, by rewriting TSDB blocks into Parquet files, data can be stored in a format that facilitates reading it in a few large, sequential chunks.

    sum by (instance) (hmd:release_scopes:enabled{dc_id="45"})

    In the preceding example, the tuple “(__name__, dc_id)” would be selected as a primary sorting key. This ensures that metrics with the name “hmd:release_scopes:enabled” and identical “dc_id” values are sorted adjacently.

    The Parquet gateway now issues precise R2 range requests, fetching only the specific columns pertinent to the query. This reduces payload sizes from megabytes to kilobytes. Additionally, as these file segments are immutable, they can be aggressively cached on the Cloudflare CDN.
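
    As a sketch of what such a range read looks like with the Workers R2 binding (the bucket binding, key, and offsets are placeholders):

    // Fetch a single column chunk from a Parquet file in R2 with a byte-range read.
    // METRICS_BUCKET is a hypothetical R2 bucket binding; offset and length would come
    // from the Parquet footer metadata for the column being queried.
    async function readColumnChunk(env: { METRICS_BUCKET: R2Bucket }, key: string, offset: number, length: number): Promise<ArrayBuffer | null> {
      const object = await env.METRICS_BUCKET.get(key, { range: { offset, length } })
      return object ? await object.arrayBuffer() : null
    }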

    This transformation converts R2 into a low-latency query engine, enabling instant backtesting of complex maintenance scenarios against long-term trends. This avoids the timeouts and high tail latency experienced with the original TSDB format. A recent load test demonstrated Parquet achieving up to 15x the P90 performance compared to the previous system for identical query patterns.

    For a more in-depth look at the Parquet implementation, see the talk given at PromCon EU 2025.

    Building for Scale

    By utilizing Cloudflare Workers, the system evolved from one prone to out-of-memory errors to an intelligent data caching solution that employs efficient observability tools for real-time analysis of product and infrastructure data. This maintenance scheduler effectively balances network expansion with product performance.

    However, maintaining this balance is an ongoing challenge.

    Daily hardware additions globally, coupled with an increasing number of products and maintenance operation types, exponentially complicate the logic needed to maintain the network without disrupting customer traffic. Initial challenges have been addressed, but more subtle and complex issues, unique to this massive scale, are now emerging.

    Cloudflare is looking for engineers who are adept at solving complex problems; consider joining the Infrastructure team to contribute to these efforts.
