
Cloudflare acts as its own “Customer Zero”, running on its own platform and using its own products to secure and optimize its services. A dedicated Customer Zero team within the security division leverages this position to provide continuous feedback to product and engineering, driving ongoing product improvement. This operation runs at global scale, where a single misconfiguration can rapidly propagate across the edge network with significant unintended consequences. The challenge lies in consistently securing hundreds of internal production Cloudflare accounts while minimizing human error.
While the Cloudflare dashboard offers excellent capabilities for observability and analytics, manually configuring settings across numerous accounts is prone to mistakes. To maintain security and operational integrity, configurations are no longer treated as manual tasks but rather as code. This involves adopting “shift left” principles, integrating security checks into the earliest phases of development. This strategic shift was essential for preventing errors before they could cause incidents and necessitated a fundamental change in governance architecture.
Understanding Shift Left Principles
The concept of “shifting left” involves integrating validation steps earlier into the software development lifecycle (SDLC). This means incorporating testing, security audits, and policy compliance checks directly within the continuous integration and continuous deployment (CI/CD) pipeline. Identifying issues or misconfigurations during the merge request stage significantly reduces remediation costs compared to discovering them post-deployment.
Applying shift left principles within Cloudflare emphasizes four core tenets:
- Consistency: Configurations should be easily replicable and reusable across various accounts (a minimal module sketch follows this list).
- Scalability: Significant changes must be deployable swiftly across numerous accounts.
- Observability: Configurations need to be auditable by any authorized individual to verify their current state, accuracy, and security posture.
- Governance: Proactive guardrails are essential, enforced prior to deployment to prevent incidents.
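To make the consistency and scalability tenets concrete, the sketch below shows how a single shared Terraform module could be stamped out across accounts. The module path, names, and account IDs are hypothetical placeholders, not the actual internal module used by the Customer Zero team.

# Hypothetical example: one reusable baseline module applied to two accounts.
module "baseline_account_alpha" {
  source     = "./modules/account-baseline"   # hypothetical internal module
  account_id = "1xxxx"                        # placeholder account ID
}

module "baseline_account_beta" {
  source     = "./modules/account-baseline"
  account_id = "2xxxx"
}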
Implementing a Production IaC Operating Model
To facilitate this approach, all production accounts transitioned to management via Infrastructure as Code (IaC). Each modification is meticulously tracked, linked to a specific user, commit, and an internal ticket. While the dashboard remains valuable for analytics, all critical production changes are executed through code.
This model guarantees that every change undergoes peer review, and security policies, established by the security team, are implemented directly by the respective engineering teams responsible for the configurations.
The foundation of this architecture relies on two primary technologies: Terraform and a bespoke CI/CD pipeline.
The Enterprise IaC Stack
Terraform was selected due to its robust open-source ecosystem, extensive community support, and seamless integration with Policy as Code tools. Internally utilizing the Cloudflare Terraform Provider also enables the team to “dogfood” the product, enhancing the experience for external customers.
To handle hundreds of accounts and approximately 30 merge requests daily, the CI/CD pipeline operates on Atlantis, integrated with GitLab. A custom Go program, tfstate-butler, functions as a broker for secure state file storage.
tfstate-butler serves as an HTTP backend for Terraform, designed with security as its paramount concern. It ensures unique encryption keys for each state file, thereby minimizing the impact of any potential security breach.
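Terraform's built-in HTTP backend is the natural way to point a configuration at a broker like this. The block below is only a sketch: the endpoints and paths are hypothetical placeholders, not tfstate-butler's actual URL scheme.

terraform {
  backend "http" {
    # Hypothetical state-broker endpoints; each account gets its own state path.
    address        = "https://tfstate-butler.internal/states/prod-account-alpha"
    lock_address   = "https://tfstate-butler.internal/states/prod-account-alpha/lock"
    lock_method    = "POST"
    unlock_address = "https://tfstate-butler.internal/states/prod-account-alpha/lock"
    unlock_method  = "DELETE"
  }
}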
All internal account configurations reside within a centralized monorepo. Individual teams are responsible for their specific configurations and act as code owners for their respective sections of this repository, fostering clear accountability. Further details on this configuration can be found in How Cloudflare uses Terraform to manage Cloudflare.
[Figure: Infrastructure as Code data flow diagram]
Baselines and Policy as Code
The success of the shift-left strategy relies on establishing a robust security baseline for all internal production Cloudflare accounts. This baseline comprises security policies defined as code (Policy as Code). It represents a mandatory security configuration enforced across the platform, covering aspects like maximum session length, required logging, and specific WAF configurations.
This framework transitions policy enforcement from manual audits to automated gates. The Open Policy Agent (OPA) framework and its policy language, Rego, are utilized through the Atlantis Conftest Policy Checking feature to achieve this.
Defining Policies as Code
Rego policies articulate the precise security requirements that form the baseline for all Cloudflare provider resources. Approximately 50 such policies are currently maintained.
An example Rego policy, shown below, validates that only @cloudflare.com email addresses are permissible within an access policy:
# Conftest evaluates the "main" package by default; "tfplan" aliases the
# Terraform plan JSON passed as input.
package main

import rego.v1
import input as tfplan

# validate no use of non-cloudflare email in "include" blocks
warn contains reason if {
  r := tfplan.resource_changes[_]
  r.mode == "managed"
  r.type == "cloudflare_access_policy"
  include := r.change.after.include[_]
  email_address := include.email[_]
  not endswith(email_address, "@cloudflare.com")
  reason := sprintf("%-40s :: only @cloudflare.com emails are allowed", [r.address])
}

# the same check applied to "require" blocks
warn contains reason if {
  r := tfplan.resource_changes[_]
  r.mode == "managed"
  r.type == "cloudflare_access_policy"
  require := r.change.after.require[_]
  email_address := require.email[_]
  not endswith(email_address, "@cloudflare.com")
  reason := sprintf("%-40s :: only @cloudflare.com emails are allowed", [r.address])
}
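For reference, a resource that satisfies this check might look like the following sketch. It assumes the v4-style cloudflare_access_policy schema with include blocks; all names and IDs are placeholders.

resource "cloudflare_access_policy" "allow_cloudflare_staff" {
  account_id     = "1xxxx"      # placeholder account ID
  application_id = "xxxxxxxx"   # placeholder Access application ID
  name           = "Allow Cloudflare staff"
  decision       = "allow"
  precedence     = 1

  include {
    # Passes the warn rules above: the address ends with @cloudflare.com.
    email = ["jane.doe@cloudflare.com"]
  }
}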
Enforcing the Baseline
A policy check is executed on every merge request (MR) to confirm configuration compliance prior to deployment. The results of this check are displayed directly within the GitLab MR comment thread.
Policy enforcement functions in two distinct modes:
- Warning: A comment is added to the MR, but the merge operation is permitted.
- Deny: The deployment is blocked entirely.
Should the policy check identify that a configuration in the MR deviates from the established baseline, the output will specify the non-compliant resources.
The following example illustrates an output from a policy check, highlighting three discrepancies within a merge request:
WARN - cloudflare_zero_trust_access_application.app_saas_xxx :: "session_duration" must be less than or equal to 10h
WARN - cloudflare_zero_trust_access_application.app_saas_xxx_pay_per_crawl :: "session_duration" must be less than or equal to 10h
WARN - cloudflare_zero_trust_access_application.app_saas_ms :: you must have at least one require statement of auth_method = "swk"
41 tests, 38 passed, 3 warnings, 0 failures, 0 exception
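As a hedged illustration, clearing the first two warnings could be as simple as bringing session_duration back under the baseline maximum. The resource below is a simplified placeholder, not the actual application flagged in the output above.

resource "cloudflare_zero_trust_access_application" "app_internal_tool" {
  account_id       = "1xxxx"            # placeholder account ID
  name             = "Internal tool"
  domain           = "tool.example.com"
  type             = "self_hosted"
  session_duration = "8h"               # within the 10h baseline maximum
}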
Handling Policy Exceptions
While exceptions are sometimes necessary, they are managed with the same strictness as the policies themselves. When a team needs an exception, a request is submitted through Jira.
Upon approval by the Customer Zero team, the exception is formalized by submitting a merge request against the exceptions.rego file in the central policy repository. Exceptions can be granted at several granular levels:
- Account: Exclude a specific account from a particular policy.
- Resource Category: Exclude all resources of a certain type within an account from a policy.
- Specific Resource: Exclude an individual resource within an account from a policy.
The example below demonstrates a session length exception for five distinct applications across two separate Cloudflare accounts:
{
  "exception_type": "session_length",
  "exceptions": [
    {
      "account_id": "1xxxx",
      "tf_addresses": [
        "cloudflare_access_application.app_identity_access_denied",
        "cloudflare_access_application.enforcing_ext_auth_worker_bypass",
        "cloudflare_access_application.enforcing_ext_auth_worker_bypass_dev"
      ]
    },
    {
      "account_id": "2xxxx",
      "tf_addresses": [
        "cloudflare_access_application.extra_wildcard_application",
        "cloudflare_access_application.wildcard"
      ]
    }
  ]
}
Challenges and Lessons Learned
The implementation journey encountered several obstacles. Years of "clickops" – manual changes made directly in the dashboard – were prevalent across hundreds of accounts. Integrating this existing, often chaotic, state into a rigorous Infrastructure as Code system proved challenging, akin to performing maintenance on a live system. Resource importation remains an ongoing effort.
Limitations within the tools themselves were also discovered, particularly edge cases in the Cloudflare Terraform provider that emerged only when managing infrastructure at this extensive scale. These experiences provided valuable insights into the importance of "dogfooding" – using one's own products – to develop superior solutions.
These challenges illuminated the complexities involved, resulting in three significant lessons learned.
Lesson 1: High Barriers to Entry Hinder Adoption
A primary challenge in any large-scale IaC deployment is onboarding existing, manually configured resources. Teams were offered two choices: manually creating Terraform resources and import blocks, or utilizing cf-terraforming.
It quickly became apparent that Terraform proficiency varied among teams, and the manual import process for existing resources presented a steeper learning curve than initially expected.
Fortunately, the cf-terraforming command-line utility proved invaluable. It leverages the Cloudflare API to automatically generate the required Terraform code and import statements, substantially expediting the migration. Additionally, an internal community was established, allowing experienced engineers to assist teams with provider intricacies and complex imports.
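As a rough sketch of that onboarding flow, assuming Terraform 1.5+ import blocks and placeholder IDs, adopting an existing dashboard-created resource looks roughly like this (cf-terraforming can generate the equivalent resource definition and import ID automatically):

# Hypothetical example: adopt an Access application that already exists in the dashboard.
import {
  to = cloudflare_access_application.legacy_wiki
  id = "accounts/1xxxx/yyyyyyyy"   # placeholder import ID; see the provider docs for the exact format
}

resource "cloudflare_access_application" "legacy_wiki" {
  account_id       = "1xxxx"        # placeholder account ID
  name             = "Legacy wiki"
  domain           = "wiki.example.com"
  type             = "self_hosted"
  session_duration = "8h"
}

Once the import is applied, the resource is tracked in state, and every subsequent change flows through the same review and policy checks as newly created infrastructure.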
Lesson 2: Configuration Drift is Inevitable
Addressing configuration drift was another critical task. Drift occurs when the IaC process is bypassed for urgent modifications, such as direct edits in the dashboard during an incident. While quicker in the short term, this practice desynchronizes the Terraform state from the actual deployed infrastructure.
A custom drift detection service was implemented to continuously compare the Terraform-defined state with the live deployed state via the Cloudflare API. Upon detecting drift, an automated system generates an internal ticket, assigning it to the responsible team with specific Service Level Agreements (SLAs) for remediation.
Lesson 3: Automation is Crucial
Cloudflare's rapid innovation leads to a constantly expanding suite of products and APIs. This pace unfortunately meant that the Terraform provider often lagged in feature parity with the core product.
This challenge was resolved with the introduction of the v5 provider, which is automatically generated from the OpenAPI specification. While the transition involved refining the code generation process, this automated approach keeps the API and the Terraform provider in sync, minimizing capability drift.
The Core Lesson: Proactive Over Reactive
By centralizing security baselines, enforcing peer reviews, and applying policies before changes reach production, the potential for configuration errors, accidental deletions, or policy violations is significantly reduced. This architectural approach not only prevents manual mistakes but also enhances engineering velocity, as teams can confidently deploy changes knowing they are compliant.
The primary takeaway from the Customer Zero initiative is clear: while the Cloudflare dashboard is excellent for daily operations, achieving enterprise-level scale and consistent governance necessitates a different methodology. Treating Cloudflare configurations as living code enables secure and confident scaling.

