
    Building Prometheus: Backend Aggregation for Gigawatt-Scale AI Clusters

    By Samuel Alejandro | February 14, 2026 | 5 min read

    POSTED ON FEBRUARY 9, 2026 TO Data Center Engineering

    Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters


    • This post details the role of backend aggregation (BAG) in building gigawatt-scale AI clusters such as Prometheus.
    • BAG connects thousands of GPUs across multiple data centers and regions.
    • The BAG implementation connects L2 networks built on two distinct fabrics: Disaggregated Scheduled Fabric (DSF) and Non-Scheduled Fabric (NSF).

    Upon completion, the AI cluster known as Prometheus is projected to provide 1 gigawatt of capacity, which will power new and existing AI experiences across a range of products. Prometheus's infrastructure is designed to span multiple data center buildings within a single large region, interconnecting tens of thousands of GPUs.

    Backend aggregation (BAG) is a crucial component for scaling and connecting this infrastructure. It links GPUs and data centers through high-capacity networking. By combining modular hardware, advanced routing, and resilient topologies, BAG is designed to deliver both performance and reliability at unprecedented scale.

    As AI clusters continue to expand, BAG is anticipated to play a significant role in addressing future demands and fostering innovation across the global network.

    What Is Backend Aggregation?

    BAG is a centralized, Ethernet-based super-spine network layer. Its primary function is to interconnect multiple spine-layer fabrics across data centers and regions within large clusters. In Prometheus, for instance, the BAG layer acts as the aggregation point between regional networks and the backbone, enabling the formation of mega AI clusters. BAG is engineered for very high bandwidth, with inter-BAG capacity reaching the petabit range (e.g., 16-48 Pbps per region pair).

    Figure: Backend aggregation (BAG) is employed to interconnect data center regions, allowing for the sharing of compute and other resources within large clusters.
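    To put the petabit figures in perspective, the sketch below converts an inter-BAG capacity target into an equivalent number of 800G links. It assumes every link runs at 800 Gbps (the port speed cited for the J3 line cards later in the post) and ignores MACsec overhead, spare links, and oversubscription; the function name is hypothetical.

```python
# Rough sizing sketch: how many 800G links does an inter-BAG capacity target
# correspond to? Assumes uniform 800 Gbps links and no protocol overhead.

LINK_GBPS = 800  # assumed per-link speed


def links_for_capacity(capacity_pbps: float, link_gbps: int = LINK_GBPS) -> int:
    """Return the number of links needed to reach capacity_pbps, rounded up."""
    capacity_gbps = int(round(capacity_pbps * 1_000_000))  # 1 Pbps = 1,000,000 Gbps
    return -(-capacity_gbps // link_gbps)  # integer ceiling division


if __name__ == "__main__":
    for pbps in (16, 48):
        print(f"{pbps} Pbps ~ {links_for_capacity(pbps):,} x {LINK_GBPS}G links")
```

    Under these assumptions, a 16-48 Pbps region pair corresponds to roughly 20,000-60,000 links' worth of 800G capacity, before any redundancy is added.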

    How BAG Enables Gigawatt-Scale AI Clusters

    To address the challenge of interconnecting tens of thousands of GPUs, distributed BAG layers are being deployed regionally.

    Interconnecting BAG Layers

    BAG layers are strategically distributed across regions to serve subsets of L2 fabrics, subject to distance, buffer, and latency constraints. Inter-BAG connectivity uses either a planar (direct match) or a spread connection topology, with the choice determined by site size and fiber availability.

    • Planar topology connects BAG switches one-to-one between regions, plane by plane, which simplifies management but concentrates potential failure domains.
    • Spread connection topology distributes links across multiple BAG switches/planes, enhancing path diversity and resilience.
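    The failure-domain trade-off between the two topologies can be illustrated with a toy model. The sketch below assumes two regions with one BAG switch per plane and a hypothetical 4 planes with 8 links per switch; it is not the production port map, only a way to compare what survives when one region-B switch fails.

```python
from collections import Counter


def planar_links(planes: int, links_per_switch: int) -> Counter:
    """Planar: switch i in region A connects only to switch i in region B."""
    return Counter({(i, i): links_per_switch for i in range(planes)})


def spread_links(planes: int, links_per_switch: int) -> Counter:
    """Spread: each region-A switch splits its links evenly across all region-B switches."""
    if links_per_switch % planes:
        raise ValueError("links_per_switch must be a multiple of planes")
    per_peer = links_per_switch // planes
    return Counter({(a, b): per_peer for a in range(planes) for b in range(planes)})


def surviving_links(topology: Counter, planes: int, failed_b: int) -> list[int]:
    """Links each region-A switch keeps after region-B switch `failed_b` fails."""
    remaining = [0] * planes
    for (a, b), count in topology.items():
        if b != failed_b:
            remaining[a] += count
    return remaining


if __name__ == "__main__":
    PLANES, LINKS = 4, 8  # hypothetical sizes for illustration only
    for name, topo in (("planar", planar_links(PLANES, LINKS)),
                       ("spread", spread_links(PLANES, LINKS))):
        print(name, surviving_links(topo, PLANES, failed_b=0))
```

    Both topologies lose the same total capacity (8 of 32 links), but in the planar case region-A switch 0 loses every link to region B, while the spread case leaves each switch with 6 of its 8 links. That is the path-diversity benefit of spreading, paid for with more complex cabling and management.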

    Figure: An example of an inter-BAG network topology.

    Connecting a BAG Layer to L2 Fabrics

    With inter-BAG connectivity covered, the next step is connecting a BAG layer downstream to L2 fabrics.

    Two primary fabric technologies, Disaggregated Scheduled Fabric (DSF) and Non-Scheduled Fabric (NSF), are used to build L2 networks.

    The figure below shows DSF L2 zones across five data center buildings, each connected to the BAG layer through a dedicated backend edge pod in its building.

    Figure: A BAG inter-building connection for DSF fabric across five data centers.

    The figure below shows an NSF L2 connected to BAG planes. Each BAG plane connects to the corresponding Spine Training Switches (STSWs) in all spine planes, resulting in an effective oversubscription of 4.98:1.

    Figure: A BAG inter-building connection for NSF fabric.

    Careful management of oversubscription ratios helps balance scale and performance. Typical oversubscription from L2 to BAG is approximately 4.5:1, while BAG-to-BAG oversubscription varies with regional requirements and link capacity.
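    Oversubscription is the ratio of the bandwidth a layer can receive from below to the bandwidth it has toward the layer above. The sketch below computes that ratio and, inversely, the number of uplinks needed to stay at or below a target. The port counts in the example are invented for illustration and are not Prometheus's actual configuration; the function names are hypothetical.

```python
import math


def oversubscription(down_ports: int, down_gbps: int, up_ports: int, up_gbps: int) -> float:
    """Downstream-facing capacity divided by upstream-facing capacity."""
    up = up_ports * up_gbps
    if up == 0:
        raise ValueError("layer has no upstream capacity")
    return (down_ports * down_gbps) / up


def uplinks_for_target(down_gbps_total: int, up_gbps: int, target_ratio: float) -> int:
    """Smallest number of uplinks that keeps oversubscription at or below target_ratio."""
    return math.ceil(down_gbps_total / (target_ratio * up_gbps))


if __name__ == "__main__":
    # Hypothetical switch: 36 x 800G facing L2, 8 x 800G facing BAG
    ratio = oversubscription(36, 800, 8, 800)
    print(f"{ratio:.2f}:1")                        # 4.50:1
    print(uplinks_for_target(36 * 800, 800, 4.5))  # 8
```

    Lowering the target ratio (for example, to 4:1) raises the uplink count, which is the scale-versus-performance trade-off the paragraph above describes.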

    Hardware and Routing

    The BAG implementation uses a modular chassis equipped with Jericho3 (J3) ASIC line cards, providing up to 432x800G ports for high-capacity, scalable, and resilient interconnectivity. The central hub BAG uses a larger chassis to support many spokes and long-distance links, and accounts for varied cable lengths to optimize buffer utilization.

    Routing within BAG uses eBGP with the link bandwidth attribute, which enables Unequal-Cost Multipath (UCMP) for efficient load balancing and robust failure handling. BAG-to-BAG connections are encrypted with MACsec to meet network security requirements.
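    The core idea of UCMP driven by a link bandwidth attribute is that each next hop receives a share of traffic proportional to the bandwidth it advertises, rather than an equal share. The sketch below models this with a hash-bucket table; the next-hop names, bandwidth values, table size, and CRC32 flow hashing are illustrative assumptions. Real switches program UCMP groups in hardware, and BGP implementations differ in how they normalize weights.

```python
import zlib
from collections import Counter


def ucmp_table(next_hop_bw: dict[str, int], table_size: int = 64) -> list[str]:
    """Fill a hash table with next hops in proportion to advertised bandwidth.

    Uses largest-remainder rounding so the table has exactly table_size slots.
    """
    total = sum(next_hop_bw.values())
    if total <= 0:
        raise ValueError("need at least one next hop with positive bandwidth")
    shares = {nh: bw * table_size / total for nh, bw in next_hop_bw.items()}
    counts = {nh: int(share) for nh, share in shares.items()}
    leftover = table_size - sum(counts.values())
    # Give the remaining slots to the next hops with the largest fractional parts
    for nh in sorted(shares, key=lambda n: shares[n] - counts[n], reverse=True)[:leftover]:
        counts[nh] += 1
    table: list[str] = []
    for nh in sorted(counts):  # stable order for reproducibility
        table.extend([nh] * counts[nh])
    return table


def pick_next_hop(table: list[str], flow: str) -> str:
    """Map a flow identifier (e.g. a 5-tuple string) to a next hop via CRC32."""
    return table[zlib.crc32(flow.encode()) % len(table)]


if __name__ == "__main__":
    # Hypothetical: bag-b3 has lost links and now advertises 16 x 800G instead of 24 x 800G
    bw = {"bag-b1": 24 * 800, "bag-b2": 24 * 800, "bag-b3": 16 * 800}
    table = ucmp_table(bw)
    print(dict(Counter(table)))  # {'bag-b1': 24, 'bag-b2': 24, 'bag-b3': 16}
    print(pick_next_hop(table, "10.0.0.1,10.1.0.9,6,4791,4791"))
```

    With plain ECMP, bag-b3 would still receive a third of the flows despite its reduced capacity; weighting by advertised bandwidth shifts load away from it automatically, which is how UCMP helps with failure handling.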

    Designing the Network for Resilience

    The network design specifies port striping and IP addressing schemes and includes a comprehensive failure-domain analysis, which together help maintain high availability and limit the impact of failures. Strategies such as draining affected BAG planes and conditional route aggregation are used to mitigate the risk of traffic blackholing.
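    The post names draining and conditional route aggregation without giving their details, so the sketch below shows one plausible policy rather than the actual implementation: a BAG advertises an aggregate upstream only while every component prefix behind it is reachable, and a drained plane advertises nothing. Withdrawing the aggregate keeps peers from sending traffic this BAG could not deliver. The prefixes and the function name are hypothetical.

```python
import ipaddress


def prefixes_to_advertise(aggregate: str, components: list[str], learned: list[str],
                          drained: bool = False) -> list[str]:
    """Decide what this BAG advertises upstream for one aggregate.

    - A drained plane advertises nothing, so peers shift traffic to other planes.
    - Otherwise the aggregate is advertised only while every component prefix is
      reachable; if any component is missing, the aggregate is withdrawn.
    """
    if drained:
        return []
    agg = ipaddress.ip_network(aggregate)
    expected = [ipaddress.ip_network(p) for p in components]
    for net in expected:
        if not net.subnet_of(agg):
            raise ValueError(f"{net} is not inside {agg}")
    reachable = {ipaddress.ip_network(p) for p in learned}
    if all(net in reachable for net in expected):
        return [str(agg)]
    return []


if __name__ == "__main__":
    parts = ["10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]
    print(prefixes_to_advertise("10.1.0.0/22", parts, parts))                # ['10.1.0.0/22']
    print(prefixes_to_advertise("10.1.0.0/22", parts, parts[:3]))            # []
    print(prefixes_to_advertise("10.1.0.0/22", parts, parts, drained=True))  # []
```

    An all-or-nothing rule is the simplest choice; a real policy might instead advertise the reachable specifics so that only traffic for the missing component moves elsewhere.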

    Considerations for Long Cable Distances

    A significant advantage of BAG's distributed architecture is that it keeps BAG close to the L2 edge, which is crucial for shallow-buffer NSF switches. Longer BAG-to-BAG cable distances require deep-buffer switches in the BAG role. Deep buffers provide ample headroom to support lossless flow-control mechanisms such as Priority Flow Control (PFC).
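    Why cable length drives buffer depth can be seen from a rule-of-thumb PFC headroom estimate: after a receiver sends a pause, data already on the wire keeps arriving for roughly one round trip. The sketch below is a simplified model, not the sizing method from the post; it assumes about 5 ns/m propagation in fiber and 9,216-byte jumbo frames, and it ignores PFC processing and response delays that real headroom calculations add.

```python
import math

FIBER_NS_PER_M = 5.0  # assumed one-way propagation delay in fiber (about 5 ns per meter)


def pfc_headroom_bytes(rate_gbps: float, cable_m: float, mtu_bytes: int = 9216) -> int:
    """Simplified per-port, per-priority PFC headroom estimate.

    Counts data still arriving during one cable round trip after a pause is sent,
    plus two maximum-size frames that may already be mid-transmission.
    """
    rtt_ns = 2 * cable_m * FIBER_NS_PER_M
    in_flight_bytes = rate_gbps * rtt_ns / 8  # Gbps x ns = bits
    return math.ceil(in_flight_bytes + 2 * mtu_bytes)


if __name__ == "__main__":
    for meters in (100, 10_000):  # hypothetical in-building run vs long BAG-to-BAG run
        print(f"{meters:>6} m: {pfc_headroom_bytes(800, meters):,} bytes")
```

    At 800 Gbps, a 100 m run needs about 118 KB of headroom per port and priority, while a 10 km run needs about 10 MB, far beyond what shallow-buffer switches provide.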

    Building Prometheus and Beyond

    BAG plays an important role in the next generation of AI infrastructure. By centralizing the interconnection of regional networks, BAG enables the gigawatt-scale Prometheus cluster, providing high-capacity networking across tens of thousands of GPUs. Built on modular hardware and resilient topologies, BAG is positioned not only to meet the demands of Prometheus but also to support future growth and innovation across the global AI network.
