    PipelineRL

By Samuel Alejandro · January 28, 2026 · 7 min read


• Conventional RL vs PipelineRL
• PipelineRL works!
• PipelineRL architecture
  • Inference contract
  • Trainer contract
• What’s next for PipelineRL?
• Contributors and Acknowledgement
• Experimental Details

PipelineRL is an experimental Reinforcement Learning (RL) implementation designed to address a fundamental challenge in large-scale RL with Large Language Models (LLMs): the trade-off between inference throughput and on-policy data collection. PipelineRL’s key innovation is inflight weight updates during RL training (see Figure 1 below). This approach lets PipelineRL sustain consistently high inference throughput while minimizing the lag between the weights used for rollouts and the most recently updated model weights. The result is fast and stable RL training for large language models.

[Figure 1: Conventional RL (a) versus PipelineRL with inflight weight updates (b)]

                  This blog post demonstrates that 1) inflight weight updates do not harm the training process and 2) PipelineRL achieves competitive results compared to Open-Reasoner-Zero, even when using a simpler RL algorithm. It also presents the modular PipelineRL architecture, which facilitates experimenting with new inference and trainer combinations.

                  Conventional RL vs PipelineRL

To achieve high throughput, inference servers must use large batch sizes, which means generating data for multiple policy optimization steps at once. However, each optimization step increases the lag between the current policy and the inference policy that generated the data, making the collected data progressively more off-policy and less effective for training. Strictly on-policy learning requires collecting only enough data for a single optimization step, yet producing small amounts of data across many GPUs is inefficient because per-GPU batch sizes become tiny. In addition, the effective batch size shrinks toward the end of each generation round: as the inference server completes shorter sequences, only the few longest sequences remain in progress. The pseudocode below illustrates this conventional setup (Figure 1a).

# Conventional RL loop (Figure 1a): collect rollouts for several gradient
# steps with a frozen inference policy, then train on them sequentially.
# sample_prompts, sample_rollouts and policy_update stand in for the actual
# data-loading, generation and optimization routines.
current_policy = initial_policy
opt_state = init_optimizer(current_policy)

while True:
    # RL step starts
    # inference: the inference policy is frozen for the whole step
    inference_policy = current_policy
    list_of_prompts = [sample_prompts(training_batch_size)
                       for _ in range(num_grad_steps)]
    list_of_rollouts = [sample_rollouts(prompts, inference_policy)
                        for prompts in list_of_prompts]
    # training: each update increases the lag between the policies
    lag = 0  # lag between the inference and current policies
    for rollouts in list_of_rollouts:
        current_policy, opt_state = policy_update(current_policy, opt_state, rollouts)
        lag += 1
    # RL step ends

PipelineRL (Figure 1b) addresses this trade-off with inflight weight updates. The weights on the inference servers are updated after each optimizer step without discarding in-progress generations: inference is paused only for the short time required to receive the new weights and then resumes. Inflight weight updates let the inference servers consistently maintain the optimal batch size while keeping the data on-policy or near on-policy, resulting in better GPU utilization and more effective learning.
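
For contrast, here is a minimal sketch of a PipelineRL-style loop in the same pseudocode style as above. It is illustrative only: generation_stream, broadcast_weights and policy_update are hypothetical placeholders, not functions from the pipeline-rl codebase.

# PipelineRL loop (Figure 1b): inference never drains between optimizer steps;
# after every update the new weights are broadcast to the inference servers.
current_policy = initial_policy
opt_state = init_optimizer(current_policy)
stream = generation_stream(inference_servers)  # continuously yields finished rollouts

while True:
    # take just enough freshly generated data for one optimizer step
    rollouts = stream.take(training_batch_size)
    current_policy, opt_state = policy_update(current_policy, opt_state, rollouts)
    # inflight weight update: servers pause only while receiving the new weights
    broadcast_weights(current_policy, inference_servers)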

                  PipelineRL works!

[Figure 2: Learning curves on AIME 2024 and MATH 500 for the 7B and 32B models]

To demonstrate PipelineRL’s effectiveness and the benefits of inflight weight updates, a 7B model and a 32B model were trained on the Open-Reasoner-Zero dataset. The learning curves show that PipelineRL matches or exceeds Open-Reasoner-Zero’s performance on the popular reasoning benchmarks AIME 2024 and MATH 500 (see Figure 2 above).

Notably, the RL implementation is much simpler than Open-Reasoner-Zero’s. While Open-Reasoner-Zero uses a value function, this implementation is a simplified version of GRPO. Specifically, it was observed that trust-region importance-weight clamping is not required for stable training. Neither overlong-sequence filtering nor the reward shaping from the DAPO paper was necessary either. To normalize the loss, the number of sequences in the batch is used as the denominator, which gives every token equal weight. No KL penalty or entropy bonus was applied (though the implementation does support a reference-model KL). Despite the simplicity of this implementation, or perhaps because of it, training demonstrates high stability, as detailed in this wandb report.
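
To illustrate that normalization choice, here is a minimal sketch of a token-level policy-gradient loss that sums per-token terms and divides by the number of sequences in the batch. The tensor names and shapes are assumptions for illustration; this is not the loss code from the pipeline-rl repository.

import torch

def simplified_grpo_loss(logprobs, advantages, mask):
    # logprobs:   (num_sequences, max_len) log-probabilities of the sampled tokens
    # advantages: (num_sequences, max_len) group-normalized advantages, broadcast per token
    # mask:       (num_sequences, max_len) 1 for generated tokens, 0 for prompt/padding
    num_sequences = logprobs.shape[0]
    per_token = -(advantages * logprobs) * mask  # REINFORCE-style term per token
    # Dividing by the number of sequences (rather than by each sequence's own length)
    # gives every token in the batch the same weight.
    return per_token.sum() / num_sequences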

                  One might expect that inflight weight updates could lead to an unstable training process, given that sequence generation proceeds with stale keys and values in the KV cache computed with a previous model version. However, experiments indicate this does not adversely affect stability.

                  PipelineRL architecture

[Figure: The modular PipelineRL architecture]

PipelineRL is designed for modularity, enabling it to leverage rapid advancements in highly specialized inference and training software (SGLang, vLLM, Nvidia Dynamo, DeepSpeed, FSDP, TorchTitan, FastLLM, etc.). Clear contracts are proposed between the inference and training components, making it easy to integrate new inference and training solutions as they become available.

                  Inference contract

                  The inference software must expose the following APIs to PipelineRL[1]:

                  1. Process group initialization: At start-up time, Trainer 0 (the designated coordinator) sends an HTTP POST /init_process_group request to all inference servers. This request initializes the process group that will be used for sending the weight updates.
                  2. Weight Update Trigger: Once the trainers complete a learning step (optimizer step and weight gathering), Trainer 0 submits an HTTP POST /request_weight_update request to the inference endpoint. The request contains the details on the order and shapes of the weights that the main trainer process is about to transfer via NCCL. The inference servers must pause the inference and receive the weight broadcast.
                  3. Chat completion: The actor process interacts with the actor LLMs using HTTP POST /v1/chat/completion requests.

                  If init_process_group and request_weight_update APIs become industry standards, it will be possible to plug-and-play different inference implementations with PipelineRL.
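
To make the inference contract concrete, the sketch below shows how a coordinator such as Trainer 0 might call these endpoints from Python. The payload fields, INFERENCE_URLS and weight_specs are assumptions for illustration, not the actual request schema.

import requests

INFERENCE_URLS = ["http://inference-0:8000", "http://inference-1:8000"]  # hypothetical servers

def init_process_groups(master_addr, master_port, world_size):
    # Assumed payload: enough information for each server to join the weight-update process group.
    for rank, url in enumerate(INFERENCE_URLS, start=1):
        response = requests.post(f"{url}/init_process_group", json={
            "master_addr": master_addr,
            "master_port": master_port,
            "world_size": world_size,
            "rank": rank,
        })
        response.raise_for_status()

def request_weight_update(weight_specs):
    # weight_specs: assumed list of {"name", "shape", "dtype"} entries describing the
    # order in which the trainer will broadcast the updated tensors via NCCL.
    for url in INFERENCE_URLS:
        requests.post(f"{url}/request_weight_update", json={"weights": weight_specs}).raise_for_status()
    # Once every server acknowledges, the trainer broadcasts the tensors over NCCL.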

                  Trainer contract

PipelineRL training code feeds freshly generated training data to trainer workers once the appropriate number of training tokens has accumulated for each worker. Any training software exposing the following Python APIs can be made to work with PipelineRL:

• Worker initialization: load and shard the training weights and the optimizer state.
• Forward pass: produce token log-likelihoods given inputs.
• Backward step: compute and accumulate the gradient of the scalar that represents the chosen RL objective.
• Optimizer step: execute the optimizer step.
• Weight gathering and broadcasting: after an optimizer step, gather the updated model weights layer by layer in preparation for broadcasting them to the inference servers.

                  PipelineRL currently utilizes the HuggingFace accelerate library, offering users a choice between DeepSpeed and FSDP. However, the accelerate contract was found to be overly flexible and potentially confusing. A transition to the stricter contract described above is planned to simplify the use of other trainers.
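
A minimal Python sketch of what this stricter contract could look like is shown below. The class and method names are illustrative assumptions, not PipelineRL’s current API.

from abc import ABC, abstractmethod

class TrainerWorker(ABC):
    # Hypothetical trainer-side contract mirroring the bullet points above.

    @abstractmethod
    def initialize(self, model_path):
        """Load and shard the training weights and the optimizer state."""

    @abstractmethod
    def forward(self, batch):
        """Return token log-likelihoods for the given inputs."""

    @abstractmethod
    def backward(self, loss):
        """Compute and accumulate gradients of the chosen RL objective."""

    @abstractmethod
    def optimizer_step(self):
        """Apply the accumulated gradients with the optimizer."""

    @abstractmethod
    def gather_and_broadcast_weights(self):
        """Gather updated weights layer by layer and broadcast them to the inference servers."""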

                  What’s next for PipelineRL?

                  Upcoming features. The current implementation remains experimental and lacks some important functionality. Top priorities include incorporating coroutines for more precise inference batch size control, multi-modal support, and sequence parallel training. Contributions of additional inference server and trainer integrations would also be welcomed. The pipeline-rl repository will not, however, aim to be a framework supporting all possible algorithms and reward functions. The perspective is that pipeline-rl should serve as a hackable and fast reference implementation of GRPO with easily verifiable rewards. For those interested in a research project using PipelineRL, the repository can be forked for further development.

                  More research coming soon. Further analysis is required to understand how inflight weight updates influence training dynamics and to precisely measure the speed-ups provided by PipelineRL. Additionally, the similarities between PipelineRL and highly relevant prior work on asynchronous Reinforcement Learning for LLMs warrant discussion. For these insights and more, readers are encouraged to anticipate an upcoming research paper.

                  Contributors and Acknowledgement

                  Alexandre Piché developed the initial synchronous version of the RL code during work on TapeAgents. Dzmitry Bahdanau refactored the code for asynchronous and distributed operation, and implemented inflight weight updates. Rafael Pardinas implemented sequence packing. Ehsan Kamaloo assisted with experiment execution, and Xiaoyin Chen provided debugging support for the framework.

                  Prior RL for LLM implementations, including TRL, OpenRLHF, and veRL, are acknowledged for various techniques adopted. Artifacts from other open-source reasoning projects, such as Simple-RL, Deepscaler, DAPO, and OpenReasoner, were instrumental in stabilizing PipelineRL. Christopher Manning and Michael Noukhovitch are recognized for their thoughtful comments. Appreciation is extended to the broader ServiceNow Research team and ServiceNow CoreLLM teams.

                  [1] The current contract in the code is slightly different, but it is being refactored as described above.

                  Experimental Details

                  The same hyperparameters were used for both 7B and 32B experiments reported here:

                  • batch size 4096
                  • learning rate 1e-6
                  • max number of generated tokens 8192
                    • note that in OpenReasoner runs, generation of 16K tokens was allowed
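
For reference, these settings could be written as a small configuration block. The key names below are illustrative and do not correspond to the actual pipeline-rl config files.

# Hypothetical hyperparameter block mirroring the values listed above.
config = {
    "batch_size": 4096,
    "learning_rate": 1e-6,
    "max_generated_tokens": 8192,  # OpenReasoner runs allowed up to 16K
}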

                  The compute resources utilized for the reported experiments were:

                  • ~3.5 days on 2 nodes for the 7B model
                  • ~6 days on 4 nodes for the 32B model