- Conventional RL vs PipelineRL
- PipelineRL works!
- PipelineRL architecture
- What's next for PipelineRL?
- Contributors and Acknowledgement
- Experimental Details
PipelineRL is an experimental Reinforcement Learning (RL) implementation designed to address a fundamental challenge in large-scale RL with Large Language Models (LLMs): the trade-off between inference throughput and on-policy data collection. PipelineRL’s key innovation is inflight weight updates during RL training (see Figure 1 below). This approach lets PipelineRL sustain consistently high inference throughput while minimizing the lag between the weights used for rollouts and the most recently updated model weights. The result is fast and stable RL training for large language models.
This blog post demonstrates that 1) inflight weight updates do not harm the training process and 2) PipelineRL achieves competitive results compared to Open-Reasoner-Zero, even when using a simpler RL algorithm. It also presents the modular PipelineRL architecture, which facilitates experimenting with new inference and trainer combinations.
Conventional RL vs PipelineRL
To achieve high throughput, inference servers must run with large batch sizes, which in conventional RL (pseudocode below) means generating enough data for multiple policy optimization steps at once. However, each optimization step increases the lag between the current policy and the policy that collected the data, making the collected data progressively more off-policy and less effective for training. Strictly on-policy learning would require generating only enough data for a single optimization step, yet producing such small amounts of data across many GPUs is inefficient because the per-GPU batch sizes become tiny. The effective batch size also shrinks toward the end of each generation round, as the inference server finishes the shorter sequences and only the few longest ones remain in progress.
current_policy = initial_policy
opt_state = init_optimizer(current_policy)
while True:
    # RL step starts
    # inference: generate rollouts for num_grad_steps optimization steps at once
    inference_policy = current_policy
    list_of_prompts = [sample_prompts(training_batch_size)
                       for _ in range(num_grad_steps)]
    list_of_rollouts = [sample_rollouts(prompts, inference_policy)
                        for prompts in list_of_prompts]
    # training
    lag = 0  # lag between the inference and current policies
    for rollouts in list_of_rollouts:
        current_policy, opt_state = policy_update(current_policy, opt_state, rollouts)
        lag += 1  # each later batch in list_of_rollouts is more off-policy
    # RL step ends
PipelineRL (Figure 1b) resolves this trade-off with inflight weight updates: the weights on the inference servers are refreshed after every optimizer step without ever draining the in-flight generations. Generation is paused on all inference servers only for the short time needed to receive the new weights and then continues from where it left off. As a result, the inference servers consistently maintain their optimal batch size while the data stays on-policy or near on-policy, which means better GPU utilization and more effective learning.
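For contrast with the conventional loop above, here is a rough pseudocode sketch of the PipelineRL loop. The streaming helpers (start_inference_servers, next_batch, update_weights) are illustrative names rather than the actual API:

current_policy = initial_policy
opt_state = init_optimizer(current_policy)
# inference servers start generating rollouts continuously with the initial weights
stream = start_inference_servers(current_policy)
while True:
    # consume freshly generated rollouts as soon as one training batch has accumulated
    rollouts = stream.next_batch(training_batch_size)
    current_policy, opt_state = policy_update(current_policy, opt_state, rollouts)
    # inflight weight update: generation pauses only to receive the weight broadcast,
    # then resumes with the new weights, so the lag stays at roughly one step
    stream.update_weights(current_policy)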
PipelineRL works!
To demonstrate PipelineRL’s effectiveness and the benefits of inflight weight updates, a 7B model and a 32B model were trained on the Open-Reasoner-Zero dataset. Analysis of the learning curves shows that PipelineRL matches or exceeds Open-Reasoner-Zero’s performance on the popular reasoning benchmarks AIME 2024 and MATH 500 (see Figure 2 above).
Notably, the RL implementation is much simpler than Open-Reasoner-Zero. While Open-Reasoner-Zero uses a value function, this implementation is a simplified version of GRPO. Specifically, it was observed that trust region importance weight clamping is not required for stable training. Overlong sequence filtering or reward shaping from the DAPO paper were also unnecessary. For normalizing the loss, the number of sequences in the batch is used as the denominator, assigning equal weights to all tokens. No KL penalty or entropy bonus was applied (though the implementation does support reference model KL). Despite the simplicity of this implementation, or perhaps because of it, training demonstrates high stability, as detailed in this wandb report.
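To make the described objective concrete, here is a minimal PyTorch-style sketch consistent with the paragraph above. The function and tensor names are illustrative, and the use of plain token log-likelihoods (rather than an unclipped importance ratio) is an assumption, not a statement about the exact PipelineRL code:

import torch

def simplified_grpo_loss(token_logprobs, advantages, loss_mask, num_sequences):
    # token_logprobs: (batch, seq) log-likelihoods of the sampled tokens under the current policy
    # advantages:     (batch, seq) group-relative advantages, broadcast to every generated token
    # loss_mask:      (batch, seq) 1 for generated tokens, 0 for prompt and padding tokens
    # num_sequences:  number of sequences in the batch, used as the normalization constant
    per_token = -advantages * token_logprobs * loss_mask
    # dividing the batch sum by the number of sequences (not by per-sequence lengths)
    # gives every token the same weight; no clamping, KL penalty, or entropy bonus is applied
    return per_token.sum() / num_sequences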
One might expect that inflight weight updates could lead to an unstable training process, given that sequence generation proceeds with stale keys and values in the KV cache computed with a previous model version. However, experiments indicate this does not adversely affect stability.
PipelineRL architecture
PipelineRL is designed for modularity, enabling it to leverage rapid advancements in highly specialized inference and training software (SGLang, vLLM, Nvidia Dynamo, DeepSpeed, FSDP, TorchTitan, FastLLM, etc.). Clear contracts are proposed between the inference and training components, facilitating easy integration of new inference and training solutions as they become available.
Inference contract
The inference software must expose the following APIs to PipelineRL[1]:
- Process group initialization: At start-up time, Trainer 0 (the designated coordinator) sends an HTTP POST /init_process_group request to all inference servers. This request initializes the process group that will be used for sending the weight updates.
- Weight Update Trigger: Once the trainers complete a learning step (optimizer step and weight gathering), Trainer 0 submits an HTTP POST /request_weight_update request to the inference endpoint. The request contains the details on the order and shapes of the weights that the main trainer process is about to transfer via NCCL. The inference servers must pause the inference and receive the weight broadcast.
- Chat completion: The actor process interacts with the LLMs served by the inference servers using HTTP POST /v1/chat/completions requests.
If init_process_group and request_weight_update APIs become industry standards, it will be possible to plug-and-play different inference implementations with PipelineRL.
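As an illustration of this contract, a hypothetical Trainer-0-side sketch of the two coordination calls could look like the following; the server URLs and payload field names are assumptions, and the exact schema may differ from the actual code:

import requests

INFERENCE_URLS = ["http://inference-0:8000", "http://inference-1:8000"]  # hypothetical addresses

def init_process_groups(master_addr, master_port, world_size):
    # Trainer 0 asks every inference server to join the process group used for weight broadcasts
    for i, url in enumerate(INFERENCE_URLS):
        response = requests.post(f"{url}/init_process_group", json={
            "master_address": master_addr,  # assumed field names
            "master_port": master_port,
            "world_size": world_size,
            "rank": i + 1,                  # trainer 0 takes rank 0 (assumption)
        })
        response.raise_for_status()

def request_weight_update(weight_specs):
    # Announce the order, names, shapes, and dtypes of the tensors about to be broadcast;
    # each server pauses generation, receives the broadcast, then resumes.
    for url in INFERENCE_URLS:
        response = requests.post(f"{url}/request_weight_update", json={"weights": weight_specs})
        response.raise_for_status()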
Trainer contract
PipelineRL training code feeds freshly-generated training data to trainer workers once the appropriate number of training tokens has accumulated for each. Any training software exposing these Python APIs can be made to work with PipelineRL:
- Worker initialization: Load and shard the training weights and the optimizer state.
- Forward pass: Produce token log-likelihoods for the given inputs.
- Backward step: Compute and accumulate the gradient of the scalar representing the chosen RL objective.
- Optimizer step: Execute the optimizer step.
- Weight gathering and broadcasting: After an optimizer step, gather the updated model weights layer by layer in preparation for broadcasting them to the inference servers.
PipelineRL currently utilizes the HuggingFace accelerate library, offering users a choice between DeepSpeed and FSDP. However, the accelerate contract was found to be overly flexible and potentially confusing. A transition to the stricter contract described above is planned to simplify the use of other trainers.
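A minimal sketch of what the stricter trainer contract described above could look like as a Python interface; the method names and signatures below are illustrative, not the final API:

from typing import Iterator, Protocol

import torch

class TrainerWorker(Protocol):
    def init_worker(self, model_path: str) -> None:
        """Load and shard the training weights and the optimizer state."""

    def forward(self, batch: dict[str, torch.Tensor]) -> torch.Tensor:
        """Return token log-likelihoods for the given inputs."""

    def backward(self, loss: torch.Tensor) -> None:
        """Compute and accumulate the gradient of the chosen RL objective."""

    def optimizer_step(self) -> None:
        """Execute the optimizer step."""

    def gather_weights(self) -> Iterator[tuple[str, torch.Tensor]]:
        """Yield updated weights layer by layer for broadcasting to the inference servers."""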
What’s next for PipelineRL?
Upcoming features. The current implementation remains experimental and lacks some important functionality. Top priorities include incorporating coroutines for more precise inference batch size control, multi-modal support, and sequence parallel training. Contributions of additional inference server and trainer integrations would also be welcomed. The pipeline-rl repository will not, however, aim to be a framework supporting all possible algorithms and reward functions. The perspective is that pipeline-rl should serve as a hackable and fast reference implementation of GRPO with easily verifiable rewards. For those interested in a research project using PipelineRL, the repository can be forked for further development.
More research coming soon. Further analysis is required to understand how inflight weight updates influence training dynamics and to precisely measure the speed-ups provided by PipelineRL. Additionally, the similarities between PipelineRL and highly relevant prior work on asynchronous Reinforcement Learning for LLMs warrant discussion. For these insights and more, readers are encouraged to anticipate an upcoming research paper.
Contributors and Acknowledgement
Alexandre Piché developed the initial synchronous version of the RL code during work on TapeAgents. Dzmitry Bahdanau refactored the code for asynchronous and distributed operation, and implemented inflight weight updates. Rafael Pardinas implemented sequence packing. Ehsan Kamaloo assisted with experiment execution, and Xiaoyin Chen provided debugging support for the framework.
Prior RL for LLM implementations, including TRL, OpenRLHF, and veRL, are acknowledged for various techniques adopted. Artifacts from other open-source reasoning projects, such as Simple-RL, Deepscaler, DAPO, and OpenReasoner, were instrumental in stabilizing PipelineRL. Christopher Manning and Michael Noukhovitch are recognized for their thoughtful comments. Appreciation is extended to the broader ServiceNow Research team and ServiceNow CoreLLM teams.
[1] The current contract in the code is slightly different, but it is being refactored as described above.
Experimental Details
The same hyperparameters were used for both 7B and 32B experiments reported here:
- batch size: 4096
- learning rate: 1e-6
- maximum number of generated tokens: 8192 (note that the Open-Reasoner-Zero runs allowed generation of up to 16K tokens)
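For reference, these settings correspond to a configuration of roughly this shape (the key names are hypothetical, not the actual PipelineRL config schema):

# hypothetical hyperparameter block used for both the 7B and 32B runs
config = {
    "batch_size": 4096,
    "learning_rate": 1e-6,
    "max_generated_tokens": 8192,  # the Open-Reasoner-Zero runs allowed up to 16K
}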
The compute resources utilized for the reported experiments were:
- ~3.5 days on 2 nodes for the 7B model
- ~6 days on 4 nodes for the 32B model





