This document provides a comprehensive overview of ROLL (Reinforcement Learning Optimization for Large-Scale Learning), explaining its architecture, core components, and capabilities. ROLL is a distributed reinforcement learning framework designed for training and fine-tuning Large Language Models (LLMs) at scale using GPU clusters.
This page covers the high-level system architecture and component organization. For detailed setup instructions, see Getting Started. For specific pipeline implementations, see Training Pipelines. For distributed execution details, see Distributed Execution.
Sources: README.md1-228
ROLL is an efficient, production-grade RL library that leverages distributed GPU resources to enhance LLM capabilities in three primary areas: human preference alignment, complex reasoning, and multi-turn agentic interaction.
The framework is built on a multi-role distributed architecture that uses Ray for resource allocation and task scheduling. It integrates multiple state-of-the-art backends, including Megatron-Core and DeepSpeed for training and vLLM and SGLang for high-throughput inference.
Sources: README.md34-38
The architecture is organized into five distinct layers.
Sources: README.md34-152
ROLL provides five distinct pipeline implementations, each targeting specific training objectives:
| Pipeline | Primary Use Case | Key Features | Example Config |
|---|---|---|---|
| `RLVRPipeline` | Multi-domain RL training | Sample-level async rollout, dynamic sampling, domain-specific rewards | `examples/qwen2.5-7B-rlvr_megatron/rlvr_config.yaml` |
| `AgenticPipeline` | Multi-turn agentic RL | Environment-level async rollout, trajectory-wise/step-wise learning | `examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake.yaml` |
| `DistillPipeline` | Knowledge distillation | Teacher-student logits transfer, combined loss objectives | `examples/qwen2.5-7B-distill_megatron/distill_vl_megatron.yaml` |
| `DpoPipeline` | Direct preference optimization | Preference-based training | `examples/qwen2.5-3B-dpo_megatron/` |
| `SFTPipeline` | Supervised fine-tuning | Standard SFT with Megatron integration | `examples/qwen2.5-7B-sft_megatron/sft_config.yaml` |
Each pipeline inherits from a common base and implements specific rollout, reward, and training logic appropriate to its use case.
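To make the shared structure concrete, here is a minimal sketch of that common-base pattern. The class names (`BasePipeline`, `SFTStylePipeline`) and hook methods are illustrative assumptions, not ROLL's actual API; they only show the rollout → reward → train loop that each pipeline specializes.

```python
# Illustrative sketch of the shared pipeline pattern; not ROLL's actual classes.

class BasePipeline:
    """Common skeleton: rollout -> reward -> training update, driven by a config."""

    def __init__(self, config):
        self.config = config

    def run(self, num_steps: int):
        for step in range(num_steps):
            batch = self.rollout()              # generate responses / trajectories
            batch = self.compute_rewards(batch)  # score them
            metrics = self.train_step(batch)     # update the policy
            print(f"step={step} metrics={metrics}")

    # Concrete pipelines (RLVR-style, agentic-style, SFT-style, ...) override these hooks.
    def rollout(self):
        raise NotImplementedError

    def compute_rewards(self, batch):
        raise NotImplementedError

    def train_step(self, batch):
        raise NotImplementedError


class SFTStylePipeline(BasePipeline):
    """SFT needs no reward model: 'rollout' just yields supervised batches."""

    def rollout(self):
        return {"input_ids": [], "labels": []}   # placeholder data loading

    def compute_rewards(self, batch):
        return batch                              # no reward stage

    def train_step(self, batch):
        return {"loss": 0.0}                      # placeholder update


SFTStylePipeline(config={}).run(num_steps=1)
```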
Sources: README.md126-151
Workers are Ray remote actors that encapsulate model operations. Each worker type serves a specific role:
- `ActorWorker` - Manages the policy model for both training (gradient updates) and inference (rollout generation)
- `CriticWorker` - Computes value estimates for advantage calculation in actor-critic algorithms
- `RewardWorker` - Evaluates generated responses using reward models or rule-based metrics
- `EnvironmentWorker` - Manages environment instances for agentic interactions (games, dialogues, tools)
- `StudentWorker` / `TeacherWorker` - Handle distillation pipeline operations

Workers use the strategy pattern to abstract backend implementations, allowing runtime selection of Megatron, DeepSpeed, vLLM, or SGLang, as sketched below.
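The sketch below illustrates the worker-as-Ray-actor idea with a pluggable backend strategy. `ActorWorker` and `TrainStrategy` here are simplified stand-ins and do not reflect ROLL's real class signatures.

```python
# Illustrative sketch (not ROLL's actual API) of a worker implemented as a
# Ray remote actor that delegates to a runtime-selected backend strategy.
import ray

ray.init(ignore_reinit_error=True)


class TrainStrategy:
    """Backend abstraction: a real system would wrap Megatron, DeepSpeed, etc."""
    def train_step(self, batch):
        return {"loss": 0.0}


@ray.remote
class ActorWorker:
    """Holds the policy model; the strategy decides which backend does the work."""
    def __init__(self, strategy_name: str):
        # Runtime backend selection; in a real system each key would map to a
        # different backend wrapper class.
        registry = {"megatron": TrainStrategy, "deepspeed": TrainStrategy}
        self.strategy = registry[strategy_name]()

    def train_step(self, batch):
        return self.strategy.train_step(batch)


worker = ActorWorker.remote("megatron")
print(ray.get(worker.train_step.remote({"input_ids": []})))
```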
Sources: README.md138-142
The Cluster abstraction manages groups of workers as a single logical unit; each cluster type coordinates a specific set of workers.
This abstraction enables flexible deployment topologies: co-located (training and inference on same GPUs), disaggregated (separate GPU pools), or hybrid configurations.
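A minimal sketch of the "group of workers as one logical unit" idea follows. The `Cluster` and `Worker` classes below are hypothetical, not ROLL's implementation; they only show how a single call can fan out across the group.

```python
# Hypothetical sketch of a cluster that fans one call out to many Ray workers.
import ray


@ray.remote
class Worker:
    def __init__(self, rank: int):
        self.rank = rank

    def apply(self, method: str, *args):
        # Dispatch a named method on this worker; used by the cluster broadcast.
        return getattr(self, method)(*args)

    def ping(self):
        return f"worker {self.rank} ok"


class Cluster:
    """Treat a group of workers as one logical unit."""
    def __init__(self, num_workers: int):
        self.workers = [Worker.remote(rank) for rank in range(num_workers)]

    def broadcast(self, method: str, *args):
        # Issue the call on every worker and gather the results.
        return ray.get([w.apply.remote(method, *args) for w in self.workers])


if __name__ == "__main__":
    ray.init(ignore_reinit_error=True)
    print(Cluster(num_workers=2).broadcast("ping"))
```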
Sources: README.md138-142
ROLL provides over 20 RL algorithm implementations organized into four categories:
| Category | Algorithms | Characteristics |
|---|---|---|
| On-Policy | PPO, Lite PPO, TOPR | Critic-based, policy gradients with clipping |
| Off-Policy | GRPO, GSPO | Group-based optimization, no critic needed |
| Trajectory-Based | Reinforce++, RAFT++, StarPO | Episode-level rewards, trajectory rollouts |
| Step-Based | GiGPO | Per-step optimization for agentic tasks |
Algorithm selection is controlled through configuration parameters. The `advantage_estimator` field determines which advantage computation method is used, while `reward_normalization` controls how rewards are normalized across batches or groups.
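A small sketch of what such config-driven selection can look like: the keys `advantage_estimator` and `reward_normalization` come from the text above, while the estimator names and implementations below are simplified placeholders, not ROLL's code.

```python
# Sketch of config-driven advantage-estimator selection (placeholder implementations).

def gae_advantage(rewards, values):
    # Simplified critic-based estimate: reward minus value baseline.
    return [r - v for r, v in zip(rewards, values)]

def group_mean_advantage(rewards, values=None):
    # Simplified group-baseline estimate (GRPO-style): reward minus group mean.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

ADVANTAGE_ESTIMATORS = {
    "gae": gae_advantage,
    "group_mean": group_mean_advantage,
}

def build_advantage_fn(config: dict):
    name = config["advantage_estimator"]
    if name not in ADVANTAGE_ESTIMATORS:
        raise ValueError(f"unknown advantage_estimator: {name!r}")
    return ADVANTAGE_ESTIMATORS[name]

adv_fn = build_advantage_fn({"advantage_estimator": "group_mean", "reward_normalization": "group"})
print(adv_fn([1.0, 0.0, 1.0]))
```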
For detailed algorithm documentation, see RL Algorithms.
Sources: README.md94-104 README.md135-137
ROLL integrates multiple backend systems to support different hardware configurations and performance requirements:
- Megatron-Core (Megatron-LM Strategy): `MegatronTrainStrategy`
- DeepSpeed (DeepSpeed Strategy): `DeepSpeedTrainStrategy`
- FSDP: under development
- vLLM (vLLM Integration): `VllmStrategy`
- SGLang (SGLang Integration): `SgLangStrategy`

Sources: README.md138-144
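The sketch below ties the strategy class names listed above to a hypothetical configuration-driven registry. The registry keys and the `create_strategy` helper are assumptions for illustration, not ROLL's actual lookup mechanism.

```python
# Illustrative registry mapping configuration values to backend strategy classes.

class MegatronTrainStrategy: ...
class DeepSpeedTrainStrategy: ...
class VllmStrategy: ...
class SgLangStrategy: ...

STRATEGY_REGISTRY = {
    "megatron_train": MegatronTrainStrategy,
    "deepspeed_train": DeepSpeedTrainStrategy,
    "vllm": VllmStrategy,
    "sglang": SgLangStrategy,
}

def create_strategy(name: str):
    try:
        return STRATEGY_REGISTRY[name]()
    except KeyError as err:
        raise ValueError(f"unsupported strategy: {name!r}") from err

print(type(create_strategy("vllm")).__name__)
```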
ROLL uses Ray as its distributed execution foundation, providing resource allocation and task scheduling across GPU clusters.
For detailed distributed execution mechanisms, see Distributed Execution.
Sources: README.md138-142
The MCA (Megatron-Core Adapter) system enables seamless conversion between HuggingFace and Megatron-Core model formats for a range of supported model families.
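Conceptually, such an adapter has to translate parameter names and layouts between the two formats. The sketch below shows only the flavor of that translation; the mapping entries and the `translate_name` helper are illustrative assumptions, not mcore_adapter's actual rules.

```python
# Purely illustrative sketch of HF <-> Megatron-style parameter-name translation.

HF_TO_MCORE_NAME_HINTS = {
    "model.embed_tokens.weight": "embedding.word_embeddings.weight",
    "lm_head.weight": "output_layer.weight",
}

def translate_name(hf_name: str) -> str:
    """Map an HF parameter name to a Megatron-style name when a rule exists."""
    return HF_TO_MCORE_NAME_HINTS.get(hf_name, hf_name)

print(translate_name("model.embed_tokens.weight"))
```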
For detailed adapter documentation, see Megatron-Core Adapter System.
Sources: README.md65-66 README.md140-142
ROLL uses a hierarchical YAML-based configuration system. Configuration flows from YAML files through pipeline-specific configs down to worker-level settings, as illustrated by the sketch below.
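The following hedged sketch uses OmegaConf to illustrate that flow. The YAML keys (`actor_train`, `training_args`, etc.) and the `TrainingArgs` dataclass are assumptions for illustration and may not match ROLL's actual schema.

```python
# Sketch of hierarchical YAML flowing into typed, worker-level settings.
from dataclasses import dataclass

from omegaconf import OmegaConf

YAML_TEXT = """
exp_name: demo
actor_train:
  model_args:
    model_name_or_path: Qwen/Qwen2.5-0.5B-Instruct
  training_args:
    learning_rate: 1.0e-6
"""

@dataclass
class TrainingArgs:
    learning_rate: float = 1.0e-5   # default, overridden by the YAML above

cfg = OmegaConf.create(YAML_TEXT)
# Merge the worker-level section of the YAML onto the typed defaults.
train_args = OmegaConf.merge(OmegaConf.structured(TrainingArgs), cfg.actor_train.training_args)
print(cfg.exp_name, train_args.learning_rate)
```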
For detailed configuration documentation, see Configuration System.
Sources: README.md76
ROLL uses DataProto as the standard data transfer protocol between distributed components. DataProto separates tensor and non-tensor data for efficient transfer.
This separation enables zero-copy transfers for tensor data via CUDA IPC while keeping metadata on CPU.
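A rough sketch of what such a container looks like is shown below; `DataProtoSketch` and its fields are illustrative and do not mirror ROLL's actual DataProto API.

```python
# Illustrative container that keeps tensor and non-tensor payloads separate.
from dataclasses import dataclass, field
from typing import Any, Dict

import torch


@dataclass
class DataProtoSketch:
    batch: Dict[str, torch.Tensor] = field(default_factory=dict)       # tensor payload
    non_tensor_batch: Dict[str, Any] = field(default_factory=dict)     # e.g. raw prompt strings
    meta_info: Dict[str, Any] = field(default_factory=dict)            # lightweight CPU-side metadata

    def __len__(self) -> int:
        first = next(iter(self.batch.values()), torch.empty(0))
        return first.shape[0]


sample = DataProtoSketch(
    batch={"input_ids": torch.randint(0, 100, (4, 16))},
    non_tensor_batch={"prompts": ["q1", "q2", "q3", "q4"]},
    meta_info={"temperature": 1.0},
)
print(len(sample), sample.meta_info)
```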
For detailed data processing documentation, see Data Processing.
Sources: README.md138-142
Within multi-domain training, sample batches are allocated across domains according to the configured `domain_batch_size` distribution.

Sources: README.md125-152
To begin using ROLL, start with the example pipeline configurations listed above and consult Getting Started for detailed setup instructions.
Sources: README.md70-85