This document provides a comprehensive overview of ROLL (Reinforcement Learning Optimization for Large-Scale Learning), explaining its architecture, core components, and capabilities. ROLL is a distributed reinforcement learning framework designed for training and fine-tuning Large Language Models (LLMs) at scale using GPU clusters.
This page covers the high-level system architecture and component organization. For detailed setup instructions, see Getting Started. For specific pipeline implementations, see Training Pipelines. For distributed execution details, see Distributed Execution.
Sources: README.md1-228
ROLL is an efficient, production-grade RL library that leverages distributed GPU resources to enhance LLM capabilities in three primary areas: human preference alignment, complex reasoning, and multi-turn agentic interaction.
The framework is built on a multi-role distributed architecture that uses Ray for resource allocation and task scheduling. It integrates multiple state-of-the-art backends, including Megatron-Core and DeepSpeed for training and vLLM and SGLang for high-throughput inference.
Sources: README.md34-38
The architecture is organized into five distinct layers.
Sources: README.md34-152
ROLL provides five distinct pipeline implementations, each targeting specific training objectives:
| Pipeline | Primary Use Case | Key Features | Example Config |
|---|---|---|---|
| `RLVRPipeline` | Multi-domain RL training | Sample-level async rollout, dynamic sampling, domain-specific rewards | `examples/qwen2.5-7B-rlvr_megatron/rlvr_config.yaml` |
| `AgenticPipeline` | Multi-turn agentic RL | Environment-level async rollout, trajectory-wise/step-wise learning | `examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake.yaml` |
| `DistillPipeline` | Knowledge distillation | Teacher-student logits transfer, combined loss objectives | `examples/qwen2.5-7B-distill_megatron/distill_vl_megatron.yaml` |
| `DpoPipeline` | Direct preference optimization | Preference-based training | `examples/qwen2.5-3B-dpo_megatron/` |
| `SFTPipeline` | Supervised fine-tuning | Standard SFT with Megatron integration | `examples/qwen2.5-7B-sft_megatron/sft_config.yaml` |
Each pipeline inherits from a common base and implements specific rollout, reward, and training logic appropriate to its use case.
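To make the shared structure concrete, here is a minimal sketch of that common-base pattern. The class names (`BasePipeline`, `SFTStylePipeline`) and hook methods are illustrative assumptions, not ROLL's actual API; they only show the rollout → reward → train loop that each pipeline specializes.

```python
# Illustrative sketch of the shared pipeline pattern; not ROLL's actual classes.

class BasePipeline:
    """Common skeleton: rollout -> reward -> training update, driven by a config."""

    def __init__(self, config):
        self.config = config

    def run(self, num_steps: int):
        for step in range(num_steps):
            batch = self.rollout()              # generate responses / trajectories
            batch = self.compute_rewards(batch)  # score them
            metrics = self.train_step(batch)     # update the policy
            print(f"step={step} metrics={metrics}")

    # Concrete pipelines (RLVR-style, agentic-style, SFT-style, ...) override these hooks.
    def rollout(self):
        raise NotImplementedError

    def compute_rewards(self, batch):
        raise NotImplementedError

    def train_step(self, batch):
        raise NotImplementedError


class SFTStylePipeline(BasePipeline):
    """SFT needs no reward model: 'rollout' just yields supervised batches."""

    def rollout(self):
        return {"input_ids": [], "labels": []}   # placeholder data loading

    def compute_rewards(self, batch):
        return batch                              # no reward stage

    def train_step(self, batch):
        return {"loss": 0.0}                      # placeholder update


SFTStylePipeline(config={}).run(num_steps=1)
```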
Sources: README.md126-151
Workers are Ray remote actors that encapsulate model operations. Each worker type serves a specific role:
- `ActorWorker` - Manages the policy model for both training (gradient updates) and inference (rollout generation)
- `CriticWorker` - Computes value estimates for advantage calculation in actor-critic algorithms
- `RewardWorker` - Evaluates generated responses using reward models or rule-based metrics
- `EnvironmentWorker` - Manages environment instances for agentic interactions (games, dialogues, tools)
- `StudentWorker` / `TeacherWorker` - Handle distillation pipeline operations

Workers use the strategy pattern to abstract backend implementations, allowing runtime selection of Megatron, DeepSpeed, vLLM, or SGLang, as sketched below.
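The sketch below illustrates the worker-as-Ray-actor idea with a pluggable backend strategy. `ActorWorker` and `TrainStrategy` here are simplified stand-ins and do not reflect ROLL's real class signatures.

```python
# Illustrative sketch (not ROLL's actual API) of a worker implemented as a
# Ray remote actor that delegates to a runtime-selected backend strategy.
import ray

ray.init(ignore_reinit_error=True)


class TrainStrategy:
    """Backend abstraction: a real system would wrap Megatron, DeepSpeed, etc."""
    def train_step(self, batch):
        return {"loss": 0.0}


@ray.remote
class ActorWorker:
    """Holds the policy model; the strategy decides which backend does the work."""
    def __init__(self, strategy_name: str):
        # Runtime backend selection; in a real system each key would map to a
        # different backend wrapper class.
        registry = {"megatron": TrainStrategy, "deepspeed": TrainStrategy}
        self.strategy = registry[strategy_name]()

    def train_step(self, batch):
        return self.strategy.train_step(batch)


worker = ActorWorker.remote("megatron")
print(ray.get(worker.train_step.remote({"input_ids": []})))
```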
Sources: README.md138-142
The Cluster abstraction manages groups of workers as a single logical unit; each cluster type coordinates a specific set of workers.
This abstraction enables flexible deployment topologies: co-located (training and inference on same GPUs), disaggregated (separate GPU pools), or hybrid configurations.
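A minimal sketch of the "group of workers as one logical unit" idea follows. The `Cluster` and `Worker` classes below are hypothetical, not ROLL's implementation; they only show how a single call can fan out across the group.

```python
# Hypothetical sketch of a cluster that fans one call out to many Ray workers.
import ray


@ray.remote
class Worker:
    def __init__(self, rank: int):
        self.rank = rank

    def apply(self, method: str, *args):
        # Dispatch a named method on this worker; used by the cluster broadcast.
        return getattr(self, method)(*args)

    def ping(self):
        return f"worker {self.rank} ok"


class Cluster:
    """Treat a group of workers as one logical unit."""
    def __init__(self, num_workers: int):
        self.workers = [Worker.remote(rank) for rank in range(num_workers)]

    def broadcast(self, method: str, *args):
        # Issue the call on every worker and gather the results.
        return ray.get([w.apply.remote(method, *args) for w in self.workers])


if __name__ == "__main__":
    ray.init(ignore_reinit_error=True)
    print(Cluster(num_workers=2).broadcast("ping"))
```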
Sources: README.md138-142
ROLL provides over 20 RL algorithm implementations organized into four categories:
| Category | Algorithms | Characteristics |
|---|---|---|
| On-Policy | PPO, Lite PPO, TOPR | Critic-based, policy gradients with clipping |
| Off-Policy | GRPO, GSPO | Group-based optimization, no critic needed |
| Trajectory-Based | Reinforce++, RAFT++, StarPO | Episode-level rewards, trajectory rollouts |
| Step-Based | GiGPO | Per-step optimization for agentic tasks |
Algorithm selection is controlled through configuration parameters. The `advantage_estimator` field determines which advantage computation method is used, while `reward_normalization` controls how rewards are normalized across batches or groups.
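A small sketch of what such config-driven selection can look like: the keys `advantage_estimator` and `reward_normalization` come from the text above, while the estimator names and implementations below are simplified placeholders, not ROLL's code.

```python
# Sketch of config-driven advantage-estimator selection (placeholder implementations).

def gae_advantage(rewards, values):
    # Simplified critic-based estimate: reward minus value baseline.
    return [r - v for r, v in zip(rewards, values)]

def group_mean_advantage(rewards, values=None):
    # Simplified group-baseline estimate (GRPO-style): reward minus group mean.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

ADVANTAGE_ESTIMATORS = {
    "gae": gae_advantage,
    "group_mean": group_mean_advantage,
}

def build_advantage_fn(config: dict):
    name = config["advantage_estimator"]
    if name not in ADVANTAGE_ESTIMATORS:
        raise ValueError(f"unknown advantage_estimator: {name!r}")
    return ADVANTAGE_ESTIMATORS[name]

adv_fn = build_advantage_fn({"advantage_estimator": "group_mean", "reward_normalization": "group"})
print(adv_fn([1.0, 0.0, 1.0]))
```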
For detailed algorithm documentation, see RL Algorithms.
Sources: README.md94-104 README.md135-137
ROLL integrates multiple backend systems to support different hardware configurations and performance requirements:
- Megatron-Core (Megatron-LM Strategy): `MegatronTrainStrategy`
- DeepSpeed (DeepSpeed Strategy): `DeepSpeedTrainStrategy`
- FSDP: under development
- vLLM (vLLM Integration): `VllmStrategy`
- SGLang (SGLang Integration): `SgLangStrategy`

Sources: README.md138-144
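The sketch below ties the strategy class names listed above to a hypothetical configuration-driven registry. The registry keys and the `create_strategy` helper are assumptions for illustration, not ROLL's actual lookup mechanism.

```python
# Illustrative registry mapping configuration values to backend strategy classes.

class MegatronTrainStrategy: ...
class DeepSpeedTrainStrategy: ...
class VllmStrategy: ...
class SgLangStrategy: ...

STRATEGY_REGISTRY = {
    "megatron_train": MegatronTrainStrategy,
    "deepspeed_train": DeepSpeedTrainStrategy,
    "vllm": VllmStrategy,
    "sglang": SgLangStrategy,
}

def create_strategy(name: str):
    try:
        return STRATEGY_REGISTRY[name]()
    except KeyError as err:
        raise ValueError(f"unsupported strategy: {name!r}") from err

print(type(create_strategy("vllm")).__name__)
```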
ROLL uses Ray as its distributed execution foundation, providing resource allocation and task scheduling across GPU clusters.
For detailed distributed execution mechanisms, see Distributed Execution.
Sources: README.md138-142
The MCA (Megatron-Core Adapter) system enables seamless conversion between HuggingFace and Megatron-Core model formats for a range of supported model families.
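Conceptually, such an adapter has to translate parameter names and layouts between the two formats. The sketch below shows only the flavor of that translation; the mapping entries and the `translate_name` helper are illustrative assumptions, not mcore_adapter's actual rules.

```python
# Purely illustrative sketch of HF <-> Megatron-style parameter-name translation.

HF_TO_MCORE_NAME_HINTS = {
    "model.embed_tokens.weight": "embedding.word_embeddings.weight",
    "lm_head.weight": "output_layer.weight",
}

def translate_name(hf_name: str) -> str:
    """Map an HF parameter name to a Megatron-style name when a rule exists."""
    return HF_TO_MCORE_NAME_HINTS.get(hf_name, hf_name)

print(translate_name("model.embed_tokens.weight"))
```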
For detailed adapter documentation, see Megatron-Core Adapter System.
Sources: README.md65-66 README.md140-142
ROLL uses a hierarchical YAML-based configuration system. Configuration flows from YAML files through pipeline-specific configs down to worker-level settings, as illustrated by the sketch below.
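The following hedged sketch uses OmegaConf to illustrate that flow. The YAML keys (`actor_train`, `training_args`, etc.) and the `TrainingArgs` dataclass are assumptions for illustration and may not match ROLL's actual schema.

```python
# Sketch of hierarchical YAML flowing into typed, worker-level settings.
from dataclasses import dataclass

from omegaconf import OmegaConf

YAML_TEXT = """
exp_name: demo
actor_train:
  model_args:
    model_name_or_path: Qwen/Qwen2.5-0.5B-Instruct
  training_args:
    learning_rate: 1.0e-6
"""

@dataclass
class TrainingArgs:
    learning_rate: float = 1.0e-5   # default, overridden by the YAML above

cfg = OmegaConf.create(YAML_TEXT)
# Merge the worker-level section of the YAML onto the typed defaults.
train_args = OmegaConf.merge(OmegaConf.structured(TrainingArgs), cfg.actor_train.training_args)
print(cfg.exp_name, train_args.learning_rate)
```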
For detailed configuration documentation, see Configuration System.
Sources: README.md76
ROLL uses DataProto as the standard data transfer protocol between distributed components. DataProto separates tensor and non-tensor data for efficient transfer.
This separation enables zero-copy transfers for tensor data via CUDA IPC while keeping metadata on CPU.
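A rough sketch of what such a container looks like is shown below; `DataProtoSketch` and its fields are illustrative and do not mirror ROLL's actual DataProto API.

```python
# Illustrative container that keeps tensor and non-tensor payloads separate.
from dataclasses import dataclass, field
from typing import Any, Dict

import torch


@dataclass
class DataProtoSketch:
    batch: Dict[str, torch.Tensor] = field(default_factory=dict)       # tensor payload
    non_tensor_batch: Dict[str, Any] = field(default_factory=dict)     # e.g. raw prompt strings
    meta_info: Dict[str, Any] = field(default_factory=dict)            # lightweight CPU-side metadata

    def __len__(self) -> int:
        first = next(iter(self.batch.values()), torch.empty(0))
        return first.shape[0]


sample = DataProtoSketch(
    batch={"input_ids": torch.randint(0, 100, (4, 16))},
    non_tensor_batch={"prompts": ["q1", "q2", "q3", "q4"]},
    meta_info={"temperature": 1.0},
)
print(len(sample), sample.meta_info)
```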
For detailed data processing documentation, see Data Processing.
Sources: README.md138-142
Within multi-domain training, sample batches are allocated across domains according to the configured `domain_batch_size` distribution.

Sources: README.md125-152
To begin using ROLL, start with the example pipeline configurations listed above and consult Getting Started for detailed setup instructions.
Sources: README.md70-85