# Generation Interface

This document explains the token generation interface and various backends for the NeMo RL framework. The generation system is designed with a unified interface that allows different backends (like VLLM, Megatron, Hugging Face, SGLang, and TRT-LLM) to provide token generation capabilities while adhering to the same API.

## Generation Interface

The core of the generation system is defined in `interfaces.py`, which establishes the generation configuration:

```python
class GenerationConfig(TypedDict):
    """Configuration for generation."""

    backend: str  # The backend to use (e.g., "vllm", "megatron", "hf")
    max_new_tokens: int  # Maximum number of tokens to generate
    temperature: float  # Sampling temperature
    top_p: float  # Top-p sampling parameter
```
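
For orientation, the snippet below shows what a concrete configuration might look like. This is a minimal sketch: only the fields shown above are filled in, and the import path for `GenerationConfig` is assumed from the `nemo_rl.models.generation.interfaces` references used elsewhere in this document.

```python
# Minimal sketch of a concrete GenerationConfig. The import path is an
# assumption based on the interface references in this document; because
# GenerationConfig is a TypedDict, the value itself is a plain dictionary.
from nemo_rl.models.generation.interfaces import GenerationConfig

generation_config: GenerationConfig = {
    "backend": "vllm",      # or "megatron" / "hf"
    "max_new_tokens": 512,  # cap on the number of generated tokens
    "temperature": 1.0,     # sampling temperature
    "top_p": 1.0,           # nucleus sampling threshold
}
```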

A key design principle for generation backends is that they process tokens directly, without involving the tokenizer. By ensuring that only tokens are exchanged, we eliminate the risk of inconsistencies arising from different tokenizer versions or specifications between the training and generation frameworks.
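
To illustrate this contract, the sketch below keeps the tokenizer entirely on the caller's side and hands the backend nothing but token IDs. The `FakeBackend` class and its `generate` signature are hypothetical stand-ins for a real backend, not the actual NeMo RL API.

```python
# Illustrative sketch of the tokens-in / tokens-out contract. FakeBackend is a
# stand-in for a real generation backend; the actual NeMo RL API differs.
import torch
from transformers import AutoTokenizer


class FakeBackend:
    def generate(self, input_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
        # A real backend (vLLM, Megatron, ...) would run the model here.
        # This stub echoes the prompt tokens to keep the example runnable.
        return input_ids


tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for the sketch
backend = FakeBackend()

# Tokenization happens on the caller's side ...
input_ids = tokenizer("Explain PPO in one sentence.", return_tensors="pt")["input_ids"]

# ... the backend sees and returns only token IDs ...
output_ids = backend.generate(input_ids, max_new_tokens=32)

# ... and detokenization also happens on the caller's side.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```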

## Generation Backends

NeMo RL supports multiple generation backends that implement the {py:class}`GenerationInterface <nemo_rl.models.generation.interfaces.GenerationInterface>` to provide efficient text generation for different use cases.

## VLLM Backend

The VLLM backend (`models/generation/vllm/vllm_generation.py`) implements the {py:class}`GenerationInterface <nemo_rl.models.generation.interfaces.GenerationInterface>` to provide efficient text generation using the VLLM library, which is optimized for large language models.

Among its responsibilities, the `UpdatableVllmInternalWorker` handles:

- Updating weights from IPC handles for efficient weight sharing.
- Checking if weights have been updated correctly.

## Megatron Backend

The Megatron backend provides native Megatron-Core inference capabilities, eliminating the need for weight conversion between training and generation. This backend is particularly beneficial when using Megatron for training, as it enables seamless integration and optimal performance.

### Key Features

1. **No Weight Conversion**: Uses the same Megatron model format for both training and generation, eliminating conversion overhead and potential inconsistencies.
2. **CUDA Graph Support**: Leverages CUDA graphs for optimized inference performance.
4. **Integrated with Training**: The generation capability is built directly into the `MegatronPolicyWorker`, enabling efficient co-located training and generation.

### MegatronPolicyWorker Generation

The Megatron generation backend is implemented within the {py:class}`MegatronPolicyWorker <nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker>` class. Its {py:meth}`generate <nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker.generate>` method performs the following steps:

1. Wraps the Megatron model with `GPTInferenceWrapper` for inference optimization.
2. Creates a `DynamicInferenceContext` to manage inference state and memory.
3. Initializes a `DynamicInferenceEngine` with CUDA graph support enabled.
5. Returns outputs conforming to {py:class}`GenerationOutputSpec <nemo_rl.models.generation.interfaces.GenerationOutputSpec>`, as sketched after this list.
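
To make the last step concrete, the sketch below shows how a caller might consume such an output. The dictionary keys (`output_ids`, `generation_lengths`) are assumptions used for illustration only; the authoritative field names are defined by `GenerationOutputSpec` in `interfaces.py`.

```python
# Hypothetical sketch of consuming a GenerationOutputSpec-like result.
# The keys below are assumed for illustration; see interfaces.py for the
# actual specification.
import torch

outputs = {
    "output_ids": torch.randint(0, 32_000, (2, 16)),  # [batch, seq] right-padded token IDs
    "generation_lengths": torch.tensor([10, 16]),     # number of valid tokens per sequence
}

for ids, length in zip(outputs["output_ids"], outputs["generation_lengths"]):
    valid_ids = ids[: int(length)]  # strip right padding before detokenizing elsewhere
    print(valid_ids.tolist())
```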

### Configuration

To use the Megatron generation backend, configure your YAML file as follows:

```yaml
policy:
  megatron_cfg:
    enabled: true
  generation:
    backend: megatron
    max_new_tokens: 512
    temperature: 1.0
    top_p: 1.0
    top_k: null
    mcore_generation_config:
      buffer_size_gb: 20  # Memory buffer size for the inference context
      buffer_guaranteed_fraction: 0.1  # Fraction of the buffer guaranteed to be available for active requests
      num_cuda_graphs: 16  # Number of CUDA graphs to pre-allocate
      max_tokens: 16384  # Maximum number of tokens for inference
```

### Configuration Parameters

The `mcore_generation_config` section controls the behavior of the Megatron Core inference engine:

- **buffer_size_gb**: Total memory buffer size (in GB) allocated for the dynamic inference context. This determines how much GPU memory is reserved for KV caches and intermediate states; a larger buffer allows more requests to be scheduled at once.
- **buffer_guaranteed_fraction**: Fraction of the buffer (between 0.0 and 1.0) that is guaranteed to remain available, ensuring that active requests always have enough memory to complete.
- **num_cuda_graphs**: Number of CUDA graphs to pre-allocate for different batch sizes. More graphs can improve performance by avoiding runtime graph capture, but they consume more memory.
- **max_tokens**: Maximum total number of tokens (across all requests) that can be processed simultaneously. This limits the possible combinations of batch size and sequence length; increasing it can cause out-of-memory errors depending on the vocabulary size and the allocated buffer size (see the sizing sketch below).
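
As a rough aid for choosing `buffer_size_gb` and `max_tokens` together, the back-of-the-envelope sketch below estimates how many cached tokens fit in a given buffer. The model dimensions, and the assumption that the buffer is dominated by a bf16 KV cache, are illustrative rather than values taken from any particular configuration.

```python
# Back-of-the-envelope sizing sketch (illustrative assumptions, not NeMo RL code).
# KV cache per token ~= 2 (K and V) * num_layers * num_kv_heads * head_dim * bytes_per_value.
num_layers = 32       # assumed model depth (roughly an 8B-parameter model)
num_kv_heads = 8      # assumed KV heads (grouped-query attention)
head_dim = 128        # assumed head dimension
bytes_per_value = 2   # bf16

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
buffer_size_gb = 20
tokens_that_fit = int(buffer_size_gb * 1024**3 / kv_bytes_per_token)

print(f"{kv_bytes_per_token / 1024:.0f} KiB of KV cache per token -> "
      f"roughly {tokens_that_fit:,} cached tokens in {buffer_size_gb} GB")
# If this estimate is close to (or below) max_tokens, the buffer is the binding
# constraint, so buffer_size_gb and max_tokens should be tuned together.
```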

## Usage Examples

### Using VLLM Backend

To use the VLLM generation backend:

```python
from nemo_rl.algorithms.utils import get_tokenizer