
Commit 843a06a

shanmugamr1992 (Shanmugam Ramasamy) authored and committed
fix: ADDING DOCS (#1595)
Co-authored-by: Shanmugam Ramasamy <[email protected]>
Signed-off-by: Zhuoran Yin <[email protected]>
1 parent 526e05f commit 843a06a

File tree: 1 file changed

docs/design-docs/generation.md

Lines changed: 88 additions & 4 deletions
@@ -1,6 +1,6 @@
# Generation Interface

-This document explains the token generation interface and various backends for the NeMo RL framework. The generation system is designed with a unified interface that allows different backends (like VLLM, Hugging Face, SGLang, and TRT-LLM) to provide token generation capabilities while adhering to the same API.
+This document explains the token generation interface and various backends for the NeMo RL framework. The generation system is designed with a unified interface that allows different backends (like VLLM, Megatron, Hugging Face, SGLang, and TRT-LLM) to provide token generation capabilities while adhering to the same API.

## Generation Interface

@@ -12,7 +12,7 @@ The core of the generation system is defined in `interfaces.py`, which establish
```python
class GenerationConfig(TypedDict):
    """Configuration for generation."""
-    backend: str  # The backend to use (e.g., "vllm", "hf")
+    backend: str  # The backend to use (e.g., "vllm", "megatron", "hf")
    max_new_tokens: int  # Maximum number of tokens to generate
    temperature: float  # Sampling temperature
    top_p: float  # Top-p sampling parameter
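
# Illustrative sketch only (not part of the file above): a plain dict populating
# the GenerationConfig fields shown in this snippet. The full TypedDict in
# `interfaces.py` defines additional keys, and the values here are examples.
example_generation_config = {
    "backend": "megatron",   # or "vllm" / "hf"
    "max_new_tokens": 512,   # cap on newly generated tokens per request
    "temperature": 1.0,      # sampling temperature
    "top_p": 1.0,            # nucleus (top-p) sampling parameter
}
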
@@ -60,6 +60,10 @@ The core of the generation system is defined in `interfaces.py`, which establish

A key design principle for generation backends is that they process tokens directly, without involving the tokenizer. By ensuring that only tokens are exchanged, we eliminate the risk of inconsistencies arising from different tokenizer versions or specifications between the training and generation frameworks.

+## Generation Backends
+
+NeMo RL supports multiple generation backends that implement the {py:class}`GenerationInterface <nemo_rl.models.generation.interfaces.GenerationInterface>` to provide efficient text generation for different use cases.
+
## VLLM Backend

The VLLM backend (`models/generation/vllm/vllm_generation.py`) implements the {py:class}`GenerationInterface <nemo_rl.models.generation.interfaces.GenerationInterface>` to provide efficient text generation using the VLLM library, which is optimized for large language models.
@@ -90,9 +94,63 @@ The {py:class}`UpdatableVllmInternalWorker <nemo_rl.models.generation.vllm_backe
2. Updating weights from IPC handles for efficient weight sharing.
3. Checking if weights have been updated correctly.

-## Usage Example
+## Megatron Backend
+
+The Megatron backend provides native Megatron-Core inference capabilities, eliminating the need for weight conversion between training and generation. This backend is particularly beneficial when using Megatron for training, as it enables seamless integration and optimal performance.
+
+### Key Features
+
+1. **No Weight Conversion**: Uses the same Megatron model format for both training and generation, eliminating conversion overhead and potential inconsistencies.
+2. **CUDA Graph Support**: Leverages CUDA graphs for optimized inference performance.
+3. **Dynamic Inference Engine**: Utilizes Megatron Core's `DynamicInferenceEngine` for efficient batched generation.
+4. **Integrated with Training**: The generation capability is built directly into the `MegatronPolicyWorker`, enabling efficient co-located training and generation.
+
+### MegatronPolicyWorker Generation

-To use a generation backend:
+The Megatron generation backend is implemented within the {py:class}`MegatronPolicyWorker <nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker>` class. The {py:meth}`generate <nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker.generate>` method performs the following:
+
+1. Wraps the Megatron model with `GPTInferenceWrapper` for inference optimization.
+2. Creates a `DynamicInferenceContext` to manage inference state and memory.
+3. Initializes a `DynamicInferenceEngine` with CUDA graph support enabled.
+4. Processes batched requests with proper sampling parameters (temperature, top_k, top_p).
+5. Returns outputs conforming to {py:class}`GenerationOutputSpec <nemo_rl.models.generation.interfaces.GenerationOutputSpec>`.
+
+### Configuration
+
+To use the Megatron generation backend, configure your YAML file as follows:
+
+```yaml
+policy:
+  megatron_cfg:
+    enabled: true
+  generation:
+    backend: megatron
+    max_new_tokens: 512
+    temperature: 1.0
+    top_p: 1.0
+    top_k: null
+    mcore_generation_config:
+      buffer_size_gb: 20  # Memory buffer size for inference context
+      buffer_guaranteed_fraction: 0.1  # Fraction of buffer guaranteed to be available for active requests
+      num_cuda_graphs: 16  # Number of CUDA graphs to pre-allocate
+      max_tokens: 16384  # Maximum number of tokens for inference
+```
+
+### Configuration Parameters
+
+The `mcore_generation_config` section controls Megatron Core inference engine behavior:
+
+- **buffer_size_gb**: Total memory buffer size (in GB) allocated for the dynamic inference context. This determines how much GPU memory is reserved for KV caches and intermediate states; a larger buffer allows more requests to be pulled in at once.
+- **buffer_guaranteed_fraction**: Fraction of the buffer (between 0.0 and 1.0) that is guaranteed to remain available, ensuring that active requests always have memory to run to completion.
+- **num_cuda_graphs**: Number of CUDA graphs to pre-allocate for different batch sizes. More graphs can improve performance by avoiding runtime graph capture, but they consume more memory.
+- **max_tokens**: Maximum total number of tokens (across all requests) that can be processed simultaneously. This limits the possible combinations of batch size and sequence length; increasing it can cause out-of-memory errors depending on the vocabulary size and the allocated buffer size.
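
As a rough illustration of how the first two settings interact (the exact memory accounting is internal to Megatron Core, and the variable names below exist only for this sketch), the example values above hold back a guaranteed slice of the buffer:

```python
# Back-of-the-envelope sketch only, using the example values from the YAML above.
buffer_size_gb = 20               # total buffer allocated for the inference context
buffer_guaranteed_fraction = 0.1  # fraction kept available for active requests

guaranteed_gb = buffer_size_gb * buffer_guaranteed_fraction
print(f"~{guaranteed_gb:.1f} GB of the {buffer_size_gb} GB buffer stays reserved for in-flight requests")
# ~2.0 GB of the 20 GB buffer stays reserved for in-flight requests
```
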
+
+
+## Usage Examples
+
+### Using VLLM Backend
+
+To use the VLLM generation backend:

```python
from nemo_rl.algorithms.utils import get_tokenizer
@@ -133,6 +191,32 @@ output = generator.generate(input_data, greedy=False)
generator.finish_generation()
```

+### Using Megatron Backend
+
+To use the Megatron generation backend, configure your YAML file:
+
+```yaml
+policy:
+  model_name: meta-llama/Llama-3.2-1B-Instruct
+  megatron_cfg:
+    enabled: true
+  generation:
+    backend: megatron
+    max_new_tokens: 512
+    temperature: 1.0
+    top_p: 1.0
+    top_k: null
+    mcore_generation_config:
+      buffer_size_gb: 20
+      buffer_guaranteed_fraction: 0.1
+      num_cuda_graphs: 16
+      max_tokens: 16384
+```
+
+For a complete example, see:
+- **Configuration**: `examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron_generation.yaml`
+- **Test Script**: `tests/functional/grpo_megatron_generation.sh`
+
## Extend with New Backends

To add a new generation backend:
