trt-llm integration #194

Merged
merged 1 commit on Aug 30, 2024
78 changes: 40 additions & 38 deletions CHANGELOG.md
@@ -6,74 +6,76 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
## [Next Version]
- Implement reward-aware preference optimization.
- Fix log probs mismatch issue between policy and reference policy in DPO & variants.

### New features and optimizations
- Critic and Reward Model server refactored. Now the reward model will have a flag called `model.forward_micro_batch_size` which determines the micro batch size that it runs inferences with. This can be higher than the training micro batch size since during inference we have less memory pressure.
- In the critic and reward model server it is now possible to specify `inference_micro_batch_size` as a list, this allows us to give more information to PyTriton on the preferred batch sizes we want to run inference with.
- Added TRT-LLM support in PPO. This can be enabled by `trainer.ppo.trt_llm.enable=True`. There is also a reshard option to reshard out pipeline parallelism during inference (i.e., running tensor and data parallel only) for further speedup via `trainer.ppo.trt_llm.reshard=True`.
- The PPO algorithm will now double-check that generated samples ended with one of the stop words from `sampling_params.end_strings`, and zero out their gradients if this is not the case (which happens when reaching the maximum generation length).
- Added critic warmup to PPO via the flag `trainer.ppo.critic_warmup_steps`.
- PPO log probs are now computed with `higher_stability=True`. This can change results for some models, but should result in overall greater stability.
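
  A minimal command-line sketch of how the TRT-LLM and critic-warmup flags above might be combined, assuming the Hydra-style overrides used by the Aligner example scripts; the script path and the warmup value are illustrative, while the flag names come from the entries above:

  ```bash
  # Sketch only: enable TRT-LLM generation in PPO, reshard away pipeline
  # parallelism during inference, and warm up the critic before PPO updates.
  # The script path and the value 10 are illustrative assumptions.
  python examples/nlp/gpt/train_gpt_ppo_actor.py \
      trainer.ppo.trt_llm.enable=True \
      trainer.ppo.trt_llm.reshard=True \
      trainer.ppo.critic_warmup_steps=10
  ```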

### New Features and Optimizations
- Critic and Reward Model server refactored. Now the reward model will have a flag called `model.forward_micro_batch_size` which determines the micro batch size on which it runs inferences. This can be higher than the training micro batch size since during inference, we have less memory pressure.
- In the critic and reward model server, it is now possible to specify `inference_micro_batch_size` as a list. This allows us to provide more information to PyTriton regarding the preferred batch sizes for inference.
- It is no longer a requirement to specify `num_rollout_samples` to be a multiple of `inference_micro_batch_size * dp size` in PPO.
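
  A hedged sketch of the two server-side knobs above — a larger inference micro batch size plus a list of preferred PyTriton batch sizes; the script path, checkpoint path, and values are illustrative assumptions:

  ```bash
  # Sketch only: serve a reward model with forward_micro_batch_size larger than
  # the training micro batch size and give PyTriton several preferred inference
  # batch sizes. Paths and values are illustrative assumptions.
  python examples/nlp/gpt/serve_reward_model.py \
      rm_model_file=/path/to/reward_model.nemo \
      model.forward_micro_batch_size=16 \
      'inference.inference_micro_batch_size=[4,8,16]'
  ```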

### Breaking changes
- `inference.micro_batch_size` is now renamed to `inference.inference_micro_batch_size` when running reward model inference in `inference_rm.yaml` this is to stay consistent with the naming scheme of the PPO critic.
### Breaking Changes
- `inference.micro_batch_size` is now renamed to `inference.inference_micro_batch_size` when running reward model inference in `inference_rm.yaml`. This is to stay consistent with the naming scheme of the PPO critic.
- It is no longer possible to specify `add_EOS` when running reward model or critic inference.
- Aligner now requires Megatron-LM>=0.8.0 for the APIs to calculate the microbatch sizes
- NeMo-Aligner now requires Megatron-LM>=0.8.0 for the APIs to calculate the microbatch sizes.

### Bug Fixes
- Make `num_workers` for dataloaders 0 by default. This prevents issues when using MPI (with TRT-LLM) or more sophisticated launchers.

## [0.3.1] - 2024-05
- SPIN: added `rollout_micro_batch_size` parameter which allows users to set the batch size for doing generation during SPIN training.
previously the generation batch size was automatically set to the data parallel size (DP) of the model
- SPIN: added wandb logging of average generation length and a small sample of generated responses (in plaintext) along with corresponding prompts
- SPIN: added `rollout_micro_batch_size` parameter which allows users to set the batch size for doing generation during SPIN training. Previously, the generation batch size was automatically set to the data parallel size (DP) of the model.
- SPIN: added wandb logging of average generation length and a small sample of generated responses (in plaintext) along with their corresponding prompts.
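
  A hedged sketch of setting the SPIN generation batch size explicitly instead of inheriting the data parallel size; the script path and the exact config location are assumptions:

  ```bash
  # Sketch only: pin the SPIN generation (rollout) batch size to 4 rather than
  # defaulting to the model's data parallel size. The script path and the exact
  # nesting of rollout_micro_batch_size are illustrative assumptions.
  python examples/nlp/gpt/train_gpt_spin.py \
      model.spin.rollout_micro_batch_size=4
  ```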

### New features and optimizations
### New Features and Optimizations
- Add MoE Support for our reward models.
- SFT/SteerLM: LoRA can now be enabled on all model layers
- DPO: Enable LoRA on all model layers (In this case the actor will be reference model + LoRA weights, we can switch between actor/reference model by enabling/disabling LoRA)
- PPO: Enable LoRA on all model layers (In this case the actor will be init policy + LoRA weights, we can switch between actor/init_policy model by enabling/disabling LoRA)
- SFT/SteerLM: LoRA can now be enabled on all model layers.
- DPO: Enable LoRA on all model layers. In this case, the actor will be a reference model plus LoRA weights. We can switch between the actor/reference model by enabling or disabling LoRA.
- PPO: Enable LoRA on all model layers. In this case, the actor will be the init policy plus LoRA weights. We can switch between the actor/init_policy model by enabling or disabling LoRA.
- SteerLM 2.0: Add the SteerLM 2.0 model alignment method.
- Added support for float values for `val_check_interval` for SFT
- Added support for `limit_train_batches` as a float or int to DPO, SPIN, and SFT. This functionality mirrors the same parameter in PTL
### Breaking changes
- `val_check_interval` in SFT now supports float values.
- Added support for `limit_train_batches` as a float or int to DPO, SPIN, and SFT. This functionality mirrors the same parameter in PTL.
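
  A hedged sketch of the fractional `val_check_interval` and `limit_train_batches` support in SFT, mirroring the PTL semantics; the script path and config nesting are assumptions:

  ```bash
  # Sketch only: validate every quarter epoch and train on half of the batches
  # per epoch in SFT. The script path and the trainer.sft.* nesting are
  # illustrative assumptions; only the two option names come from the changelog.
  python examples/nlp/gpt/train_gpt_sft.py \
      trainer.sft.val_check_interval=0.25 \
      trainer.sft.limit_train_batches=0.5
  ```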

### Breaking Changes

### Bug Fixes
- Fixed issue where random sampler keeps state when resetting for validation, leading to a different validation batch each validation step. Fixed by using a deterministic sampler
- Fixed crash with float val check interval in DPOTrainer
- Fixed crash with float val check interval when checking progress in DPOTrainer
- Fixed potential crash in SPIN when prompts are longer than encoder_seq_len - generation.max_length
- Fixed crash when calling the `generate()` method of an SFT model with pipeline parallelism greater than two
- Fixed crash when calling the `generate()` method of an SFT model with `compute_logprob=True` and string inputs
- Fixed crash when `model.micro_batch_size` > 1 in DPO
- Fixed issue where the random sampler keeps its state during validation resets, resulting in varying validation batches at each step. This was addressed by switching to a deterministic sampler.
- Fixed crash with float val check interval in DPOTrainer.
- Fixed crash with float val check interval when checking progress in DPOTrainer.
- Fixed potential crash in SPIN when prompts are longer than encoder_seq_len - generation.max_length.
- Fixed crash when calling the `generate()` method of an SFT model with pipeline parallelism greater than two.
- Fixed crash when calling the `generate()` method of an SFT model with `compute_logprob=True` and string inputs.
- Fixed crash when `model.micro_batch_size` > 1 in DPO.
- Fixed issue when `model.encoder_seq_length` is mismatched with `model.data.train_ds.max_seq_length` in SFT and SPIN.
- Delete MegatronPretrainingRandomSampler from Aligner since it has been upstreamed into NeMo
- Fixed SPIN not correctly using its `val_check_interval` parameter
- Delete MegatronPretrainingRandomSampler from NeMo-Aligner since it has been upstreamed into NeMo.
- Fixed SPIN not correctly using its `val_check_interval` parameter.

## [0.3.0] - 2024-05

### New features and optimizations
### New Features and Optimizations
- Special TRT-LLM release. See [Accelerated-RLHF](https://github.com/NVIDIA/NeMo-Aligner/blob/v0.3.0.trtllm/Accelerated-RLHF.md) and [Accelerated-RLHF-Release](https://github.com/NVIDIA/NeMo-Aligner/releases/tag/v0.3.0.trtllm) for more details.

## [0.2.0] - 2024-02
### New features and optimizations
### New Features and Optimizations
- Added public-facing official Dockerfile for NeMo-Aligner.
- PPO: memory optimization to help avoid OOM in the actor when sending training data to the critic.
- PPO: it is now possible to use a custom end string in `sampling_params.end_strings` that is different from `<extra_id_1>`.
- SFT: added support for custom validation metrics based on model generations.
- Added the ability to do multi-epoch (cfg.max_epochs > 1) training for reward models, DPO, PPO, and SFT
- Added the SPIN (Self-Play Fine Tuning) algorithm (https://arxiv.org/abs/2401.01335) which allows SPIN SFT training using SFT-format dataset files
- SFT/SteerLM: added LoRA tuning as an option besides full fine-tuning, only attention_qkv layer is supported
- Added the ability to do multi-epoch (cfg.max_epochs > 1) training for reward models, DPO, PPO, and SFT.
- Added the SPIN (Self-Play Fine Tuning) algorithm (https://arxiv.org/abs/2401.01335) which allows SPIN SFT training using SFT-format dataset files.
- SFT/SteerLM: added LoRA tuning as an option besides full fine-tuning; only the attention_qkv layer is supported.
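
  A rough sketch of the multi-epoch and custom end-string options above; the full config paths are assumptions, and only `max_epochs` and `sampling_params.end_strings` are named in the entries:

  ```bash
  # Sketch only: run PPO for two epochs over the prompt dataset and stop
  # generation on "</s>" instead of the default "<extra_id_1>". The script path
  # and the exact nesting of max_epochs/sampling_params are illustrative.
  python examples/nlp/gpt/train_gpt_ppo_actor.py \
      trainer.ppo.max_epochs=2 \
      'model.ppo.sampling_params.end_strings=["</s>"]'
  ```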

### Breaking changes
- We have changed the shuffle logic in the data sampler to support multi-epoch training, so training runs using identical parameters
will not give the same results anymore because the shuffle logic has changed (specifically the seed value is modified slightly per epoch).
If you run CI/regression type tests, then be warned that the test may break due to this shuffle change.
### Breaking Changes
- We have changed the shuffle logic in the data sampler to support multi-epoch training, so training runs using identical parameters will no longer give the same results because the shuffle logic has changed (specifically, the seed value is modified slightly per epoch). If you run CI/regression-type tests, be warned that they may break due to this shuffle change.

### Bug Fixes
- Fixed a potential issue when the base model's `model.data.data_prefix` config is a list and is about to be overridden with
a dictionary from the training configuration.
- `exp_manager.max_time_per_run` is now respected, the trainers will save and run validation before exiting if we've reached the time limit.
- `exp_manager.max_time_per_run` is now respected. The trainers will save and run the validation before exiting if the time limit has been reached.
- Fixed crash in PPO when using a separate reward model server (i.e., with `combine_rm_and_critic_server=False`).
- Fixed crash when LR scheduler is not specified
- Fixed crash when LR scheduler is not specified.

## [0.1.0] - 2023-12-04
### Added
- First open source release
- First open source release.
41 changes: 31 additions & 10 deletions Dockerfile
@@ -1,15 +1,16 @@
# CUDA 12.3
FROM nvcr.io/nvidia/pytorch:24.02-py3
ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:24.03-py3

### config tags
ARG APEX_TAG=810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c
ARG TE_TAG=a51ff542dcb1f605aa54f9b0e1aaadb132acd53d
ARG MLM_TAG=core_r0.7.0
ARG NEMO_TAG=r2.0.0rc0
ARG PYTRITON_VERSION=0.5.5
ARG PROTOBUF_VERSION=4.24.4
FROM ${BASE_IMAGE}
ARG APEX_TAG=59b80ee8df79cec125794949327f29913c328746
ARG TE_TAG=7d576ed25266a17a7b651f2c12e8498f67e0baea
ARG MLM_TAG=a3fe0c75df82218901fa2c3a7c9e389aa5f53182 # On: core_r0.8.0
ARG NEMO_TAG=e033481e26e6ae32764d3e2b3f16afed00dc7218 # On: r2.0.0rc1
ARG ALIGNER_COMMIT=main

ARG PYTRITON_VERSION=0.5.10
ARG PROTOBUF_VERSION=4.24.4
ARG TRTLLM_VERSION=v0.10.0

# if you get errors building TE or Apex, decrease this to 4
ARG MAX_JOBS=8

@@ -77,4 +78,24 @@ RUN git clone https://github.com/NVIDIA/NeMo-Aligner.git && \
fi && \
pip install --no-deps -e .

WORKDIR /workspace
# Git LFS
RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash && \
apt-get install git-lfs && \
git lfs install

# TRTLLM
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git && \
cd TensorRT-LLM && \
git checkout ${TRTLLM_VERSION} && \
patch -p1 < ../NeMo-Aligner/setup/trtllm.patch && \
. docker/common/install_tensorrt.sh && \
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt

RUN cd TensorRT-LLM && \
pip install ./build/tensorrt_llm*.whl
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12/compat/lib.real/

# WAR(0.4.0): The pin of NeMo requires a higher nvidia-modelopt version than
# TRT-LLM allows. This installation must follow TRT-LLM and is
# only necessary when NeMo 2.0.0rc1 is installed with TRT-LLM v10.
RUN pip install --upgrade-strategy only-if-needed nvidia-modelopt==0.13.0
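
For reference, a hedged sketch of building this image and overriding the pinned build arguments; the image tag is an illustrative choice, and the `--build-arg` names are taken from the ARGs above:

```bash
# Sketch only: build the Aligner + TRT-LLM image from this Dockerfile,
# optionally overriding the base image and the Aligner commit to check out.
# The tag "nemo-aligner:trtllm" is an illustrative choice.
docker build -t nemo-aligner:trtllm \
    --build-arg BASE_IMAGE=nvcr.io/nvidia/pytorch:24.03-py3 \
    --build-arg ALIGNER_COMMIT=main \
    .
```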
37 changes: 18 additions & 19 deletions README.md
@@ -7,22 +7,22 @@

## Introduction

NeMo-Aligner is a scalable toolkit for efficient model alignment. The toolkit has support for state of the art model alignment algorithms such as SteerLM, DPO and Reinforcement Learning from Human Feedback (RLHF). These algorithms enable users to align language models to be more safe, harmless and helpful. Users can do end-to-end model alignment on a wide range of model sizes and take advantage of all the parallelism techniques to ensure their model alignment is done in a performant and resource efficient manner. For more technical details, please refer to our [paper](https://arxiv.org/abs/2405.01481).
NeMo-Aligner is a scalable toolkit for efficient model alignment. The toolkit has support for state-of-the-art model alignment algorithms such as SteerLM, DPO, and Reinforcement Learning from Human Feedback (RLHF). These algorithms enable users to align language models to be more safe, harmless, and helpful. Users can perform end-to-end model alignment on a wide range of model sizes and take advantage of all the parallelism techniques to ensure their model alignment is done in a performant and resource-efficient manner. For more technical details, please refer to our [paper](https://arxiv.org/abs/2405.01481).

NeMo-Aligner toolkit is built using the [NeMo Toolkit](https://github.com/NVIDIA/NeMo) which allows for scaling training up to 1000s of GPUs using tensor, data and pipeline parallelism for all components of alignment. All of our checkpoints are cross compatible with the NeMo ecosystem; allowing for inference deployment and further customization.
The NeMo-Aligner toolkit is built using the NeMo Framework, which enables scalable training across thousands of GPUs using tensor, data, and pipeline parallelism for all alignment components. Additionally, our checkpoints are cross-compatible with the NeMo ecosystem, facilitating inference deployment and further customization (https://github.com/NVIDIA/NeMo-Aligner).

The toolkit is currently in it's early stages, and we are committed to improving the toolkit to make it easier for developers to pick and choose different alignment algorithms to build safe, helpful and reliable models.
The toolkit is currently in its early stages. We are committed to improving the toolkit to make it easier for developers to pick and choose different alignment algorithms to build safe, helpful, and reliable models.

## Key features
## Key Features

* **SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF.**
* [Llama3-70B-SteerLM-Chat](https://huggingface.co/nvidia/Llama3-70B-SteerLM-Chat) aligned with NeMo Aligner.
* Corresponding reward model [Llama3-70B-SteerLM-RM](https://huggingface.co/nvidia/Llama3-70B-SteerLM-RM)
* **SteerLM: Attribute Conditioned SFT as a (User-Steerable) alternative to RLHF**
* [Llama3-70B-SteerLM-Chat](https://huggingface.co/nvidia/Llama3-70B-SteerLM-Chat) aligned with NeMo-Aligner.
* Corresponding reward model [Llama3-70B-SteerLM-RM](https://huggingface.co/nvidia/Llama3-70B-SteerLM-RM).
* Learn more at our [SteerLM](https://arxiv.org/abs/2310.05344) and [HelpSteer2](https://arxiv.org/abs/2406.08673) papers.
* **Supervised Fine Tuning**
* **Reward Model Training**
* **Reinforcement Learning from Human Feedback using the [PPO](https://arxiv.org/pdf/1707.06347.pdf) Algorithm**
* [Llama3-70B-PPO-Chat](https://huggingface.co/nvidia/Llama3-70B-PPO-Chat) aligned with NeMo Aligner.
* [Llama3-70B-PPO-Chat](https://huggingface.co/nvidia/Llama3-70B-PPO-Chat) aligned with NeMo-Aligner.
* **Direct Preference Optimization** as described in [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/pdf/2305.18290)
* [Llama3-70B-DPO-Chat](https://huggingface.co/nvidia/Llama3-70B-DPO-Chat) aligned with NeMo Aligner.
* **Self-Play Fine-Tuning (SPIN)** as described in [Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models](https://arxiv.org/pdf/2401.01335)
@@ -35,39 +35,38 @@ The toolkit is currently in it's early stages, and we are committed to improving

## Latest Release

For the latest stable release please see the [releases page](https://github.com/NVIDIA/NeMo-Aligner/releases). All releases come with a pre-built container. Changes within each release will be documented in [CHANGELOG](https://github.com/NVIDIA/NeMo-Aligner/blob/main/CHANGELOG.md).
For the latest stable release, please see the [releases page](https://github.com/NVIDIA/NeMo-Aligner/releases). All releases come with a pre-built container. Changes within each release will be documented in [CHANGELOG](https://github.com/NVIDIA/NeMo-Aligner/blob/main/CHANGELOG.md).

## Installing your own environment
## Install Your Own Environment

### Requirements
NeMo-Aligner has the same requirements as the [NeMo Toolkit Requirements](https://github.com/NVIDIA/NeMo#requirements) with the addition of [PyTriton](https://github.com/triton-inference-server/pytriton).

### Installation
Please follow the same steps as the [NeMo Toolkit Installation Guide](https://github.com/NVIDIA/NeMo#installation) but run the following after installing NeMo
### Install NeMo-Aligner
Please follow the same steps as outlined in the [NeMo Toolkit Installation Guide](https://github.com/NVIDIA/NeMo#installation). After installing NeMo, execute the following additional command:
```bash
pip install nemo-aligner
```
or if you prefer to install the latest commit
Alternatively, if you prefer to install the latest commit:
```bash
pip install .
```

### Docker Containers

We provide an official NeMo-Aligner Dockerfile which is based on stable, tested versions of NeMo, Megatron-LM, and TransformerEngine. The goal of this Dockerfile
is stability, so it may not track the very latest versions of those 3 packages. You can access our Dockerfile [here](https://github.com/NVIDIA/NeMo-Aligner/blob/main/Dockerfile)
We provide an official NeMo-Aligner Dockerfile which is based on stable, tested versions of NeMo, Megatron-LM, and TransformerEngine. The primary objective of this Dockerfile is to ensure stability, although it might not always reflect the very latest versions of those three packages. You can access our Dockerfile [here](https://github.com/NVIDIA/NeMo-Aligner/blob/main/Dockerfile).

Alternatively, you can build the NeMo Dockerfile here [NeMo Dockerfile](https://github.com/NVIDIA/NeMo/blob/main/Dockerfile) and add `RUN pip install nemo-aligner` at the end.
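
A hedged sketch of that alternative — appending the install line to a local copy of the NeMo Dockerfile and building it (the image tag is illustrative):

```bash
# Sketch only: extend a local copy of the upstream NeMo Dockerfile with
# NeMo-Aligner and build it. The tag "nemo-with-aligner" is illustrative.
echo 'RUN pip install nemo-aligner' >> Dockerfile
docker build -t nemo-with-aligner .
```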

## Future work
- Add Rejection Sampling support
- Add Rejection Sampling support.
- We will continue improving the stability of the PPO learning phase.
- Improve the performance of RLHF
- Improve the performance of RLHF.

## Contributing
## Contribute to NeMo-Aligner
We welcome community contributions! Please refer to [CONTRIBUTING.md](https://github.com/NVIDIA/NeMo-Aligner/blob/main/CONTRIBUTING.md) for guidelines.

## Citing NeMo-Aligner
## Cite NeMo-Aligner in Your Work
```
@misc{shen2024nemoaligner,
title={NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment},