diff --git a/docs/source/nlp/distillation.rst b/docs/source/nlp/distillation.rst
deleted file mode 100644
index 22b2f3dd8a1c..000000000000
--- a/docs/source/nlp/distillation.rst
+++ /dev/null
@@ -1,58 +0,0 @@
-.. _megatron_distillation:
-
-Distillation
-==========================
-
-Knowledge Distillation (KD)
---------------------------------
-
-KD involves using information from an existing trained model to train a second (usually smaller, faster) model, thereby "distilling" knowledge from one to the other.
-
-Distillation has two primary benefits: faster convergence and higher end accuracy than traditional training.
-
-In NeMo, distillation is enabled by the `NVIDIA TensorRT Model Optimizer (ModelOpt) `_ library -- a library for optimizing deep-learning models for inference on GPUs.
-
-The logit-distillation process consists of the following steps:
-
-1. Loading both the student and teacher model checkpoints (both must use the same parallelism strategy, if any).
-2. Training until convergence: forward passes are run on both models (the backward pass only on the student), with a distillation loss computed between the two models' logits (see the sketch below).
-3. Saving the final student model.
-
-
-Example
-^^^^^^^
-The example below shows how to run the distillation script for Llama models.
-
-The script must be launched with the number of processes equal to the tensor-parallel size. This is achieved with the ``torchrun`` command below:
-
-.. code-block:: bash
-
-   STUDENT_CKPT="path/to/student.nemo"  # can also be None (will use default architecture found in examples/nlp/language_modeling/conf/megatron_llama_distill.yaml)
-   TEACHER_CKPT="path/to/teacher.nemo"
-   TOKENIZER="path/to/tokenizer.model"
-   DATA_PATHS="[1.0,path/to/tokenized/data]"
-   FINAL_SAVE_FILE="final_checkpoint.nemo"
-   TP=4
-
-   NPROC=$TP
-   launch_config="torchrun --nproc_per_node=$NPROC"
-
-   ${launch_config} examples/nlp/language_modeling/megatron_gpt_distillation.py \
-       model.restore_from_path=$STUDENT_CKPT \
-       model.kd_teacher_restore_from_path=$TEACHER_CKPT \
-       model.tensor_model_parallel_size=$TP \
-       model.tokenizer.model=$TOKENIZER \
-       model.data.data_prefix=$DATA_PATHS \
-       model.nemo_path=$FINAL_SAVE_FILE \
-       trainer.precision=bf16 \
-       trainer.devices=$NPROC
-
-For large models, the command can also be run in a multi-node setting, for example with the `NeMo Framework Launcher `_ using Slurm.
-
-
-Limitations
-^^^^^^^^^^^
-* Only Megatron Core-based GPT models are supported
-* Only logit-pair distillation is supported for now
-* Pipeline parallelism is not yet supported
-* The FSDP strategy is not yet supported
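The loss referred to in step 2 above is computed between the student's and teacher's output logits. The following is a minimal sketch of that idea, assuming plain PyTorch with generic ``student``/``teacher`` modules and a hypothetical ``temperature`` parameter; it illustrates the general logit-pair technique rather than the ModelOpt implementation invoked by ``megatron_gpt_distillation.py``.

.. code-block:: python

   import torch
   import torch.nn.functional as F

   def logit_distillation_loss(student_logits: torch.Tensor,
                               teacher_logits: torch.Tensor,
                               temperature: float = 1.0) -> torch.Tensor:
       """KL divergence between softened teacher and student token distributions."""
       # The teacher is a fixed target: detach it so no gradient flows into it.
       log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
       p_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
       # ``batchmean`` sums the pointwise KL terms and divides by the leading batch
       # dimension; the temperature**2 factor keeps gradient magnitudes comparable
       # across temperature settings.
       return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

   # Per step 2: forward on both models, backward only on the student.
   # with torch.no_grad():
   #     teacher_logits = teacher(tokens)
   # loss = logit_distillation_loss(student(tokens), teacher_logits, temperature=2.0)
   # loss.backward()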
diff --git a/docs/source/nlp/nemo_megatron/model_distillation/drop_layers.rst b/docs/source/nlp/nemo_megatron/model_distillation/drop_layers.rst
deleted file mode 100644
index 3dc008945cc9..000000000000
--- a/docs/source/nlp/nemo_megatron/model_distillation/drop_layers.rst
+++ /dev/null
@@ -1,67 +0,0 @@
-.. _drop_layers:
-
-Drop Model Layers
------------------
-
-To trim the model layers, use the following script:
-
-.. code-block:: bash
-
-   python -m torch.distributed.launch --nproc_per_node=<tensor_model_parallel_size * pipeline_model_parallel_size> \
-       /NeMo/examples/nlp/language_modeling/megatron_gpt_drop_layers.py \
-       --path_to_nemo /path/to/model.nemo \
-       --path_to_save /path/to/save/trimmed_model.nemo \
-       --tensor_model_parallel_size <tensor_model_parallel_size> \
-       --pipeline_model_parallel_size <pipeline_model_parallel_size> \
-       --gpus_per_node <gpus_per_node> \
-       --drop_layers 1 2 3 4
-
-**Note:** Layer indices start from 1.
-
-To save the trimmed model in the ``zarr`` checkpoint format, add the following flag to the command above:
-
-.. code-block:: bash
-
-   --zarr
-
-**Note:** The ``zarr`` checkpoint format is deprecated.
-
-Validate Trimmed Model
-----------------------
-
-To validate the trimmed model, use the following script:
-
-.. code-block:: bash
-
-   python /NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
-       --config-path=/path/to/folder/with/model/config \
-       --config-name=model_config.yaml \
-       trainer.limit_val_batches=<limit_val_batches> \
-       model.restore_from_path=/path/to/trimmed_model.nemo \
-       model.skip_train=True \
-       model.data.data_impl=mock \
-       model.data.data_prefix=[]
-
-To use a specific dataset instead of a mock dataset, modify the ``model.data`` parameters as follows:
-
-.. code-block:: bash
-
-   model.data.data_impl=mmap \
-   model.data.data_prefix=["path/to/datafile1", "path/to/datafile2"]
-
-Validate Original Model
------------------------
-
-To validate the original model with specific layers dropped, use the following script:
-
-.. code-block:: bash
-
-   python /NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
-       --config-path=/path/to/folder/with/model/config \
-       --config-name=model_config.yaml \
-       trainer.limit_val_batches=<limit_val_batches> \
-       model.restore_from_path=/path/to/original_model.nemo \
-       model.skip_train=True \
-       model.data.data_impl=mock \
-       model.data.data_prefix=[] \
-       model.drop_layers=[1,2,3,4]
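For intuition on what the trimming step does, the sketch below removes the selected 1-based layer indices from a generic PyTorch transformer whose blocks are assumed to live in an ``nn.ModuleList`` attribute named ``layers``. This is an illustrative assumption, not the logic of ``megatron_gpt_drop_layers.py``, which operates on a ``.nemo`` checkpoint with tensor/pipeline-parallel sharding.

.. code-block:: python

   import torch.nn as nn

   def drop_model_layers(model: nn.Module, drop_layers: list[int]) -> None:
       """Remove the given 1-based layer indices from ``model.layers`` in place."""
       to_drop = set(drop_layers)
       kept = [layer for index, layer in enumerate(model.layers, start=1)
               if index not in to_drop]
       # The remaining layers are renumbered consecutively from 0.
       model.layers = nn.ModuleList(kept)

   # Mirrors ``--drop_layers 1 2 3 4``: removes the first four transformer layers.
   # drop_model_layers(model, [1, 2, 3, 4])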