diff --git a/docs/source/nlp/distillation.rst b/docs/source/nlp/distillation.rst
deleted file mode 100644
index 22b2f3dd8a1c..000000000000
--- a/docs/source/nlp/distillation.rst
+++ /dev/null
@@ -1,58 +0,0 @@
-.. _megatron_distillation:
-
-Distillation
-==========================
-
-Knowledge Distillation (KD)
---------------------------------
-
-KD involves using information from an existing trained model to train a second (usually smaller, faster) model, thereby "distilling" knowledge from one to the other.
-
-Distillation has two primary benefits: faster convergence and higher end accuracy than traditional training.
-
-In NeMo, distillation is enabled by the `NVIDIA TensorRT Model Optimizer (ModelOpt) `_ library -- a library for optimizing deep-learning models for inference on GPUs.
-
-The logit-distillation process consists of the following steps:
-
-1. Loading both the student and teacher model checkpoints (both must use the same parallelism strategy, if any).
-2. Training until convergence: forward passes are run on both models (the backward pass only on the student), with a distillation loss computed between the two models' logits (see the sketch below).
-3. Saving the final student model.
-
-
-Example
-^^^^^^^
-The example below shows how to run the distillation script for Llama models.
-
-The script must be launched with the number of processes equal to the tensor-parallel size. This is achieved with the ``torchrun`` command below:
-
-.. code-block:: bash
-
-   STUDENT_CKPT="path/to/student.nemo"  # can also be None (will use default architecture found in examples/nlp/language_modeling/conf/megatron_llama_distill.yaml)
-   TEACHER_CKPT="path/to/teacher.nemo"
-   TOKENIZER="path/to/tokenizer.model"
-   DATA_PATHS="[1.0,path/to/tokenized/data]"
-   FINAL_SAVE_FILE="final_checkpoint.nemo"
-   TP=4
-
-   NPROC=$TP
-   launch_config="torchrun --nproc_per_node=$NPROC"
-
-   ${launch_config} examples/nlp/language_modeling/megatron_gpt_distillation.py \
-       model.restore_from_path=$STUDENT_CKPT \
-       model.kd_teacher_restore_from_path=$TEACHER_CKPT \
-       model.tensor_model_parallel_size=$TP \
-       model.tokenizer.model=$TOKENIZER \
-       model.data.data_prefix=$DATA_PATHS \
-       model.nemo_path=$FINAL_SAVE_FILE \
-       trainer.precision=bf16 \
-       trainer.devices=$NPROC
-
-For large models, the command can also be run in a multi-node setting, for example with the `NeMo Framework Launcher `_ using Slurm.
-
-
-Limitations
-^^^^^^^^^^^
-* Only Megatron Core-based GPT models are supported
-* Only logit-pair distillation is supported for now
-* Pipeline parallelism is not yet supported
-* The FSDP strategy is not yet supported
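The loss referred to in step 2 above is computed between the student's and teacher's output logits. The following is a minimal sketch of that idea, assuming plain PyTorch with generic ``student``/``teacher`` modules and a hypothetical ``temperature`` parameter; it illustrates the general logit-pair technique rather than the ModelOpt implementation invoked by ``megatron_gpt_distillation.py``.

.. code-block:: python

   import torch
   import torch.nn.functional as F

   def logit_distillation_loss(student_logits: torch.Tensor,
                               teacher_logits: torch.Tensor,
                               temperature: float = 1.0) -> torch.Tensor:
       """KL divergence between softened teacher and student token distributions."""
       # The teacher is a fixed target: detach it so no gradient flows into it.
       log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
       p_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
       # ``batchmean`` sums the pointwise KL terms and divides by the leading batch
       # dimension; the temperature**2 factor keeps gradient magnitudes comparable
       # across temperature settings.
       return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

   # Per step 2: forward on both models, backward only on the student.
   # with torch.no_grad():
   #     teacher_logits = teacher(tokens)
   # loss = logit_distillation_loss(student(tokens), teacher_logits, temperature=2.0)
   # loss.backward()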
diff --git a/docs/source/nlp/nemo_megatron/model_distillation/drop_layers.rst b/docs/source/nlp/nemo_megatron/model_distillation/drop_layers.rst
deleted file mode 100644
index 3dc008945cc9..000000000000
--- a/docs/source/nlp/nemo_megatron/model_distillation/drop_layers.rst
+++ /dev/null
@@ -1,67 +0,0 @@
-.. _drop_layers:
-
-Drop Model Layers
------------------
-
-To trim the model layers, use the following script:
-
-.. code-block:: bash
-
-   python -m torch.distributed.launch --nproc_per_node=<tensor_model_parallel_size * pipeline_model_parallel_size> \
-       /NeMo/examples/nlp/language_modeling/megatron_gpt_drop_layers.py \
-       --path_to_nemo /path/to/model.nemo \
-       --path_to_save /path/to/save/trimmed_model.nemo \
-       --tensor_model_parallel_size <tensor_model_parallel_size> \
-       --pipeline_model_parallel_size <pipeline_model_parallel_size> \
-       --gpus_per_node <gpus_per_node> \
-       --drop_layers 1 2 3 4
-
-**Note:** Layer indices start from 1.
-
-To save the trimmed model in the ``zarr`` checkpoint format, add the following flag to the command above:
-
-.. code-block:: bash
-
-   --zarr
-
-**Note:** The ``zarr`` checkpoint format is deprecated.
-
-Validate Trimmed Model
-----------------------
-
-To validate the trimmed model, use the following script:
-
-.. code-block:: bash
-
-   python /NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
-       --config-path=/path/to/folder/with/model/config \
-       --config-name=model_config.yaml \
-       trainer.limit_val_batches=<limit_val_batches> \
-       model.restore_from_path=/path/to/trimmed_model.nemo \
-       model.skip_train=True \
-       model.data.data_impl=mock \
-       model.data.data_prefix=[]
-
-To use a specific dataset instead of a mock dataset, modify the ``model.data`` parameters as follows:
-
-.. code-block:: bash
-
-   model.data.data_impl=mmap \
-   model.data.data_prefix=["path/to/datafile1", "path/to/datafile2"]
-
-Validate Original Model
------------------------
-
-To validate the original model with specific layers dropped, use the following script:
-
-.. code-block:: bash
-
-   python /NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
-       --config-path=/path/to/folder/with/model/config \
-       --config-name=model_config.yaml \
-       trainer.limit_val_batches=<limit_val_batches> \
-       model.restore_from_path=/path/to/original_model.nemo \
-       model.skip_train=True \
-       model.data.data_impl=mock \
-       model.data.data_prefix=[] \
-       model.drop_layers=[1,2,3,4]
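For intuition on what the trimming step does, the sketch below removes the selected 1-based layer indices from a generic PyTorch transformer whose blocks are assumed to live in an ``nn.ModuleList`` attribute named ``layers``. This is an illustrative assumption, not the logic of ``megatron_gpt_drop_layers.py``, which operates on a ``.nemo`` checkpoint with tensor/pipeline-parallel sharding.

.. code-block:: python

   import torch.nn as nn

   def drop_model_layers(model: nn.Module, drop_layers: list[int]) -> None:
       """Remove the given 1-based layer indices from ``model.layers`` in place."""
       to_drop = set(drop_layers)
       kept = [layer for index, layer in enumerate(model.layers, start=1)
               if index not in to_drop]
       # The remaining layers are renumbered consecutively from 0.
       model.layers = nn.ModuleList(kept)

   # Mirrors ``--drop_layers 1 2 3 4``: removes the first four transformer layers.
   # drop_model_layers(model, [1, 2, 3, 4])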