Retro and InstructRetro

Retro (Borgeaud et al., 2022) is an autoregressive decoder-only language model (LM) pretrained with retrieval augmentation. Retro scales in practice to large-scale pretraining from scratch while retrieving from trillions of tokens. Pretraining with retrieval stores factual knowledge more efficiently than storing it implicitly in the network's parameters, which substantially reduces the number of model parameters while achieving lower perplexity than a standard GPT. Retro also makes it possible to update the knowledge stored in an LM (Wang et al., 2023a) by updating the retrieval database, without retraining the LM.

InstructRetro (Wang et al., 2023b) further scales Retro up to 48B parameters, making it the largest LLM pretrained with retrieval (as of December 2023). The resulting foundation model, Retro 48B, largely outperforms its GPT counterpart in perplexity. After instruction tuning, InstructRetro shows significant improvement over the instruction-tuned GPT on downstream tasks in the zero-shot setting. Specifically, InstructRetro improves on its GPT counterpart by an average of 7% across 8 short-form QA tasks, 10% across 4 challenging long-form QA tasks, and 16% across 3 summarization tasks. We also find that one can ablate the encoder from the InstructRetro architecture and use the InstructRetro decoder backbone directly as a GPT model while achieving comparable results.

This README provides an end-to-end tutorial to reproduce Retro and InstructRetro.

Checkpoints

We provide the pretrained checkpoints of Retro and InstructRetro, available for download through the links in the following table:

| Model | Size | Instruction Tuning | Download Link 1 | Download Link 2 | Download Link 3 |
|---|---|---|---|---|---|
| retro-8b-base-4k | 8b | No | Huggingface | NGC | Google Drive |
| retro-8b-instruct-4k | 8b | Yes | Huggingface | NGC | Google Drive |
| retro-48b-base-4k | 48b | No | Huggingface | NGC | Google Drive |
| retro-48b-instruct-4k | 48b | Yes | Huggingface | NGC | Google Drive |

End-to-end Reproduction Guide

In this README, we provide an end-to-end reproduction guide for InstructRetro, covering large-scale retrieval database construction, pretraining, perplexity evaluation, instruction tuning, and downstream task evaluation.

If you are interested in evaluation only, we have also open-sourced our checkpoints, so you can go directly to Step 5 to evaluate them on downstream tasks.

Step 0: Prepare the environment

We recommend using a Docker environment to run the code.

Docker image

We provide a Dockerfile at tools/retro/examples/Dockerfile for the reproduction. The Docker image is based on the NGC container nvcr.io/nvidia/pytorch:23.09-py3.

Install dependencies

Clone the Megatron repo:

git clone --branch InstructRetro https://github.com/NVIDIA/Megatron-LM.git

If Docker is not available, we recommend starting from a clean conda environment with the following runtime dependencies:

  • Python 3.10
  • NVIDIA CUDA® 12.2.1
  • NVIDIA cuBLAS 12.2.5.6
  • NVIDIA cuDNN 8.9.5
  • NVIDIA NCCL 2.18.5
  • PyTorch 2.1.0a0+32f93b1

Then install Retro-specific dependencies, including:

pip install -U faiss-gpu
pip install -U transformers
pip install -U sentencepiece
pip install -U h5py
pip install -U nltk
pip install -U einops
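After installing, a quick check that every package is importable can save a failed job later. The snippet below is a minimal sketch, not part of the repo: the helper name `missing_modules` is hypothetical, and the module list assumes the usual import names for the pip packages above (`faiss-gpu` is imported as `faiss`).

```python
import importlib.util

# Import names for the pip packages listed above; note that the
# `faiss-gpu` package is imported as `faiss`.
RETRO_MODULES = ["faiss", "transformers", "sentencepiece", "h5py", "nltk", "einops"]


def missing_modules(modules=RETRO_MODULES):
    """Return the modules from `modules` that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing Retro dependencies:", ", ".join(missing))
    else:
        print("All Retro dependencies are importable.")
```

The check only looks the modules up without importing them, so it does not need a GPU to run.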

Step 1: Build retrieval database

In this step, we build a large-scale retrieval database for InstructRetro through Faiss to retrieve from trillions of tokens, and preprocess (and save) the retrieval neighbors for the pretraining step.

Please refer to tools/retro/build_db.md for more details.

Step 2: Pretraining

Please strictly follow Step 1 to build the retrieval database before pretraining, so that the preprocessed retrieval neighbors match the pretraining corpus.

The pretraining step supports both pretraining from scratch and continued pretraining from a pretrained GPT model.

We provide a template script to pretrain an 843M Retro model from scratch. Prepare your own arguments and update the template in tools/retro/examples/pretrain_model.sh. Note that the data path must exactly match the one used in Step 1, so that the preprocessed retrieval neighbors match the pretraining corpus.

bash tools/retro/examples/pretrain_model.sh

After pretraining, the model checkpoints are saved in the directory given by --save, provided you set that argument in pretrain_model.sh.

To continue pretraining with retrieval from a pretrained GPT model, specify --load in pretrain_model.sh to load the pretrained GPT checkpoint. The GPT architecture, including hidden size, number of layers, and activation function, must be exactly the same as the one used for Retro. Also specify --no-load-optim --finetune so that the optimizer state is not loaded from the pretrained GPT model and continued pretraining with retrieval starts from a clean state. After the first job, pretraining with retrieval continues from your last checkpoint. In follow-up jobs, launch pretraining without --no-load-optim --finetune so that the optimizer state is correctly loaded from your last job.
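The difference between the first job and the follow-up jobs can be sketched as a small helper that picks the checkpoint-related flags. This is an illustration only: the function and its parameters are hypothetical, and it assumes that in follow-up jobs --load points at the directory the previous job saved to, which the text above implies but does not state.

```python
def continued_pretraining_flags(gpt_checkpoint, retro_checkpoint, is_first_job):
    """Return the checkpoint-related flags for one continued-pretraining job.

    gpt_checkpoint and retro_checkpoint are hypothetical path arguments.
    """
    if is_first_job:
        # Load GPT weights only: skip the optimizer state and start from a clean state.
        return ["--load", gpt_checkpoint, "--no-load-optim", "--finetune"]
    # Follow-up jobs resume from the last Retro checkpoint, optimizer state included.
    # Assumption: --load then points at the directory the previous job saved to.
    return ["--load", retro_checkpoint]


print(" ".join(continued_pretraining_flags("/path/to/gpt", "/path/to/retro", True)))
print(" ".join(continued_pretraining_flags("/path/to/gpt", "/path/to/retro", False)))
```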

Step 3: Perplexity evaluation

During pretraining, the model perplexity is automatically evaluated on the specified validation corpus every --eval-interval steps. The validation corpus must be exactly the same as the one used in Step 1, so that the preprocessed retrieval neighbors match the corpus.

To evaluate the perplexity of a pretrained model, add --skip-train in pretrain_model.sh to skip training and only evaluate the model specified by --load on the validation corpus. Then run the same command as in Step 2:

bash tools/retro/examples/pretrain_model.sh
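For reference, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows that relationship on made-up numbers; it assumes the loss is a per-token cross-entropy in nats, and the function name and example losses are hypothetical rather than taken from the training code.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), with NLL in nats."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Hypothetical per-token losses, for illustration only; their mean is 2.0.
example_losses = [2.0, 2.5, 1.5]
print(perplexity(example_losses))  # exp(2.0), about 7.39
```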

Step 4: Instruction tuning

In this step, we fine-tune the pretrained model on downstream tasks with instructions. We provide a template instruction tuning script to fine-tune the 843M Retro model.

We also provide an open-source blend of instruction tuning datasets, which is available for download here. The blend consists of the following open-source instruction tuning datasets:

Instruction Tuning Dataset Breakdown

| Dataset | Samples | Epochs | Sampling Prob |
|---|---|---|---|
| soda | 2560 | 0.005 | 0.020 |
| eli5 | 2561 | 0.055 | 0.020 |
| self_instruct_short | 1280 | 0.043 | 0.010 |
| self_instruct_long | 2560 | 0.333 | 0.020 |
| unnatural-instructions | 2560 | 0.024 | 0.020 |
| flan_cot | 1280 | 0.093 | 0.010 |
| dolly | 6400 | 0.938 | 0.050 |
| oasst-skip-noncode | 104558 | 1.839 | 0.817 |
| oasst-skip-code | 4243 | 1.839 | 0.033 |

Refer to the paper links above for more details about each instruction tuning dataset.
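The Sampling Prob column appears to be each dataset's share of the total sample count. That is an observation from the numbers in the table, not something stated in this README, and the sketch below simply recomputes it; the helper name `sampling_probs` is hypothetical.

```python
# Sample counts copied from the table above.
SAMPLES = {
    "soda": 2560,
    "eli5": 2561,
    "self_instruct_short": 1280,
    "self_instruct_long": 2560,
    "unnatural-instructions": 2560,
    "flan_cot": 1280,
    "dolly": 6400,
    "oasst-skip-noncode": 104558,
    "oasst-skip-code": 4243,
}


def sampling_probs(samples):
    """Normalize per-dataset sample counts into sampling probabilities."""
    total = sum(samples.values())
    return {name: count / total for name, count in samples.items()}


for name, prob in sampling_probs(SAMPLES).items():
    print(f"{name:24s} {prob:.3f}")
```

Rounded to three decimals, the recomputed values agree with the table.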

Note that the provided instruction tuning dataset consists entirely of open-source instruction tuning datasets. It differs slightly from what we used for InstructRetro, which includes private and proprietary datasets, so a 1-2% accuracy difference on downstream tasks may be expected.

Instruction tuning script

Download the blended instruction tuning dataset to your data home directory $DATA_HOME and update the template in tools/retro/sft/sft_retro_lm.sh.

An example command to run instruction tuning on 843M Retro is as follows:

# Arguments: [blend-dataset-name] [model-size] [batch-size] [lr] [checkpoints]
bash tools/retro/sft/sft_retro_lm.sh open_inst 843m 128 5e-6 <path/to/pretrained/retro>

The blend-dataset-name argument blends all the datasets within $DATA_HOME according to the weights and configurations specified in ${blend_dataset_name}.sh (open_inst.sh in the example above). The checkpoints are saved in the --save directory, for example <SFT_HOME>/checkpoints/applications/retro-sft_pp1_same_format_ctx1_843m_128_5e-6.
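The example path suggests that the checkpoint directory name ends with the model size, batch size, and learning rate passed to the script. The sketch below reproduces that one example; the pattern is inferred from this single path rather than from the script itself, and the function name is hypothetical.

```python
def sft_checkpoint_dir(sft_home, model_size, batch_size, lr):
    """Compose the SFT checkpoint directory for the given script arguments.

    The pattern is inferred from the single example path in this README.
    Arguments are passed as strings so that values such as "5e-6" keep
    the exact spelling used on the command line.
    """
    name = f"retro-sft_pp1_same_format_ctx1_{model_size}_{batch_size}_{lr}"
    return f"{sft_home}/checkpoints/applications/{name}"


print(sft_checkpoint_dir("<SFT_HOME>", "843m", "128", "5e-6"))
```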

Step 5: Downstream task evaluation

In this step, we demonstrate how to run InstructRetro for zero-shot evaluation on downstream question answering (QA) tasks. We provide pre-processed open-source evaluation datasets in a unified format across tasks. The evaluation datasets used in our paper are available to download through here. Please use the same retro workdir as in Steps 0-4 so that the preprocessed retrieval neighbors match the pretraining corpus. If you come directly to Step 5, an example retro workdir with args.json for 800M Retro is provided here. Note that the args in the JSON file can be overridden on the command line.

We present an example command to run Retro generation given the InstructRetro checkpoints and the Natural Questions (NQ) task. The example command is for the 843M InstructRetro obtained in Step 4. Please specify the directory for the NQ dataset and update the command accordingly for other checkpoints.

bash tools/retro/text_generation/retro_generate.sh nq 843m greedy test 0 20000 1000 5 pp1 <SFT_HOME>/checkpoints/applications/retro-sft_pp1_same_format_ctx1_843m_128_5e-6 2

The generated responses are saved in the corresponding checkpoint directory. For example, for the 843M InstructRetro, they are saved to <SFT_HOME>/checkpoints/applications/retro-sft_pp1_same_format_ctx1_843m_128_5e-6/retro-generate-nq_5_2_843m_test_greedy_0_20000_1000.txt.
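Comparing the example command with the example output file shows how the positional arguments reappear in the file name. The sketch below encodes that mapping as inferred from this one pair. The README does not name the numeric arguments, so they get neutral placeholder names here, and the function itself is hypothetical.

```python
def generation_output_name(args):
    """Build the output file name from the 11 positional arguments, in command order.

    The mapping is inferred from the single example in this README.
    arg5..arg11 are placeholder names for arguments the README does not describe.
    """
    task, model_size, sampling, split, arg5, arg6, arg7, arg8, _arg9, _ckpt, arg11 = args
    return (
        f"retro-generate-{task}_{arg8}_{arg11}_{model_size}_{split}_{sampling}"
        f"_{arg5}_{arg6}_{arg7}.txt"
    )


example_args = ["nq", "843m", "greedy", "test", "0", "20000", "1000", "5", "pp1",
                "<path/to/checkpoint>", "2"]
print(generation_output_name(example_args))
```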

To evaluate the F1 / Exact Match (EM) scores of the generated responses, we provide an example script to run the evaluation on the NQ dataset. Please specify the directory for the NQ dataset and update the command accordingly for other checkpoints and downstream tasks.

python3 tools/retro/text_generation/evaluate.py
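For readers unfamiliar with the metrics, the sketch below shows the common SQuAD-style definitions of Exact Match and token-level F1. It is an illustration of those definitions, not the code in evaluate.py, whose normalization and scoring details may differ; all function names here are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(f1_score("in Paris France", "Paris"))
```

When a question has several reference answers, a common convention is to take the maximum score over the references.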

Citations

See our papers for more details:

Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study.

Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao, Anima Anandkumar, Bryan Catanzaro. (EMNLP 2023)

InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining.

Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, Bryan Catanzaro.

Please cite the papers as follows if you use the data or code from this repo:

@inproceedings{wang2023shall,
    title   = {Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study},
    author  = {Boxin Wang and Wei Ping and Peng Xu and Lawrence McAfee and Zihan Liu and Mohammad Shoeybi and Yi Dong and Oleksii Kuchaiev and Bo Li and Chaowei Xiao and Anima Anandkumar and Bryan Catanzaro},
    booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing},
    year    = {2023}
}

@article{wang2023instructretro,
    title   = {InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining},
    author  = {Boxin Wang and Wei Ping and Lawrence McAfee and Peng Xu and Bo Li and Mohammad Shoeybi and Bryan Catanzaro},
    year    = {2023},
    journal = {arXiv preprint arXiv:2310.07713}
}