Training LLMs with QLoRA + FSDP. This is a demo script and still a work in progress. See the blog post (coming soon) for more information.
The following steps should work (tested on CUDA 11.7, 11.8, and 12.1):
- Clone https://github.com/AnswerDotAI/fsdp_qlora
- `pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 llama-recipes` as an easy way to get most dependencies
- `pip uninstall bitsandbytes`, since for now it must be installed from source to have the necessary changes
- Clone bitsandbytes (a recent version has our changes), or clone AnswerDotAI/bitsandbytes and switch to the `cuda_fix_quant_storage_dtype` branch: `git clone -b cuda_fix_quant_storage_dtype https://github.com/AnswerDotAI/bitsandbytes`
- In the bitsandbytes folder, run `make CUDA_VERSION=118` then `python setup.py install` (you may need `export BNB_CUDA_VERSION=118` and to set the CUDA path; adjust to your preferred version)
- `pip install fastcore wandb` as extra requirements above those in llama-recipes
- `huggingface-cli login` (to access Llama 2 7B)
- Back in the fsdp_qlora folder, run `python train.py` to test QLoRA training
Check out different combos of settings. For example, `python train.py --train_type lora` to do LoRA instead of the default QLoRA.
The precision options behave as follows.

Pure bf16: this will cast all the model parameters to torch.bfloat16 before training and won't use FSDP mixed precision. As a result, sharded and unsharded params will be stored in bf16, forward and backward passes will be done in bf16, and gradient reduction and updates will be done in bf16.
Pure fp32: this will cast all the model parameters to torch.float32 before training and won't use FSDP mixed precision. As a result, sharded and unsharded params will be stored in fp32, forward and backward passes will be done in fp32, and gradient reduction and updates will be done in fp32.
fp32 params with fp16 autocast: this will cast all the model parameters to torch.float32 before training and will use FSDP mixed precision with

`mp_policy = MixedPrecision(param_dtype=torch.float32, reduce_dtype=torch.float32, buffer_dtype=torch.float32)`

As a result, sharded and unsharded params will be stored in fp32. It will use autocast(torch.float16) for forward and backward passes, and autocast(torch.float16) for gradient reduction and updates.
fp32 params with bf16 autocast: this will cast all the model parameters to torch.float32 before training and will use FSDP mixed precision with

`mp_policy = MixedPrecision(param_dtype=torch.float32, reduce_dtype=torch.float32, buffer_dtype=torch.float32)`

As a result, sharded and unsharded params will be stored in fp32. It will use autocast(torch.bfloat16) for forward and backward passes, and autocast(torch.bfloat16) for gradient reduction and updates.
bf16 params with fp32 buffers: this will cast all the model parameters to torch.bfloat16 before training but will keep the buffers in torch.float32, and will use FSDP mixed precision with

`mp_policy = MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16, buffer_dtype=torch.float32)`

As a result, sharded and unsharded params will be stored in bf16. It will use autocast(torch.bfloat16) for forward and backward passes, and autocast(torch.bfloat16) for gradient reduction and updates. Buffers remain in fp32, and only autocast-eligible operations will be performed in bf16.
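As a minimal sketch of how such a policy object is built (assuming a recent PyTorch with `torch.distributed.fsdp`; the actual FSDP wrapping and sharding setup are omitted), the bf16-params/fp32-buffers variant above looks like:

```python
import torch
from torch.distributed.fsdp import MixedPrecision

# bf16 params and gradient reduction, fp32 buffers, matching the last policy above.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,   # sharded/unsharded params stored in bf16
    reduce_dtype=torch.bfloat16,  # gradient reduction done in bf16
    buffer_dtype=torch.float32,   # buffers kept in fp32
)

# The policy is then passed when wrapping the model, e.g.:
# model = FSDP(model, mixed_precision=mp_policy, ...)
print(mp_policy.param_dtype, mp_policy.buffer_dtype)
```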
hf_train.py uses TRL's SFTTrainer for a comparison run. To match with our script, modify the dataloading code to train on everything (not just completions) and then run `train.py --train_type qlora --dataset guanaco --batch_size 8 --lr_scheduler cosine --log_to wandb --save_model True --output_dir guanaco_7B --gradient_accumulation_steps 2 --lr 2e-4`. The SFTTrainer version has to run with a lower batch size (4 vs 8), so it uses 4 gradient accumulation steps vs the 2 in the QLoRA+FSDP version to keep the effective batch size the same.
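The batch-size matching can be checked with a quick calculation (a plain helper written for illustration, not part of the repo):

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, world_size: int = 1) -> int:
    """Global number of samples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * world_size

# QLoRA+FSDP run above: batch size 8 with 2 gradient accumulation steps.
qlora = effective_batch_size(8, 2)
# SFTTrainer run: batch size 4, so 4 accumulation steps are needed to match.
sft = effective_batch_size(4, 4)
assert qlora == sft == 16
```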
If you specify `--save_model True`, the adapter layers will be saved as a state dict. To convert to the regular Hugging Face format and upload to the hub, see `Converting the State Dict.ipynb`.
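Roughly, saving only the adapter layers amounts to filtering the state dict down to the trainable LoRA parameters before writing it out. A toy sketch (the model and key names here are illustrative stand-ins, not the script's actual layout):

```python
import os
import tempfile
import torch
import torch.nn as nn

# Toy stand-in for the model: a frozen "base" layer plus a trainable LoRA-style layer.
model = nn.ModuleDict({
    "base": nn.Linear(4, 4),
    "lora_A": nn.Linear(4, 2, bias=False),
})
model["base"].requires_grad_(False)

# Keep only the adapter weights, roughly what --save_model True stores for LoRA layers.
adapter_sd = {k: v for k, v in model.state_dict().items() if k.startswith("lora_")}

path = os.path.join(tempfile.mkdtemp(), "adapter_state_dict.pt")
torch.save(adapter_sd, path)

# Later: reload the state dict before converting/merging into the Hugging Face format.
restored = torch.load(path)
print(sorted(restored))  # ['lora_A.weight']
```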