Does this work for Llama2 - Fine-tune Falcon 180B with DeepSpeed ZeRO, LoRA & Flash Attention? #37

Open
ibicdev opened this issue Sep 21, 2023 · 11 comments

Comments

@ibicdev commented Sep 21, 2023

Thanks Phil for the great post "Fine-tune Falcon 180B with DeepSpeed ZeRO, LoRA & Flash Attention". When I tried to change Falcon to Llama 2 (I tried all three model sizes), I always got "CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)". Are more changes than just the model name needed to make it work? Or will you have a follow-up post about fine-tuning Llama 2 with DeepSpeed + LoRA?

@philschmid (Owner)

Seems to be a hardware or environment issue unrelated to the code. I used CUDA 11.8.

@ibicdev (Author) commented Sep 22, 2023

I am also using CUDA 11.8 and PyTorch 2.0.1 built for CUDA 11.8. I also tried the PyTorch nightly and got the same error. Passing --use_flash_attn False didn't make a difference either. The error is RuntimeError: CUDA error: device-side assert triggered, followed by about a hundred lines of

../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [313,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.

This error looks similar to lm-sys/FastChat#199; I tried their suggestions and none worked. One explanation on that thread is a vocab-size mismatch causing an out-of-bounds embedding lookup, though the vocab size appears to already be fixed in Llama 2.
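One way to rule that out is to compare the tokenizer's vocab against the model's embedding table size (a minimal sketch, assuming access to the meta-llama checkpoint):

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

# A token id >= the embedding table size would trigger the
# indexSelectLargeIndex device-side assert during the embedding lookup,
# so these two numbers should match (32000 for Llama 2).
print("tokenizer vocab size:", len(tokenizer))
print("model vocab size:    ", config.vocab_size)
```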

@philschmid (Owner)

Does the example work without code changes?

@ibicdev (Author) commented Sep 23, 2023

Yes, it worked well without any code changes.

@philschmid (Owner)

What change did you make?

@ibicdev (Author) commented Sep 25, 2023

The only change I made was --model_id, from tiiuae/falcon-180B to meta-llama/Llama-2-70b-hf. The full command is:

```bash
torchrun --nproc_per_node 8 run_ds_lora.py \
  --model_id meta-llama/Llama-2-70b-hf \
  --dataset_path dolly-processed \
  --output_dir falcon-180b-lora-fa \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --learning_rate 4e-3 \
  --gradient_checkpointing True \
  --gradient_accumulation_steps 8 \
  --bf16 True \
  --tf32 True \
  --use_flash_attn True \
  --lr_scheduler_type "constant_with_warmup" \
  --logging_steps 25 \
  --save_steps 100 \
  --save_total_limit 3 \
  --deepspeed configs/ds_falcon_180b_z3.json
```

@philschmid (Owner)

Did you make changes to the flash attention patch? The example only works with Falcon, since it uses a custom patch to enable flash attention.
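Conceptually, the patch replaces Falcon's attention forward, so it would need a guard before being applied to another architecture (a rough sketch, not the actual repo code; patch_falcon_flash_attention is a placeholder name):

```python
from transformers import AutoConfig

def apply_flash_attn_patch_if_supported(model_id: str) -> None:
    # Sketch only: the patch targets Falcon's attention module, so it
    # must not be applied to other architectures such as Llama.
    config = AutoConfig.from_pretrained(model_id)
    if config.model_type != "falcon":
        raise ValueError(
            f"The flash attention patch only supports Falcon, got '{config.model_type}'."
        )
    patch_falcon_flash_attention()  # placeholder for the repo's actual patch function
```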

@ibicdev (Author) commented Sep 25, 2023

Ah, I didn't. I saw your code at https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/utils/peft_utils.py#L38-L41 and thought it was already taken care of.

Also, even when I used --use_flash_attn False I still got the same error.

@ibicdev (Author) commented Sep 26, 2023

Excited to see flash-attn 2 natively supported in transformers! Do you plan to update this post to work with the new feature?
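For reference, with the native support, loading the model should look roughly like this (a sketch against the transformers main branch at the time of writing; the exact flag may change by the official release):

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch only: with native flash-attn 2 support, no monkey patch is needed.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    torch_dtype=torch.bfloat16,
    use_flash_attention_2=True,
)
```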

@philschmid (Owner)

Yes! 👍🏻 I plan to update all my posts and remove those patches once there is an official release.

@ibicdev (Author) commented Sep 26, 2023

Great! Looking forward to the updates.
