
provide the docker file #7

Closed
merlintang opened this issue Aug 28, 2023 · 22 comments

@merlintang
Contributor

No description provided.

@mikecovlee
Member

I found that the runners provided by GitHub Actions do not have GPU capacity. A self-hosted runner can solve this problem if we need continuous integration.

@merlintang
Contributor Author

Can @LianxinGao look at this and provide the self-hosted runner in our dev environment?

@LianxinGao
Contributor

> Can @LianxinGao look at this and provide the self-hosted runner in our dev environment?

@merlintang The gpu_runner has been launched on the gpu01 machine. @mikecovlee There are 4 GPUs, numbered 0 to 3 (4090: 0, 2, 3; 3090: 1).

@mikecovlee
Member

The main entry point mlora.py has been finished on the main branch; please test it with llama-7b and the demo dataset.

python mlora.py --base_model <path to llama-7b> --config ./config/finetune.json --load_8bit true

@LianxinGao
Contributor


> python mlora.py --base_model <path to llama-7b> --config ./config/finetune.json --load_8bit true

@mikecovlee Where do I configure the GPU device? I could not find it in finetune.json.

@mikecovlee
Member

Use the --device argument. The default is cuda:0; ASPEN can utilize only one GPU.
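
For picking which index to pass via --device on gpu01 (where the 4090s and the 3090 sit in one box), a quick check of what CUDA actually sees may help. This is only a small sketch using plain PyTorch, separate from mlora.py:

```python
import torch

# List the CUDA devices visible to this process, so the right index
# can be passed to mlora.py, e.g. --device cuda:2.
if not torch.cuda.is_available():
    raise SystemExit("CUDA is not available in this environment")

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    mem_gib = props.total_memory / 1024 ** 3
    print(f"cuda:{i}  {props.name}  {mem_gib:.1f} GiB")
```

Keep in mind that CUDA_VISIBLE_DEVICES remaps indices, so cuda:0 inside the process is not necessarily physical GPU 0.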

@LianxinGao
Contributor

> The main entry point mlora.py has been finished on the main branch; please test it with llama-7b and the demo dataset.

> python mlora.py --base_model <path to llama-7b> --config ./config/finetune.json --load_8bit true

@mikecovlee I got an error:

[2023-08-30 14:46:10] ASPEN: NVIDIA CUDA initialized successfully.
[2023-08-30 14:46:10] ASPEN: Total 3 GPU(s) detected.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.57s/it]
to load text data from file.
load text data from file done.
to encode text data to tokens
encode text data: 0/2
encode text data: 0/2
encode text data to tokens done.
lora_0 train data:
    epoch: 1 / 3
    step : 0 / 2
lora_1 train data:
    epoch: 1 / 3
    step : 0 / 2
batch data size: 32 * 4
/opt/conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Traceback (most recent call last):
  File "/data/glx/code/multi_lora/mlora.py", line 157, in <module>
    train(config, model)
  File "/data/glx/code/multi_lora/mlora.py", line 127, in train
    :].contiguous().view(-1, llama_model.vocab_size_)
RuntimeError: shape '[-1, 32000]' is invalid for input of size 1984062

@LianxinGao
Contributor

@merlintang done #18

@yezhengmao1
Collaborator

> [error log quoted above]
> RuntimeError: shape '[-1, 32000]' is invalid for input of size 1984062

Can you check the vicuna-7b vocab size?
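
A quick sanity check on the numbers: 1984062 = 62 × 32001, so the logits appear to carry a vocab dimension of 32001 while the reshape assumes 32000, which would match Vicuna v1.1 adding an extra pad token. A minimal sketch to confirm, assuming a local HF-format checkpoint and the transformers library (the path is a placeholder):

```python
from transformers import AutoConfig, AutoTokenizer

# Compare the checkpoint's declared vocab size with the tokenizer's vocab.
path = "<path to vicuna-7b>"  # placeholder for the local checkpoint directory

config = AutoConfig.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)

print("config.vocab_size :", config.vocab_size)  # llama-7b reports 32000
print("tokenizer vocab   :", len(tokenizer))     # vicuna v1.1 is expected to report 32001
```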

@yezhengmao1
Collaborator

@mikecovlee vicuna-7b and llama-7b have different lm_head / embedding sizes; we need to adapt to that.
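
For reference, in plain transformers the usual way to reconcile a tokenizer/embedding size mismatch is to resize the token embeddings; this is only an illustration of that API (placeholder path), not a claim about how ASPEN/mlora should implement the fix:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "<path to vicuna-7b>"  # placeholder for the local checkpoint directory
model = AutoModelForCausalLM.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)

# Grow (or shrink) the embedding matrix and lm_head to match the tokenizer,
# so reshapes against vocab_size line up with the logits actually produced.
if model.get_input_embeddings().weight.shape[0] != len(tokenizer):
    model.resize_token_embeddings(len(tokenizer))
```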

@mikecovlee
Member

My tests pass on both vicuna-7b and llama-7b-tf on the current main branch.

@LianxinGao
Contributor

@mikecovlee Which version of vicuna-7b are you using?

@mikecovlee
Member

@LianxinGao vicuna-7b-delta-v1.1

@LianxinGao
Contributor

> @LianxinGao vicuna-7b-delta-v1.1

I'll change the version of vicuna in CI and retest it.

@mikecovlee
Member

You can commit directly to the mikecovlee_dev branch, which is on a draft pull request fixing the CI errors. @LianxinGao

@mikecovlee
Member

Now I have split the CI checks on GPU into two separate jobs. The tests on LLaMA-7B pass while they fail on Vicuna-7B. #21

@mikecovlee
Member

Please check the local llama-7b-hf model; its config file lacks the max_sequence_length field (a quick way to check is sketched below).

Referring to the CI runs, the local machine tests pass.

@LianxinGao
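
To quickly see whether a local checkpoint has the field, here is a small sketch that just reads config.json (the path is a placeholder; falling back to max_position_embeddings is my assumption about HF LLaMA configs, not ASPEN behaviour):

```python
import json
from pathlib import Path

# Inspect a local checkpoint's config.json for the field the loader expects.
config_path = Path("<path to llama-7b-hf>") / "config.json"  # placeholder
config = json.loads(config_path.read_text())

if "max_sequence_length" in config:
    print("max_sequence_length:", config["max_sequence_length"])
else:
    # HF LLaMA configs usually carry max_position_embeddings instead.
    print("max_sequence_length missing; max_position_embeddings =",
          config.get("max_position_embeddings"))
```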

@mikecovlee
Member

Btw, for later commits please create a new branch rather than using mikecovlee_dev.

@LianxinGao
Contributor

> Please check the local llama-7b-hf model; its config file lacks the max_sequence_length field.
>
> Referring to the CI runs, the local machine tests pass.
>
> @LianxinGao

The models on the machine seemed buggy 😭, now all fixed.

@yezhengmao1
Collaborator

I also found this; I will fix it.

@merlintang
Contributor Author

@LianxinGao Can you send a PR with a Dockerfile?

@LianxinGao
Contributor

> @LianxinGao Can you send a PR with a Dockerfile?

OK, I'll do it.
