
[Question]: Abnormal memory usage during tensor-parallel inference? #8656

Closed
zhaogf01 opened this issue Jun 25, 2024 · 6 comments
Assignees
Labels
question Further information is requested

Comments

@zhaogf01
Contributor

Please describe your question

When running inference with the qwen-1_8b model:
With 2-way tensor parallelism enabled, memory usage while loading the weights is about 10 GB.
With 4-way tensor parallelism enabled, memory usage while loading the weights is about 17 GB.
The difference between the two is exactly twice the size of the weight file.
So my question is: with 2-way tensor parallelism, does PaddleNLP first make 2 copies of the weights in host memory and only then split the tensors? If so, would 16-way tensor parallelism require 16 copies? For a hundred-billion-parameter model, the memory footprint would be far larger. Is this reasonable?
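
A quick back-of-the-envelope check of those numbers (editor's sketch; the parameter count and the assumption of a 16-bit on-disk checkpoint are not stated in the thread):

params = 1.8e9             # qwen-1_8b has roughly 1.8 billion parameters (assumption)
bytes_per_param = 2        # assuming the pdparams checkpoint is stored in a 16-bit dtype
weight_file_gb = params * bytes_per_param / 1e9
print(weight_file_gb)      # ~3.6 GB for one full copy of the weights
print(2 * weight_file_gb)  # ~7.2 GB, close to the observed 17 GB - 10 GB gap

This is consistent with each extra rank holding its own full copy of the checkpoint in host memory while loading.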

@zhaogf01 zhaogf01 added the question Further information is requested label Jun 25, 2024
@DesmonDay
Contributor

No, that is not reasonable behavior; we have optimization code for this that has not been merged yet. Could you tell us how you are running it? Please share a script.

@zhaogf01
Contributor Author

No, that is not reasonable behavior; we have optimization code for this that has not been merged yet. Could you tell us how you are running it? Please share a script.

The scripts are as follows:
1. test.sh

export CUDA_VISIBLE_DEVICES=3,4
PYTHONPATH=../../:$PYTHONPATH \
python -m paddle.distributed.launch \
    --devices "3,4" \
    test_qwen.py

2. test_qwen.py

from paddle.distributed import fleet
from paddlenlp.transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("qwen/qwen-1_8b")

# Initialize the hybrid-parallel environment with 2-way model (tensor) parallelism.
strategy = fleet.DistributedStrategy()
strategy.hybrid_configs = {
    "dp_degree": 1,
    "mp_degree": 2,
    "pp_degree": 1,
    "sharding_degree": 1,
}
fleet.init(is_collective=True, strategy=strategy)
hcg = fleet.get_hybrid_communicate_group()
tensor_parallel_rank = hcg.get_model_parallel_rank()

# Load the model with tensor-parallel slicing for this rank.
model = AutoModelForCausalLM.from_pretrained(
    "qwen/qwen-1_8b",
    tensor_parallel_degree=2,
    tensor_parallel_rank=tensor_parallel_rank,
    dtype="float32",
)

input_features = tokenizer("青岛推荐去哪玩", return_tensors="pd")
outputs = model.generate(**input_features, max_length=128)
print(tokenizer.batch_decode(outputs[0]))

@DesmonDay
Contributor

Are the qwen model weights in safetensors format or in pdparams format? If they are in safetensors format, this problem should not occur.

@zhaogf01
Contributor Author

Are the qwen model weights in safetensors format or in pdparams format? If they are in safetensors format, this problem should not occur.

They are in pdparams format; the weights were downloaded automatically.
Is there a workaround for now? Or is there a conversion script from pdparams to safetensors?

@DesmonDay
Contributor

After loading the model with from_pretrained, you can call save_pretrained with safe_serialization=True to save the weights in safetensors format.
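
A minimal sketch of that suggestion (editor's note: the output directory name is hypothetical; from_pretrained and save_pretrained with safe_serialization=True are as described above):

from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

# Load the automatically downloaded pdparams checkpoint once, without tensor parallelism.
model = AutoModelForCausalLM.from_pretrained("qwen/qwen-1_8b", dtype="float32")
tokenizer = AutoTokenizer.from_pretrained("qwen/qwen-1_8b")

# Re-save in safetensors format; "./qwen-1_8b-safetensors" is a hypothetical local path.
model.save_pretrained("./qwen-1_8b-safetensors", safe_serialization=True)
tokenizer.save_pretrained("./qwen-1_8b-safetensors")

Subsequent runs can then point from_pretrained at that local directory so the safetensors weights are used.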

@zhaogf01
Contributor Author

zhaogf01 commented Jul 9, 2024

I want to run TP inference. When calling from_pretrained, do I need to set the corresponding TP configuration first, and then call save_pretrained to save?
