
Conversation

@zeroRains
Contributor

Environment:

  • Device: 2 * Tesla V100
  • CUDA: 11.8
  • paddlepaddle-gpu == 3.0.0
  • pytorch == 2.5.1

I modified the distributed loading command and wrote two .sh files, run_paddle_parallel_cpu.sh and run_paddle_parallel_gpu.sh. They use the standard distributed launching command in Paddle.

I also added a unit test for distributed tensor loading with Paddle.
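
For reference, a minimal sketch of the kind of entry point such a launcher runs (the script name and GPU list are placeholders, not the actual files added in this PR):

```python
# Minimal sketch of a Paddle distributed entry point (hypothetical name:
# test_paddle_parallel.py), launched with the standard Paddle launcher, e.g.:
#   python -m paddle.distributed.launch --gpus "0,1" test_paddle_parallel.py
import paddle
import paddle.distributed as dist

dist.init_parallel_env()                    # join the process group set up by the launcher
rank = dist.get_rank()
n_gpus = paddle.device.cuda.device_count()
paddle.set_device(f"gpu:{rank % n_gpus}")   # one GPU per rank
print(f"rank {rank}/{dist.get_world_size()} on {paddle.get_device()}")
```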

@zeroRains
Contributor Author

Hello, I need to confirm: does parallel loading mean we use multiple processes to load all tensors onto one GPU, or that we use multiple processes to load tensors onto different GPUs and then broadcast them?

@takeshi-yoshimura
Collaborator

@zeroRains
thank you for your contribution again! Let me check your change later.

I need to confirm: does parallel loading mean we use multiple processes to load all tensors onto one GPU, or that we use multiple processes to load tensors onto different GPUs and then broadcast them?

It depends on what you want to do. The test cases use a single GPU because of their limited environment, but in realistic workloads each process should load files onto its own GPU and broadcast/scatter the tensors.
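
As a rough sketch of that second pattern, expressed purely with Paddle collectives (not code from this PR; the shapes and values are placeholders):

```python
# Sketch of "load per GPU, then broadcast": each rank puts its own shard on its
# own GPU, then every rank broadcasts its shard so all ranks see the full set.
import paddle
import paddle.distributed as dist

dist.init_parallel_env()
rank = dist.get_rank()
paddle.set_device(f"gpu:{rank}")

# Placeholder for the shard this rank loaded from its files.
local_shard = paddle.full([4], float(rank))

shards = []
for src in range(dist.get_world_size()):
    buf = local_shard.clone() if src == rank else paddle.empty_like(local_shard)
    dist.broadcast(buf, src=src)   # rank `src` sends, every other rank receives
    shards.append(buf)
# `shards` now holds one tensor per rank on every process.
```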

fix the distributed load for paddle

remove useless file

make sure the device id does not exceed the device count

Signed-off-by: zeroRains <[email protected]>
Comment on lines -74 to +84
-        d_id = device.split(":")  # "gpu:0" or "gpu"
-        d_id = int(d_id[1]) if len(d_id) == 2 else 0
+        if isinstance(self.pg, SingleGroup):
+            # For the single-process case (gpu:x or gpu)
+            # gpu:x, like gpu:0, gpu:1, ...
+            d_id = device.split(":")
+            d_id = int(d_id[1]) if len(d_id) == 2 else 0
+        else:
+            # For the distributed case, the current rank determines the GPU:
+            # rank 0 uses gpu:0, rank 1 uses gpu:1, ...
+            d_id = self.pg.rank() % paddle.device.cuda.device_count()
         self.device = f"gpu:{d_id}"
Contributor Author

@zeroRains Jun 5, 2025

In this part, maybe fastsafetensors does not need to consider the distributed case at all.

We just need to load the tensors onto the device provided by the user.

On a machine with multiple GPUs, the user should set the device, e.g. device=f"gpu:{pg.rank()}" in the distributed code, and pass it to the SafeTensorsFileLoader so that different processes load tensors onto different GPUs, as in the sketch below.
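
A minimal sketch of that user-side pattern (the loader calls follow the README-style single-process usage; how the Paddle backend is selected, and the file/tensor names, are assumptions rather than details from this PR):

```python
# Sketch of the proposed usage: the user picks the device per rank and hands it
# to the loader; fastsafetensors itself stays unaware of the process layout.
import paddle.distributed as dist
from fastsafetensors import SafeTensorsFileLoader, SingleGroup

dist.init_parallel_env()
device = f"gpu:{dist.get_rank()}"                       # the user, not the loader, picks the GPU

loader = SafeTensorsFileLoader(SingleGroup(), device)   # Paddle-backend selection omitted here
loader.add_filenames({0: ["model.safetensors"]})        # placeholder file name
fb = loader.copy_files_to_device()
weight = fb.get_tensor("some.weight")                   # placeholder tensor name
loader.close()
```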

What do you think?

Collaborator

@takeshi-yoshimura Jun 6, 2025

I don't think so because safetensors files that are distributed online are not composed like that.

@takeshi-yoshimura merged commit 18391ca into foundation-model-stack:main Jun 6, 2025
11 of 13 checks passed
@takeshi-yoshimura
Collaborator

Thank you!
Let me refactor all the code to resolve the lint errors...

@zeroRains deleted the dist branch June 6, 2025 02:48