-
You need to set up your data loader accordingly if you intend to use data parallelism. 👟 only wraps your model for DDP and runs it the right way for DDP; the rest relies on your model implementation. So a basic change would be using DistributedSampler to send different splits of the data to different GPUs.
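For reference, a minimal sketch of that change (plain PyTorch, not the 👟 Trainer API) looks like this; the dataset, batch size, and epoch count are placeholders, and it assumes `init_process_group` has already been called in the DDP process:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset; in a real run this would be the actual training dataset.
dataset = TensorDataset(torch.randn(1024, 80), torch.randint(0, 10, (1024,)))

# DistributedSampler shards the dataset across the DDP processes, so each GPU
# sees a different subset of every epoch instead of the full mini-batch.
sampler = DistributedSampler(
    dataset,
    num_replicas=dist.get_world_size(),  # total number of DDP processes
    rank=dist.get_rank(),                # this process's rank
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)  # re-shuffle the shards each epoch
    for features, labels in loader:
        pass  # forward/backward for this rank's shard goes here
```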
-
Data Parallelism is when we split the mini-batch of samples into multiple smaller mini-batches and run the computation for each of the smaller mini-batches in parallel. [1]
Correct me if I'm wrong, but the following code creates subprocesses, so basically the same script will take the entire mini-batch. In a multi-GPU situation, it doesn't look like the code will exploit multiple GPUs, because it's not written to achieve data parallelism.
Trainer/trainer/distribute.py, line 54 at commit 0b3c88a
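For context, the usual PyTorch pattern behind that kind of launcher is one process per GPU, each joining the same process group. This is a generic sketch of that pattern, not the actual contents of `distribute.py`; the address/port defaults are assumptions:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    # Each spawned process binds to one GPU and joins the shared process group.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... build the model, wrap it in DistributedDataParallel, build a DataLoader
    # with a DistributedSampler, and run the training loop here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```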
Additionally, let's say I save a checkpoint at the 100th iteration. I'm not sure the trainer is capable of saving the checkpoint with the lowest loss: at that 100th iteration, the script that finishes last will overwrite the checkpoints already saved by the scripts running on the other GPUs.
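A common way to avoid that kind of race (again, only an illustration, not a statement about what the trainer currently does; the function name and path are hypothetical) is to let only rank 0 write checkpoints:

```python
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, step, path="best_model.pth"):
    # Only rank 0 writes to disk, so the processes on the other GPUs never
    # race to overwrite the same checkpoint file with their own state.
    if dist.is_initialized() and dist.get_rank() != 0:
        return
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )
```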
I looked at the trainer code after fine-tuning on 4 GPUs. Initially I ran it on 2 GPUs and expected a quicker run with 4 GPUs, but that didn't happen; the runtime was the same as with 2 GPUs.
Ref.