-
You need to set up your data loader accordingly if you intend to use data parallelism. 👟 only wraps your model for DDP and runs it the right way for DDP; the rest relies on your model implementation. So a basic change would be using DistributedSampler to send different splits of the data to different GPUs.
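For reference, a minimal sketch of that change (plain PyTorch, not the 👟 Trainer API) looks like this; the dataset, batch size, and epoch count are placeholders, and it assumes `init_process_group` has already been called in the DDP process:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset; in a real run this would be the actual training dataset.
dataset = TensorDataset(torch.randn(1024, 80), torch.randint(0, 10, (1024,)))

# DistributedSampler shards the dataset across the DDP processes, so each GPU
# sees a different subset of every epoch instead of the full mini-batch.
sampler = DistributedSampler(
    dataset,
    num_replicas=dist.get_world_size(),  # total number of DDP processes
    rank=dist.get_rank(),                # this process's rank
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)  # re-shuffle the shards each epoch
    for features, labels in loader:
        pass  # forward/backward for this rank's shard goes here
```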
-
Data Parallelism is when we split the mini-batch of samples into multiple smaller mini-batches and run the computation for each of the smaller mini-batches in parallel. [1]
Correct me if I'm wrong, but the following code creates subprocesses, so basically the same script will take the entire mini-batch. In a multi-GPU situation, it doesn't look like the code will exploit multiple GPUs, because it's not written to achieve data parallelism.
Trainer/trainer/distribute.py, line 54 at commit 0b3c88a
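For context, the usual PyTorch pattern behind that kind of launcher is one process per GPU, each joining the same process group. This is a generic sketch of that pattern, not the actual contents of `distribute.py`; the address/port defaults are assumptions:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    # Each spawned process binds to one GPU and joins the shared process group.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... build the model, wrap it in DistributedDataParallel, build a DataLoader
    # with a DistributedSampler, and run the training loop here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```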
Additionally, let's say I save a checkpoint at the 100th iteration. I'm not sure the trainer is capable of saving the checkpoint with the lowest loss: at that 100th iteration, the script that finishes last will overwrite the checkpoints already saved by the scripts running on the other GPUs.
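A common way to avoid that kind of race (again, only an illustration, not a statement about what the trainer currently does; the function name and path are hypothetical) is to let only rank 0 write checkpoints:

```python
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, step, path="best_model.pth"):
    # Only rank 0 writes to disk, so the processes on the other GPUs never
    # race to overwrite the same checkpoint file with their own state.
    if dist.is_initialized() and dist.get_rank() != 0:
        return
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )
```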
I looked at the trainer code after fine-tuning on 4 GPUs. Initially I ran it on 2 GPUs and expected a quicker run with 4 GPUs, but that didn't happen; the runtime was the same as with 2 GPUs.
Ref.