DDP address in use #152
this is with mmi_bigram_train.py, with --world-size=1, --local-rank=0
... i.e. the defaults, no extra args.
We might need to make the port configurable. For a quick work-around you can change it here: https://github.com/k2-fsa/snowfall/blob/master/snowfall/dist.py#L8
yeah I did that.
We can choose it randomly - although I think with
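A minimal sketch of the "choose it randomly" idea (not snowfall code; `find_free_port` is a made-up helper): ask the OS for an unused TCP port and use that for MASTER_PORT.

```python
import os
import socket


def find_free_port() -> int:
    # Bind to port 0 so the OS picks an unused TCP port, then release it
    # and reuse the number. There is a small race window before the DDP
    # process group actually binds it, but in practice this works.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]


# The chosen port must be the same on every process of one training run,
# e.g. picked once by rank 0 and passed to the other ranks.
os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT', str(find_free_port()))
```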
EDIT: Forget the following, I set CUDA_VISIBLE_DEVICES wrong. Getting this error:
Hmm, I've never seen this one before...
I found something about why it was hanging at the end when we use torch.distributed.launch. (Caution: my lhotse was not fully up to date, although sampling.py doesn't seem to have changed in the interim.)
... that's just FYI. I am confused why it never says the local rank is 1, and why the last message says 1426.
Anyway, when I check the times at which each epoch starts, this rank=1 job does not seem to be correctly synchronized with the rank=0 job. It finishes epoch 0 when the rank=0 job has only finished about 90% of its minibatches.
Wouldn't it be easier, in order to support distributed training, to just have the sampler process things as normal and then return 1 out of every world_size minibatches? The time it spends processing that metadata will probably overlap with GPU work anyway; I don't really think that's going to be the limiting factor.
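A rough sketch of that idea (the names here are made up, not the actual lhotse sampler API): every rank walks over the same sequence of batches but keeps only every world_size-th one.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar('T')


def take_every_kth(batches: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    # Each rank iterates over the *same* batch sequence but yields only the
    # batches whose index is congruent to its rank modulo world_size, so the
    # per-rank batch counts differ by at most one.
    for i, batch in enumerate(batches):
        if i % world_size == rank:
            yield batch


# e.g. each worker would wrap its (identically seeded) sampler like:
# for batch in take_every_kth(sampler, rank=local_rank, world_size=world_size):
#     ...
```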
I confirm that if different nodes have a different number of utterances in their dataloaders, the node with the most utterances will hang at the end. I suspect the reason is due to
A minimal example to reproduce it is given below. The current approach to partitioning the dataset over different nodes cannot guarantee that every node receives the same number of utterances. See

```python
total = len(data_source)
per_partition = int(ceil(total / float(world_size)))
partition_start = rank * per_partition
partition_end = min(partition_start + per_partition, total)
```

The node with the largest rank value receives fewer utterances (a small worked example of this arithmetic is given after the reproduction script below).

```python
#!/usr/bin/env python3
import os
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
import torch
import datetime


def run(rank: int, world_size: int):
    print(f'world_size: {world_size}')
    device = torch.device('cuda', rank)
    if rank != 0:
        data = [torch.tensor([1], device=device, dtype=torch.float32) for _ in range(world_size)]
    else:
        data = [torch.tensor([1], device=device, dtype=torch.float32) for _ in range(world_size * 100)]
    # NOTE: `data` on rank 0 has more entries
    dist.barrier()
    model = torch.nn.Linear(1, 1).to(device)
    model = DDP(model, device_ids=[rank])
    for i, d in enumerate(data):
        model.zero_grad()
        y = model(d)
        y.backward()
    print(f'rank {rank} done')


# The node with rank == 0 will exit after the timeout (5 seconds).
# The default timeout is 30 minutes, but it comes into effect
# only if one of the following environment variables is set:
#   - NCCL_ASYNC_ERROR_HANDLING
#   - NCCL_BLOCKING_WAIT
# See https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
def init_process(rank: int, world_size: int, fn):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12357'
    dist.init_process_group('nccl',
                            rank=rank,
                            world_size=world_size,
                            timeout=datetime.timedelta(0, 5))
    fn(rank, world_size)


if __name__ == '__main__':
    print(f'dist.is_available: {dist.is_available()}')
    world_size = 3
    processes = []
    mp.set_start_method('spawn')
    for rank in range(world_size):
        p = mp.Process(target=init_process, args=(rank, world_size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

Its output is:
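To make the imbalance from the ceil-based split concrete, here is a tiny standalone calculation (the totals are made up, purely for illustration):

```python
from math import ceil

total = 10        # hypothetical number of utterances in the dataset
world_size = 3

per_partition = int(ceil(total / float(world_size)))   # 4
for rank in range(world_size):
    partition_start = rank * per_partition
    partition_end = min(partition_start + per_partition, total)
    print(rank, partition_end - partition_start)
# prints: (0, 4), (1, 4), (2, 2) -- rank 2 gets fewer utterances, so the
# ranks with more utterances hang in DDP's gradient all-reduce on their
# extra batches once rank 2 has finished.
```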
@pzelasko
Oooh, now it all finally makes sense. Thanks for debugging this, guys. I'll add a fix to the cut ids partitioning in the sampler.
I'm going to use @danpovey's solution rather than @csukuangfj's solution -- unfortunately, it is not straightforward to estimate how many utterances should be dropped in
@danpovey @csukuangfj can you please try out the version in PR lhotse-speech/lhotse#267 and let me know if it helped? I won't be able to test the snowfall distributed training setup today, but based on the unit tests I wrote it seems to have fixed the issues with an unequal number of batches in each worker.
@pzelasko I confirm that the current change can solve the hanging problem.
Here is the tensorboard log of DDP training with 3 GPUs: https://tensorboard.dev/experiment/UGf0fbC0QianY9WyeghJpQ/
And the WERs are
2021-04-13 09:12:14,242 INFO [common.py:356] [test-clean] %WER 7.39% [3885 / 52576, 505 ins, 330 del, 3050 sub ]
2021-04-13 09:14:29,223 INFO [common.py:356] [test-other] %WER 18.82% [9849 / 52343, 1149 ins, 863 del, 7837 sub ]
The WERs are worse than those of single GPU training. I believe the reason is the learning rate. You can compare the learning rate from the above tensorboard log with the one from single GPU training: https://tensorboard.dev/experiment/h3xiWY0oQ4WGd2dRgG8NWw. I believe if we train it for more epochs, it can achieve similar results.
NOTE: The training time per epoch with 3 GPUs is about 16 minutes, which is about 1/3 of single GPU training.
Great!!
I think in order to get comparable results to the baseline we'd have to divide the minibatch size by the number of workers.
Let's merge this?
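A back-of-the-envelope sketch of that suggestion about dividing the minibatch size by the number of workers (the numbers are made up, and it assumes the batch size is controlled by something like the sampler's max_duration):

```python
# Keep the *global* amount of audio per optimizer step the same as in the
# single-GPU baseline by giving each DDP worker 1/world_size of the budget.
single_gpu_max_duration = 600.0   # hypothetical seconds of audio per batch
world_size = 3

per_worker_max_duration = single_gpu_max_duration / world_size
print(per_worker_max_duration)    # 200.0 seconds per worker per step
```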
Oh it's a PR to lhotse, we'll wait for Piotr to merge.
Merged!
FYI this could be of interest to us https://huggingface.co/blog/accelerate-library
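For context, the core pattern from that blog post looks roughly like this (a sketch, not snowfall code; the toy model and data are made up just to keep it self-contained):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # picks up DDP / device placement from the launch environment

# Toy model and data, only to make the sketch runnable.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=8)

# accelerate wraps the model, optimizer and dataloader for distributed use.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # used instead of loss.backward()
    optimizer.step()
```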
When I try to run more than one training (with a single job) on the same machine, I get this: