[WIP] Initially working multi-gpu training #71
Conversation
I also need to check that the Lhotse dataset/dataloader does not return duplicate cuts in the same epoch. |
Great!! |
K2SpeechRecognitionIterableDataset has no 'len()', so torch.utils.data.distributed.DistributedSampler cannot split the dataset into subsets. |
That is expected - it's an "iterable dataset", so it cannot use samplers. I still have to check how to make it compatible with distributed training but I don't expect any hurdles (it already supports splitting datasets into partitions for parallel dataloader workers, will probably just have to split into more partitions for distributed training). |
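For context, a minimal sketch of the partitioning idea described above (generic PyTorch, not Lhotse's actual implementation): the dataset splits its items into `world_size * num_workers` partitions so that each DDP rank and each dataloader worker sees a disjoint subset.

```python
from torch.utils.data import IterableDataset, get_worker_info


class PartitionedIterableDataset(IterableDataset):
    """Splits `items` into world_size * num_workers partitions so that no item
    is seen twice in an epoch across DDP ranks and dataloader workers."""

    def __init__(self, items, rank: int = 0, world_size: int = 1):
        self.items = items
        self.rank = rank              # DDP rank of the process that owns this dataset
        self.world_size = world_size  # total number of DDP processes

    def __iter__(self):
        info = get_worker_info()
        worker_id = info.id if info is not None else 0
        num_workers = info.num_workers if info is not None else 1
        part = self.rank * num_workers + worker_id  # this partition's global index
        num_parts = self.world_size * num_workers   # total number of partitions
        for i, item in enumerate(self.items):
            if i % num_parts == part:
                yield item
```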
This seems to work correctly with Lhotse's PR lhotse-speech/lhotse#194. I verified that the cuts are not duplicated in an epoch by dumping the cut IDs from individual workers' partitions into files and comparing them. I went on and added loss/num_frames synchronization to the master node so that we're logging those quantities correctly. I'll let the training finish and see if it works OK till the end (and what the WER is). I might have another idea that could make this a bit simpler to use -- DataLoader has a "batch_sampler" argument, so maybe dynamic batching could be performed inside of that instead of in the Dataset. Then we could return to using "map-style" Datasets, and I think Lhotse's code could be greatly simplified; the whole thing would also be closer to standard PyTorch workflows. Let me check that out before we merge these things. |
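As an illustration of the "batch_sampler" idea (a rough sketch of the general technique, not Lhotse's eventual API): a sampler keeps adding cut indices to a batch until a total-duration budget is reached, and the DataLoader consumes the index lists it yields.

```python
from torch.utils.data import Sampler


class DynamicBatchSampler(Sampler):
    """Groups consecutive dataset indices until their total duration exceeds a budget."""

    def __init__(self, durations, max_duration: float = 600.0):
        self.durations = durations        # duration in seconds of each cut, in dataset order
        self.max_duration = max_duration  # maximum total seconds of audio per batch

    def __iter__(self):
        batch, total = [], 0.0
        for idx, dur in enumerate(self.durations):
            if batch and total + dur > self.max_duration:
                yield batch
                batch, total = [], 0.0
            batch.append(idx)
            total += dur
        if batch:
            yield batch


# Hypothetical usage with a map-style dataset of cuts and a suitable collate_fn:
# loader = DataLoader(cut_dataset,
#                     batch_sampler=DynamicBatchSampler(durations),
#                     collate_fn=collate_cuts)
```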
Cool! |
The LFMMI training seems to work with both single and multi GPU now; I'll post the results for 2 GPUs once the training is done, and then if it looks OK, we can merge. |
Now that I try to train the full thing with 2 GPUs, I see the training consistently hanging in epoch 1 at about 1000 steps. Both GPU and CPU usage are shown as 100%, but the training stops progressing (and GPU power use is very low, 70 / 250 W, which I have observed in the past is often an indicator of the GPU not actually being utilized). I started debugging by inspecting Python's stack; then I checked the native stack trace by attaching gdb to the main process, and this is it:
I am not sure what to make of it, or whether it's a K2 issue or not (interestingly, practically the same code seems not to hang in a different project I'm working on). |
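For reference, one standard way to get a Python-level stack out of a hung process (not necessarily the tool used here) is to register a faulthandler signal handler at startup and then signal the stuck process from another shell:

```python
import faulthandler
import signal

# Installed once at the start of training. Later, `kill -USR1 <pid>` makes the
# process dump the Python stack of every thread to stderr without stopping it.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```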
Clarification: by "the same code" I meant the same method of distributed training (DDP, sync batch norm, setup, cleanup, Lhotse's dataloaders) - not the model code, and the project is not using K2. |
I verified the issue occurs on a completely different machine. I'm building the latest K2 from master, with CUDA 10.2 and PyTorch 1.7.1. The GPUs are GTX1080/RTX2080. |
Mm. See if you can run it in cuda-gdb; the command to check what is running is `info cuda kernels`. |
This is what I got:
(cuda-gdb) info cuda kernels
Kernel Parent Dev Grid Status SMs Mask GridDim BlockDim Invocation
* 0 - 0 52162520 Active 0x00000001 (1,1,1) (64,1,1) ncclReduceRingLLKernel_sum_f64()
The backtrace (bt) shows only this CUDA kernel too, so I attached with normal gdb again and the stack trace was the same as I previously shared. Does it give you any ideas? |
Mm, try running under pdb and see if you can get a Python trace; searching for ncclReduceRing indicates it might be something to do with a batchnorm. But it could be that earlier errors were the problem, not whatever you break into.
Running it under profiling, i.e. nsys profile [command], may show something useful, although it will tend to generate a very large file. |
Your suggestion gave me another idea, so this time I ran it with
|
I found it! The culprit was "torch.distributed.reduce", which causes NCCL to hang for some reason. After I discovered that, I started searching whether others had had this issue, and there are plenty of issues in PyTorch's repo about reduce/allreduce + NCCL. I will see if it helps when I replace it with "gather" and sum manually. |
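For concreteness, a sketch of the two variants being compared here, using plain torch.distributed calls (the tensor names are illustrative, not the recipe's actual code):

```python
import torch
import torch.distributed as dist


def sync_metrics_reduce(loss: torch.Tensor, num_frames: torch.Tensor) -> None:
    # Variant that was hanging: sum both values onto rank 0 only.
    dist.reduce(loss, dst=0, op=dist.ReduceOp.SUM)
    dist.reduce(num_frames, dst=0, op=dist.ReduceOp.SUM)


def sync_metrics_gather(loss: torch.Tensor, num_frames: torch.Tensor, world_size: int):
    # Alternative: gather the per-rank values on every rank and sum them manually.
    loss_list = [torch.zeros_like(loss) for _ in range(world_size)]
    frames_list = [torch.zeros_like(num_frames) for _ in range(world_size)]
    dist.all_gather(loss_list, loss)
    dist.all_gather(frames_list, num_frames)
    return sum(loss_list), sum(frames_list)
```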
It seems something gets consistently stuck after 3000 steps when I use any of reduce/gather/all_reduce/all_gather. I've simply removed the loss/num_frames sync across the GPUs, so the reported values are just from the master process. It is a half-measure, but one I can live with for now (unless somebody has a better idea). I set up the validation dataloader so that it evaluates the full set on each GPU, so the validation values are reported for the full dev set. The training runs for 3 epochs now - once it finishes, I will update the RESULTS file, resolve conflicts, and merge. |
OK, great that you fixed it. |
…when using bucketing
Unfortunately, I didn't. After these changes, it just hangs later (in epoch 9). The WER of epoch 8 is 10.74%, so it seems to be working correctly until the NCCL deadlock. I tried changing the environment - I used CUDA 11 + CUDNN 8.0.4 and re-built K2 with them. But then, curiously, I'm getting the following error:
Really not sure how to proceed. |
snowfall/egs/librispeech/asr/simple_v1/mmi_bigram_train.py Lines 58 to 61 in c396e55
Can you print the value of
|
After I added the print statements, I got the following error (running with one GPU, i.e.
|
Also, the NCCL hanging error is extremely deterministic - after I resume the training from the start of epoch 9, it will hang at exactly the same batch as in the previous run. |
RE the NCCL error: I wonder whether you could figure out which batch it is and modify the code to skip the previous batches somehow, so we can isolate it and get the nsys trace?
RE the error with CUDA 11: if it happens at the start of training, perhaps you could look at the nsys profile output? It might be an error from a previous kernel (?).. although I'd think a device-side assert would fail immediately. Is it on the 1st minibatch? |
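A simple way to do that kind of batch skipping (a hedged sketch; `start_batch` and the function names are hypothetical, not an existing option in the script):

```python
def train_with_skip(dataloader, training_step, start_batch: int = 0) -> None:
    """Fast-forward past batches that are known to be fine so the failing
    minibatch can be reached quickly under nsys or cuda-gdb. Iterating the
    dataloader (rather than re-seeding it) keeps the batch order identical
    to the original run."""
    for batch_idx, batch in enumerate(dataloader):
        if batch_idx < start_batch:
            continue  # skip batches that are known to train without hanging
        training_step(batch)


# Hypothetical usage:
# train_with_skip(train_dl, one_training_step, start_batch=1390)
```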
Yes, the first minibatch. I’ll look at the nsys output and let you know. Isolating the minibatch sounds good, will try it as well. |
Thanks... |
.. also, |
I started the training from epoch 9, batch 1390, which was the last batch printed out before the program hung (2 GPUs, CUDA 10.2). It did hang again, so I went on and ran it with nsys. Unfortunately, I fail to see a potential reason for the hang in the report. You can download the profile at this URL and maybe you'll be able to read it better than I can (it's about 20MB, so don't worry about excessive size). As for 1 GPU + CUDA 11, I have the nsys profile too (~10MB), but I can't seem to extract anything useful out of it.. |
One more thing - in the CUDA 11 case, I noticed the following failed assertions just before the crash in print(tot_scores):
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [47,0,0], thread: [32,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
(the same assertion is repeated for threads [33,0,0] through [63,0,0] of block [47,0,0])
|
OK, so in the CUDA 11 case we should be able to debug that; indexing is getting the wrong indexes.
Did you run with
CUDA_LAUNCH_BLOCKING=1
? If not, please do; we need a Python stack trace to debug it. I thought PyTorch indexing actually printed the problem if the indexes were wrong, but it seems not in this particular case.
In the case where it hung: can you run in cuda-gdb, ctrl-c when it hangs, and check that no kernels are running? If one is, it might give us something to debug.
Also: using cuda-gdb, you can do
set cuda memcheck on
before running it, which might give us something. |
BTW, although we should definitely debug this, I don't think we need to drop everything to do this. |
And BTW, if nsys isn't loading the other qdrep file, it could be because the version of Nsight Systems is not new enough. But don't worry about that; likely it's not the best way to debug the CUDA 11 setup anyway. For the hang: something that will be helpful is stack traces for all the processes. cuda-gdb may only give it for the parent process. It might be necessary to do something like the following, individually for each of the other processes that are running: |
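A sketch of that idea in script form (the use of psutil and the exact gdb invocation are assumptions; each DDP worker is a separate process, so each needs its own attach):

```python
import subprocess

import psutil  # assumption: psutil is available in the environment


def dump_native_stacks(launcher_pid: int) -> None:
    """Attach gdb briefly to the launcher process and each of its children,
    printing a native backtrace for every thread of every process."""
    procs = [psutil.Process(launcher_pid)]
    procs += psutil.Process(launcher_pid).children(recursive=True)
    for proc in procs:
        subprocess.run(
            ["gdb", "-p", str(proc.pid), "-batch", "-ex", "thread apply all bt"],
            check=False,  # keep going even if one attach fails
        )
```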
I’m OK to go debug it one step at a time in my own „background thread”, as long as we have ideas for the next steps. I’ll proceed with your suggestions and get back to you. About the qdrep files - I was able to open them and see the profile, I meant that I haven’t learned anything useful from it. |
Most important next steps are those about n-best list rescoring and to do with extracting phone-synchronous features. |
For the CUDA 11 bug, I ran it with CUDA_LAUNCH_BLOCKING=1 and K2_SYNC_KERNELS=1, but the output was the same. Actually yes, let's put a pin in this one. I will resolve the conflicts and merge (if you run single-GPU training, the NCCL hanging issue does not arise; if you use CUDA 11, the issue does not seem related to this PR's changes). We can debug this further in the future. Maybe the hanging problem won't arise with a different architecture that doesn't use LSTM (I wouldn't be shocked). |
One good bit of news - I've added an option to use bucketing in the Transformer MMI recipe too; it seems like the strongest recipe currently, so I'll check how useful it is there. |
Thanks!! Merging. |
@pzelasko |
I've only seen ~300 steps, but the model seems to be converging alright, and both GPUs are close to 100% load. This probably needs further work to make sure the checkpoints work OK and the dev data scores are aggregated from all nodes. It also needs a few "if"s to handle both single-gpu and multi-gpu correctly. You can run this like:
You probably expected that to work already, but it's nice to see K2 running on multi-gpu :)
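For illustration, the kind of single-gpu/multi-gpu guards referred to above might look roughly like this (a sketch with hypothetical function names, not the PR's actual code):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_model(model: torch.nn.Module, world_size: int, local_rank: int) -> torch.nn.Module:
    device = torch.device('cuda', local_rank)
    model = model.to(device)
    if world_size > 1:
        # Only convert batch norms and wrap in DDP for multi-GPU runs.
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
        model = DDP(model, device_ids=[local_rank])
    return model


def is_master(world_size: int) -> bool:
    # Log, checkpoint, and run validation only on rank 0 when distributed.
    return world_size == 1 or dist.get_rank() == 0
```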