
training error #12

Open
wzr0108 opened this issue Apr 30, 2023 · 3 comments
wzr0108 commented Apr 30, 2023

When I run the code with the following command:

python main.py --config config/meta_portrait_256_pretrain_warp.yaml --fp16 --stage Warp --task Pretrain

I get this error:

start to train...
Epoch 1 Iter 0 D/Time : 6.768/00h00m06s warp_perceptual : 124.12;loss_G_init : 0.00;loss_D_init : 0.00loss_G_last : 0.00;loss_D_last : 0.00
Epoch 1 Iter 0 Step 0 event save
Traceback (most recent call last):
  File "main.py", line 125, in <module>
    mp.spawn(main, nprocs=params.ngpus, args=(params,))
  File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/disk/sde/wzr/MetaPortrait/base_model/main.py", line 118, in main
    train_ddp(args, conf, models, datasets)
  File "/disk/sde/wzr/MetaPortrait/base_model/train_ddp.py", line 114, in train_ddp
    losses_G, generated = G_full(data, stage=args["stage"])
  File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 787, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. Since `find_unused_parameters=True` is enabled, this likely means that not all `forward` outputs participate in computing loss. You can fix this by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 140 141 142 143
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

My environment is Python 3.8, torch 1.9.1+cu111.
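
For reference, a minimal sketch (not part of the repo) of the debugging step the error message suggests: set TORCH_DISTRIBUTED_DEBUG before mp.spawn creates the workers, so every rank inherits it and reports which parameters missed gradients.

import os

# Set this before mp.spawn(main, nprocs=params.ngpus, args=(params,)) in main.py
# so the child processes inherit it; DETAIL prints the names of the parameters
# that did not receive gradients on each rank.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # or "INFO"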


wzr0108 commented Apr 30, 2023

I modified the code in train_ddp.py:

with torch.cuda.amp.autocast():
    losses_G, generated = G_full(data, stage=args["stage"])
    loss_G = sum([val.mean() for val in losses_G.values()])
    # avoid ddp bug
    for k, v in generated.items():
        loss_G += v.mean() * 0.0
scaler.scale(loss_G).backward()

The error is gone, but I am not sure whether the result is still correct.
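
For what it's worth, the `* 0.0` trick is numerically a no-op: it adds zero to the loss and contributes zero gradient, but it attaches the otherwise-unused outputs to the autograd graph so DDP's reducer sees a gradient for every parameter. A standalone sketch (toy tensors, not the repo's model) illustrating that the gradients are unchanged:

import torch

w = torch.randn(3, requires_grad=True)
extra = w * 2.0                      # stands in for an output that never reaches the loss
loss = w.sum() + extra.mean() * 0.0  # the workaround from train_ddp.py, in miniature
loss.backward()
print(w.grad)                        # tensor([1., 1., 1.]), same as for w.sum() alone

So the loss value and the parameter updates are identical; the only effect is on DDP's gradient bookkeeping.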

ForeverFancy (Collaborator) commented

Thanks for pointing this out, I will check it later.

xueziii commented Jul 25, 2023

I want to train a model from scratch, but how should I prepare the training data? I don't know how to generate the ldmk, theta, id, and map_dict files or where to put them.
