start to train...
Epoch 1 Iter 0 D/Time : 6.768/00h00m06s warp_perceptual : 124.12; loss_G_init : 0.00; loss_D_init : 0.00; loss_G_last : 0.00; loss_D_last : 0.00
Epoch 1 Iter 0 Step 0 event save
Traceback (most recent call last):
File "main.py", line 125, in <module>
mp.spawn(main, nprocs=params.ngpus, args=(params,))
File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/disk/sde/wzr/MetaPortrait/base_model/main.py", line 118, in main
train_ddp(args, conf, models, datasets)
File "/disk/sde/wzr/MetaPortrait/base_model/train_ddp.py", line 114, in train_ddp
losses_G, generated = G_full(data, stage=args["stage"])
File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/disk/sdb/wzr/miniforge3/envs/torch1.9/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 787, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. Since `find_unused_parameters=True` is enabled, this likely means that not all `forward` outputs participate in computing loss. You can fix this by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 140 141 142 143
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
My env is Python 3.8, torch 1.9.1+cu111.
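Per the error message's suggestion, `TORCH_DISTRIBUTED_DEBUG` can be set to see exactly which parameters (indices 140-143 on rank 1 here) miss their gradients. A sketch; the key point is that the variable must be set before `mp.spawn` launches the workers:

```python
import os

# Must be visible before torch.distributed initializes in the spawned
# processes; accepted values are "INFO" and "DETAIL".
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
```

With `DETAIL`, the rerun of the failing iteration names the unused parameters instead of only printing their indices.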
```python
with torch.cuda.amp.autocast():
    losses_G, generated = G_full(data, stage=args["stage"])
    loss_G = sum([val.mean() for val in losses_G.values()])
    # Workaround for the DDP unused-parameter error: fold every generated
    # output into the loss with zero weight so all parameters receive a grad.
    for k, v in generated.items():
        loss_G += v.mean() * 0.0
scaler.scale(loss_G).backward()
```
The code runs without errors, but I am not sure whether the result is still correct.
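The zero-weight trick should not change the result. A minimal CPU-only sketch (the `TwoHeads` module is hypothetical, not from the repo) shows why: the `* 0.0` terms connect otherwise-unused parameters to the graph, so they receive (all-zero) gradients, while the loss value itself is unchanged — which is exactly what silences DDP's reducer check.

```python
import torch
import torch.nn as nn

# Hypothetical module with two heads; only "main" feeds the real loss,
# mimicking the unused parameters (indices 140-143) reported by DDP.
class TwoHeads(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)

    def forward(self, x):
        return {"main": self.used(x), "extra": self.unused(x)}

model = TwoHeads()
out = model(torch.randn(2, 4))

loss = out["main"].mean()          # real loss: only the "main" head
base = loss.item()
for v in out.values():             # zero-weight pass over every output
    loss = loss + v.mean() * 0.0   # value unchanged, graph now touches all params

loss.backward()
# model.unused.weight.grad is now an all-zero tensor rather than None,
# and loss.item() still equals the original value `base`.
```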
I want to train a model from scratch, but how should I prepare the training data? I don't know how to generate the ldmk, theta, id, and map_dict files, or where to place them.
When I run the code following this command, I get this error.
My env is Python 3.8, torch 1.9.1+cu111.