Hi... I am getting the following error while resuming training from a checkpoint on a single-GPU system. Training runs fine when started from iteration 0, but it exits immediately after loading a checkpoint. The relevant excerpt that I modified in the run script for this purpose is also shown below. Is this a bug, or is there a mistake somewhere?
(command used)
sh scripts/cityscapes/ocrnet/run_r_101_d_8_ocrnet_train.sh resume x3
(modification in the run script)
elif [ "$1"x == "resume"x ]; then
${PYTHON} -u main.py --configs ${CONFIGS} \
--drop_last y \
--phase train \
--gathered n \
--loss_balance y \
--log_to_file n \
--backbone ${BACKBONE} \
--model_name ${MODEL_NAME} \
--max_iters ${MAX_ITERS} \
--data_dir ${DATA_DIR} \
--loss_type ${LOSS_TYPE} \
--resume_continue y \
--resume ${CHECKPOINTS_ROOT}/checkpoints/bottle/${CHECKPOINTS_NAME}_latest.pth \
--checkpoints_name ${CHECKPOINTS_NAME} \
--distributed False \
2>&1 | tee -a ${LOG_FILE}
# --gpu 0 1 2 3
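For reference, here is a minimal sketch of what a resume path like this typically does on the Python side. It is illustrative only, not the repo's actual module_runner.py code; the checkpoint keys "state_dict", "optimizer" and "iteration", and the helper name, are assumptions.

```python
# Illustrative sketch only -- not the repo's actual resume code.
# Assumed checkpoint layout: a dict with "state_dict", "optimizer" and
# "iteration" keys (these names are assumptions for illustration).
import torch

def resume_from_checkpoint(model, optimizer, ckpt_path, device="cuda"):
    checkpoint = torch.load(ckpt_path, map_location=device)
    # Restore network weights; strict=False tolerates missing/renamed keys.
    model.load_state_dict(checkpoint["state_dict"], strict=False)
    # Restore optimizer state and the iteration counter so the schedule
    # continues from where the previous run stopped (--resume_continue y).
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint.get("iteration", 0)
```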
2022-11-16 11:30:47,097 INFO [module_runner.py, 87] Loading checkpoint from /workspace/data/defGen/graphics/Pre_CL_x3//..//checkpoints/bottle/spatial_ocrnet_deepbase_resnet101_dilated8_x3_latest.pth...
2022-11-16 11:30:47,283 INFO [trainer.py, 90] Params Group Method: None
2022-11-16 11:30:47,285 INFO [optim_scheduler.py, 96] Use lambda_poly policy with default power 0.9
2022-11-16 11:30:47,285 INFO [data_loader.py, 132] use the DefaultLoader for train...
2022-11-16 11:30:47,773 INFO [default_loader.py, 38] train 501
2022-11-16 11:30:47,774 INFO [data_loader.py, 164] use DefaultLoader for val ...
2022-11-16 11:30:47,873 INFO [default_loader.py, 38] val 126
2022-11-16 11:30:47,873 INFO [loss_manager.py, 66] use loss: fs_auxce_loss.
2022-11-16 11:30:47,874 INFO [loss_manager.py, 55] use DataParallelCriterion loss
2022-11-16 11:30:48,996 INFO [data_helper.py, 126] Input keys: ['img']
2022-11-16 11:30:48,996 INFO [data_helper.py, 127] Target keys: ['labelmap']
Traceback (most recent call last):
File "main.py", line 227, in
model.train()
File "/workspace/defGen/External/ContrastiveSeg-main/segmentor/trainer.py", line 390, in train
self.__train()
File "/workspace/defGen/External/ContrastiveSeg-main/segmentor/trainer.py", line 196, in __train
backward_loss = display_loss = self.pixel_loss(outputs, targets,
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/defGen/External/ContrastiveSeg-main/lib/extensions/parallel/data_parallel.py", line 125, in forward
return self.module(inputs[0], *targets[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/defGen/External/ContrastiveSeg-main/lib/loss/loss_helper.py", line 309, in forward
seg_loss = self.ce_loss(seg_out, targets)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/defGen/External/ContrastiveSeg-main/lib/loss/loss_helper.py", line 203, in forward
target = self._scale_target(targets[0], (inputs.size(2), inputs.size(3)))
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
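For what it's worth, the message itself says that inputs inside the cross-entropy loss only has three dimensions (valid dims are [-3, 2]), so inputs.size(3) fails; i.e. seg_out seems to reach the loss as C x H x W instead of N x C x H x W on a single GPU. Below is a minimal snippet that reproduces the same IndexError, plus a shape dump I can drop in right before the self.pixel_loss(outputs, targets, ...) call in trainer.py to see what actually arrives there. The tensor sizes and the helper are just an illustration, not code from the repo.

```python
# Minimal reproduction of the message: calling .size(3) on a 3-D tensor.
import torch

seg_out = torch.randn(19, 128, 256)  # C x H x W -- note the missing batch dim
try:
    h, w = seg_out.size(2), seg_out.size(3)
except IndexError as err:
    # Dimension out of range (expected to be in range of [-3, 2], but got 3)
    print(err)

# Hypothetical shape dump to place just before self.pixel_loss(outputs, targets, ...)
# in segmentor/trainer.py, to see what reaches the loss on a single GPU:
def dump_shapes(outputs, targets):
    outs = outputs if isinstance(outputs, (list, tuple)) else [outputs]
    tgts = targets if isinstance(targets, (list, tuple)) else [targets]
    print("outputs:", [tuple(o.shape) for o in outs if torch.is_tensor(o)])
    print("targets:", [tuple(t.shape) for t in tgts if torch.is_tensor(t)])
```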