Hi... I am getting the following error while resuming training from a checkpoint on a single-GPU system. Training runs fine when started from iteration 0, but it exits immediately after loading a checkpoint. The relevant excerpt that I modified in the run script for this purpose is also shown below. Is this a bug, or is there a mistake somewhere?
(command used)
sh scripts/cityscapes/ocrnet/run_r_101_d_8_ocrnet_train.sh resume x3
(modification in the run script)
elif [ "$1"x == "resume"x ]; then
${PYTHON} -u main.py --configs ${CONFIGS} \
--drop_last y \
--phase train \
--gathered n \
--loss_balance y \
--log_to_file n \
--backbone ${BACKBONE} \
--model_name ${MODEL_NAME} \
--max_iters ${MAX_ITERS} \
--data_dir ${DATA_DIR} \
--loss_type ${LOSS_TYPE} \
--resume_continue y \
--resume ${CHECKPOINTS_ROOT}/checkpoints/bottle/${CHECKPOINTS_NAME}_latest.pth \
--checkpoints_name ${CHECKPOINTS_NAME} \
--distributed False \
2>&1 | tee -a ${LOG_FILE}
# --gpu 0 1 2 3
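For reference, here is a minimal sketch of what a resume path like this typically does on the Python side. It is illustrative only, not the repo's actual module_runner.py code; the checkpoint keys "state_dict", "optimizer" and "iteration", and the helper name, are assumptions.

```python
# Illustrative sketch only -- not the repo's actual resume code.
# Assumed checkpoint layout: a dict with "state_dict", "optimizer" and
# "iteration" keys (these names are assumptions for illustration).
import torch

def resume_from_checkpoint(model, optimizer, ckpt_path, device="cuda"):
    checkpoint = torch.load(ckpt_path, map_location=device)
    # Restore network weights; strict=False tolerates missing/renamed keys.
    model.load_state_dict(checkpoint["state_dict"], strict=False)
    # Restore optimizer state and the iteration counter so the schedule
    # continues from where the previous run stopped (--resume_continue y).
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint.get("iteration", 0)
```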
2022-11-16 11:30:47,097 INFO [module_runner.py, 87] Loading checkpoint from /workspace/data/defGen/graphics/Pre_CL_x3//..//checkpoints/bottle/spatial_ocrnet_deepbase_resnet101_dilated8_x3_latest.pth...
2022-11-16 11:30:47,283 INFO [trainer.py, 90] Params Group Method: None
2022-11-16 11:30:47,285 INFO [optim_scheduler.py, 96] Use lambda_poly policy with default power 0.9
2022-11-16 11:30:47,285 INFO [data_loader.py, 132] use the DefaultLoader for train...
2022-11-16 11:30:47,773 INFO [default_loader.py, 38] train 501
2022-11-16 11:30:47,774 INFO [data_loader.py, 164] use DefaultLoader for val ...
2022-11-16 11:30:47,873 INFO [default_loader.py, 38] val 126
2022-11-16 11:30:47,873 INFO [loss_manager.py, 66] use loss: fs_auxce_loss.
2022-11-16 11:30:47,874 INFO [loss_manager.py, 55] use DataParallelCriterion loss
2022-11-16 11:30:48,996 INFO [data_helper.py, 126] Input keys: ['img']
2022-11-16 11:30:48,996 INFO [data_helper.py, 127] Target keys: ['labelmap']
Traceback (most recent call last):
File "main.py", line 227, in
model.train()
File "/workspace/defGen/External/ContrastiveSeg-main/segmentor/trainer.py", line 390, in train
self.__train()
File "/workspace/defGen/External/ContrastiveSeg-main/segmentor/trainer.py", line 196, in __train
backward_loss = display_loss = self.pixel_loss(outputs, targets,
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/defGen/External/ContrastiveSeg-main/lib/extensions/parallel/data_parallel.py", line 125, in forward
return self.module(inputs[0], *targets[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/defGen/External/ContrastiveSeg-main/lib/loss/loss_helper.py", line 309, in forward
seg_loss = self.ce_loss(seg_out, targets)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/defGen/External/ContrastiveSeg-main/lib/loss/loss_helper.py", line 203, in forward
target = self._scale_target(targets[0], (inputs.size(2), inputs.size(3)))
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
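For what it's worth, the message itself says that inputs inside the cross-entropy loss only has three dimensions (valid dims are [-3, 2]), so inputs.size(3) fails; i.e. seg_out seems to reach the loss as C x H x W instead of N x C x H x W on a single GPU. Below is a minimal snippet that reproduces the same IndexError, plus a shape dump I can drop in right before the self.pixel_loss(outputs, targets, ...) call in trainer.py to see what actually arrives there. The tensor sizes and the helper are just an illustration, not code from the repo.

```python
# Minimal reproduction of the message: calling .size(3) on a 3-D tensor.
import torch

seg_out = torch.randn(19, 128, 256)  # C x H x W -- note the missing batch dim
try:
    h, w = seg_out.size(2), seg_out.size(3)
except IndexError as err:
    # Dimension out of range (expected to be in range of [-3, 2], but got 3)
    print(err)

# Hypothetical shape dump to place just before self.pixel_loss(outputs, targets, ...)
# in segmentor/trainer.py, to see what reaches the loss on a single GPU:
def dump_shapes(outputs, targets):
    outs = outputs if isinstance(outputs, (list, tuple)) else [outputs]
    tgts = targets if isinstance(targets, (list, tuple)) else [targets]
    print("outputs:", [tuple(o.shape) for o in outs if torch.is_tensor(o)])
    print("targets:", [tuple(t.shape) for t in tgts if torch.is_tensor(t)])
```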