Unable to resize error #39

leedh0124 · 2021-05-10T06:04:46Z

Hi. Many thanks for making your work public. It's been a pleasure reading your paper.

I tried running the code in Spyder. It works fine until, at some point, it hits the following runtime error.

Start train epoch 12, lr=0.0001 for run run_20210510T145253
Evaluating baseline on dataset...
100%|██████████| 10/10 [00:00<00:00, 22.94it/s]
100%|██████████| 10/10 [00:03<00:00, 3.04it/s]
100%|██████████| 1/1 [00:00<00:00, 23.44it/s]
100%|██████████| 1/1 [00:00<00:00, 22.40it/s]
Finished epoch 12, took 00:00:03 s
Saving model and state...
Validating...
Validation overall avg_cost: -7.61328125 +- 0.06633966416120529
Evaluating candidate model on evaluation dataset
Epoch 12 candidate mean -7.60546875, baseline epoch 11 mean -7.64453125, difference 0.0390625
Start train epoch 13, lr=0.0001 for run run_20210510T145253
30%|███ | 3/10 [00:00<00:00, 22.74it/s]Evaluating baseline on dataset...
100%|██████████| 10/10 [00:00<00:00, 22.78it/s]
0%| | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\user\anaconda3\envs\attentionVRP\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\Users\user\anaconda3\envs\attentionVRP\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "C:\Users\user\anaconda3\envs\attentionVRP\lib\site-packages\torch\multiprocessing\reductions.py", line 88, in rebuild_tensor
t = torch._utils._rebuild_tensor(storage, storage_offset, size, stride)
File "C:\Users\user\anaconda3\envs\attentionVRP\lib\site-packages\torch\_utils.py", line 133, in _rebuild_tensor
return t.set_(storage, storage_offset, size, stride)
RuntimeError: Trying to resize storage that is not resizable at ..\aten\src\TH\THStorageFunctions.cpp:87

The problem is op with the const data distribution. To keep the problem simple, I set graph_size to 20, batch_size to 512, epoch_size to 5120, eval_batch_size to 512, and ran 100 epochs. All other parameters are unchanged.

Any idea how to tackle this problem?

Thanks in advance!

wouterkool · 2021-05-10T20:30:30Z

Hi! Thanks for reporting. This sounds like a very weird issue which I have not seen before. Does it occur consistently, i.e. is it reproducible? Can you try to isolate the issue?

From your trace and a quick Google search, it seems related to a zero-dimensional tensor being pickled. Maybe you can investigate if/why this happens?
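One way to investigate this is to scan a single dataset sample for zero-dimensional tensors before it reaches the data loader. The following is a minimal sketch, not code from the repository: the helper `find_zero_dim_tensors` and the keys `coords` and `budget` in the toy sample are hypothetical, and it assumes samples are tensors or nested dicts/lists of tensors.

```python
import torch


def find_zero_dim_tensors(sample, prefix="sample"):
    """Return the paths of all zero-dimensional tensors inside a nested sample."""
    found = []
    if torch.is_tensor(sample):
        # A zero-dimensional tensor has dim() == 0 and shape torch.Size([]).
        if sample.dim() == 0:
            found.append(prefix)
    elif isinstance(sample, dict):
        for key, value in sample.items():
            found.extend(find_zero_dim_tensors(value, f"{prefix}[{key!r}]"))
    elif isinstance(sample, (list, tuple)):
        for index, value in enumerate(sample):
            found.extend(find_zero_dim_tensors(value, f"{prefix}[{index}]"))
    return found


# Hypothetical sample: the keys are made up for illustration only.
toy_sample = {
    "coords": torch.rand(20, 2),   # ordinary 2-D tensor
    "budget": torch.tensor(2.0),   # zero-dimensional tensor
}

print(find_zero_dim_tensors(toy_sample))
```

In a real run, the same helper could be called on `training_dataset[0]` to see whether any field is a scalar tensor.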

leedh0124 · 2021-05-11T11:53:32Z

Hi! Thanks for responding quickly. It occurs at random: sometimes training runs through all 100 epochs, and sometimes it fails with this error. The error seems to occur right before line 84 of train.py, shown below.

# Put model in train mode!
model.train()
set_decode_type(model, "sampling")

for batch_id, batch in enumerate(tqdm(training_dataloader, disable=opts.no_progress_bar)):  # <- here

I haven't made any changes to the existing code. Could you tell me which exact versions of PyTorch (>1.7.0) and Python (>3.8) this code is based on? I'm using PyTorch 1.7.1 and Python 3.8.3, by the way.

TimD3 · 2021-07-29T21:12:41Z

Did you find a fix? I'm running into the same error, and it seems to be on the same line. I don't understand where it would even come from. I don't see where in this part of the code zero-dimensional tensors could appear, or what gets pickled there.

I'm on Python 3.8.3 and PyTorch 1.8.1, by the way.

Edit: I figured out that it happens when enumerate(training_dataloader) is called, and that it can be avoided by setting the number of workers of the data loader to 0. However, I am unsure why that happens.
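The workaround described above can be sketched as follows. This is a toy illustration rather than the repository's code: `count_batches` and the random `TensorDataset` are assumptions made for the example. With `num_workers=0`, batches are loaded in the main process, so no worker process is started and nothing has to be pickled and sent to one.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def count_batches(num_workers=0):
    """Iterate over a toy dataset and return the number of batches seen."""
    # Hypothetical stand-in for the training dataset: 64 instances of 20 nodes in 2-D.
    toy_dataset = TensorDataset(torch.rand(64, 20, 2))

    # num_workers=0 loads data in the main process instead of worker processes.
    loader = DataLoader(toy_dataset, batch_size=16, num_workers=num_workers)

    n_batches = 0
    for batch_id, (batch,) in enumerate(loader):
        assert batch.shape == (16, 20, 2)
        n_batches += 1
    return n_batches


if __name__ == "__main__":
    print(count_batches(num_workers=0))
```

The trade-off is that data loading no longer overlaps with training, which may slow down runs where the dataset is expensive to generate.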
