1080Ti is out of memory for testing 1024P pretrained model #19

Open · nejyeah opened this issue Feb 8, 2018 · 20 comments

@nejyeah commented Feb 8, 2018

pytorch_pix2pixHD |     result = self.forward(*input, **kwargs)
pytorch_pix2pixHD |   File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
pytorch_pix2pixHD |     self.padding, self.dilation, self.groups)
pytorch_pix2pixHD | RuntimeError: cuda runtime error (2) : out of memory at /tmp/pip-z3dlenmr-build/aten/src/THC/generic/THCStorage.cu:58

So could you offer a 512p pretrained model for testing?

@tcwang0509 (Collaborator)

1080Ti should be able to run the inference perfectly fine; it should only take about 4G memory. Are you sure the GPU is not running something else at the same time?

@nejyeah (Author) commented Feb 8, 2018

I am sure there are no other jobs running at the same time.
PyTorch is built from a Docker image. Here are the Dockerfile and docker-compose file:

# Dockerfile
FROM pytorch-cuda8-cudnn6:gpu-py3
RUN mkdir /app \
    && pip install dominate
WORKDIR /app

# docker-compose.yml

version: '2'
services:
  pix2pixHD:
    build: .
    image: pytorch/pix2pixhd:gpu-py3
    container_name: pytorch_pix2pixHD
    volumes:
      - .:/app
    #environment:
    #  - CUDA_VISIBLE_DEVICES=0
    command:
      - bash
      - ./scripts/test_1024p.sh
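
For reference, building and starting this setup uses the standard Compose commands below; note that this assumes the NVIDIA Docker runtime is configured as the default runtime, since the v2 compose file above does not request a GPU itself:

docker-compose build
docker-compose up pix2pixHD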

Error information:

pytorch_pix2pixHD | ---------- Networks initialized -------------
pytorch_pix2pixHD | model [Pix2PixHDModel] was created
pytorch_pix2pixHD | THCudaCheck FAIL file=/tmp/pip-z3dlenmr-build/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
pytorch_pix2pixHD | /app/models/pix2pixHD_model.py:112: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
pytorch_pix2pixHD |   input_label = Variable(input_label, volatile=infer)
pytorch_pix2pixHD | process image... ['./datasets/cityscapes/test_label/frankfurt_000000_000576_gtFine_labelIds.png']
pytorch_pix2pixHD | Traceback (most recent call last):
pytorch_pix2pixHD |   File "test.py", line 29, in <module>
pytorch_pix2pixHD |     generated = model.inference(data['label'], data['inst'])
pytorch_pix2pixHD |   File "/app/models/pix2pixHD_model.py", line 188, in inference
pytorch_pix2pixHD |     fake_image = self.netG.forward(input_concat)
pytorch_pix2pixHD |   File "/app/models/networks.py", line 182, in forward
pytorch_pix2pixHD |     output_prev = model_upsample(model_downsample(input_i) + output_prev)
pytorch_pix2pixHD |   File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
pytorch_pix2pixHD |     result = self.forward(*input, **kwargs)
pytorch_pix2pixHD |   File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/container.py", line 75, in forward
pytorch_pix2pixHD |     input = module(input)
pytorch_pix2pixHD |   File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
pytorch_pix2pixHD |     result = self.forward(*input, **kwargs)
pytorch_pix2pixHD |   File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
pytorch_pix2pixHD |     self.padding, self.dilation, self.groups)
pytorch_pix2pixHD | RuntimeError: cuda runtime error (2) : out of memory at /tmp/pip-z3dlenmr-build/aten/src/THC/generic/THCStorage.cu:58

That's weird!

@arthur-qiu

I met a similar problem and solved it by adding the proper options. You may need to read the README carefully.

@xmengli commented Apr 7, 2018

@tcwang0509
Thanks for your excellent work!!

I ran the inference code bash ./scripts/test_1024p.sh on my server, but it shows an error. I specified the batchSize as 1.

---------- Networks initialized -------------
Pretrained network G has fewer layers; The following are not initialized:
['model', 'model1_1']
model [Pix2PixHDModel] was created
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1513363039688/work/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "test.py", line 29, in <module>
    generated = model.inference(data['label'], data['inst'])
  File "/home/xmli/pheng4/pix2pixHD/models/pix2pixHD_model.py", line 188, in inference
    fake_image = self.netG.forward(input_concat)
  File "/home/xmli/pheng4/pix2pixHD/models/networks.py", line 182, in forward
    output_prev = model_upsample(model_downsample(input_i) + output_prev)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 277, in forward
    self.padding, self.dilation, self.groups)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/functional.py", line 90, in conv2d
    return f(input, weight, bias)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1513363039688/work/torch/lib/THC/generic/THCStorage.cu:58

I ran on a TITAN Xp and used an empty GPU for the inference. My torch version is 0.3.0.

nvidia-smi
Sat Apr  7 19:19:50 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 0000:04:00.0      On |                  N/A |
| 28%   49C    P2    61W / 250W |    251MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 0000:05:00.0     Off |                  N/A |
| 50%   78C    P2   269W / 250W |  10280MiB / 12189MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 0000:08:00.0     Off |                  N/A |
| 23%   36C    P8    16W / 250W |      3MiB / 12189MiB |      0%      Default |

@xmengli commented Apr 7, 2018

@tcwang0509 @ArthurQiuu Could you provide any solutions to the problems? Thanks so much!!

@xmengli commented Apr 8, 2018

The problem was solved when I updated the torch version from 0.3.0 to 0.3.1.post2. My PyTorch info is posted below.
Thanks all!


$ conda list | grep pytorch
cuda80                    1.0                  h205658b_0    pytorch
pytorch                   0.3.1           py27_cuda8.0.61_cudnn7.0.5_2    pytorch
torchvision               0.2.0            py27hfb27419_1    pytorch
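
For anyone else stuck on 0.3.0, one way to pull that exact build, matching the pytorch channel and cuda80 package shown in the listing above (adjust the CUDA package to your setup):

$ conda install pytorch=0.3.1 cuda80 -c pytorch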

@borisfom (Contributor) commented May 9, 2018

I am running ToT PyTorch, and 1024p does not fit in 16G by default for inference (test.py). I have added an FP16 option (see my PR) to make it fit.
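
The underlying idea is plain half-precision inference; a minimal sketch of it (not the PR itself; netG and input_concat are the names used in this repo's inference path):

import torch

# Casting weights and inputs to FP16 roughly halves activation memory.
# A real implementation also has to keep numerically sensitive layers
# (e.g. normalization) stable in half precision.
netG = netG.half().cuda()
input_concat = input_concat.half().cuda()
with torch.no_grad():
    fake_image = netG(input_concat)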

@cchen156

I met the same problem when using a Titan X GPU to test the pre-trained 1024p model. Did anyone solve the out-of-memory problem?

@tcwang0509 Is it possible to provide the 512p pre-trained model for testing? Thank you!

@hahakid commented Jun 27, 2018

I met the same problem on a 1080 Ti. I ran the program on an empty GPU and it failed, but I could still get two pics.
So I read options.py and commented out --resize_or_crop none; it can work, but the generated images (1024×512) are not as good as expected. When using the default --resize_or_crop scale_width, I can get only one generated image (2048×1024), and it is much better.
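
For context, my reading of how those flags interact (a sketch from options/base_options.py at the time; the defaults quoted are assumptions from memory, so verify against your checkout):

# 'scale_width' resizes each input so its width equals --loadSize
# (default 1024, hence the 1024x512 Cityscapes outputs);
# 'none' keeps the full 2048x1024 frames, which is what exhausts GPU memory.
python test.py --name label2city_1024p --netG local --ngf 32 --resize_or_crop scale_width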

Therefore, I tried to train my own models using ./scripts/train_512p.sh and ran into the following problem:
create web directory ./checkpoints/label2city_512p/web...
Traceback (most recent call last):
  File "train.py", line 61, in <module>
    Variable(data['image']), Variable(data['feat']), infer=save_fake)
  File "/home/zfserver/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zfserver/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 112, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/zfserver/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/zfserver/ouyang/gan/pix2pixHD/models/pix2pixHD_model.py", line 154, in forward
    input_label, inst_map, real_image, feat_map = self.encode_input(label, inst, image, feat)
  File "/media/zfserver/ouyang/gan/pix2pixHD/models/pix2pixHD_model.py", line 122, in encode_input
    if self.opt.data_type==16:
AttributeError: 'Namespace' object has no attribute 'data_type'

Actually, all the other training scripts produce the same issue.
Any help?

The datasets are organized as follows:
train_img: ****leftImg8bit.png
train_inst: ****gtFine_instanceIds.png
train_label: ****gtFine_labelIds.png

@hahakid commented Jun 28, 2018

@tcwang0509 I tried different combinations of parameters in test_1024p.sh and found that --ngf strongly affects memory. I also watched memory consumption while running: training at 512p may use only about 4 GB, but testing eats much more. Reducing --ngf to 20 lets the test run, but the quality of the images is very strange. I tested on both a 1080 Ti and a Titan X.
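
A rough back-of-the-envelope for why --ngf matters so much: activation memory grows linearly with it, and at 1024p each full-resolution feature map is already huge. A toy estimate (batch 1, float32; illustrative numbers only, not a profile of the actual network):

h, w, ngf = 1024, 2048, 64
mib = h * w * ngf * 4 / 2**20   # 4 bytes per float32
print('%d MiB per full-resolution feature map' % mib)   # 512 MiB, and the network holds many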

@tcwang0509 (Collaborator)

@ouyangkid Are you using PyTorch 0.4? It seems the problem is that volatile is no longer supported, so inference costs a lot more memory than it should. Please pull the latest version and see if it works.
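
For background, the UserWarning in the log above ("volatile was removed and now has no effect") is exactly this. A minimal sketch of the old idiom versus the new one (netG and input_concat stand in for the repo's inference path):

import torch
from torch.autograd import Variable

# PyTorch 0.3.x: volatile=True disabled graph construction during inference.
input_label = Variable(input_label, volatile=True)

# PyTorch 0.4+: volatile is silently ignored, so the full autograd graph is
# kept alive during the forward pass unless it runs under no_grad().
with torch.no_grad():
    fake_image = netG(input_concat)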

@hahakid commented Jun 29, 2018

@tcwang0509 Yes, thanks for your response. It seems the next version will be 1.0, but it is not publicly available yet. I will wait and try again after the official version is published.

@marioft commented Jul 3, 2018

@ouyangkid I got the same error as you: "... AttributeError: 'Namespace' object has no attribute 'data_type'". Did you only change the --ngf parameter? I have already tried that and it did not work.
Thanks in advance.

@hahakid commented Jul 3, 2018

@marioft According to @tcwang0509, the problem comes from the versions of the different software. As I found, reducing the --ngf parameter is one way to decrease GPU memory consumption, but the outputs are weird.

I suggest you wait for the new version of PyTorch 1.0 / TensorRT. As you can see, NVIDIA currently has only one person supporting this project; I have also given up testing for now.
This is my environment:
CUDA 9.0, cuDNN 7.1.5, TensorRT 4.0, PyTorch 0.4

@marioft commented Jul 3, 2018

Thanks for your reply; I'll update the software then and hope it works. I'm working with CUDA 7.5, cuDNN 7.1.3, TensorRT 4.0.1, and PyTorch 0.4.0.

@Avyukth commented Oct 25, 2018

I ran the code with the default bash ./scripts/test_1024p.sh and it works fine with PyTorch 0.4. Then I replaced the train label with a custom image of the same dimensions as in the test case (1024x2048), and it throws the error below:
Traceback (most recent call last):
  File "test.py", line 61, in <module>
    generated = model.inference(data['label'], data['inst'])
  File "project/pix2pixHD/models/pix2pixHD_model.py", line 216, in inference
    fake_image = self.netG.forward(input_concat)
  File "project/pix2pixHD/models/networks.py", line 180, in forward
    output_prev = self.model(input_downsampled[-1])
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
    self.padding, self.dilation, self.groups)
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py", line 90, in conv2d
    return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR

Any insight? Thanks in advance.

@ghost commented Nov 17, 2018

Hi @nejyeah, I am trying to run pix2pixHD using a Docker container. I used your Dockerfile, but this line

FROM pytorch-cuda8-cudnn6:gpu-py3

raises an error:

pull access denied for pytorch-cuda8-cudnn6, repository does not exist or may require 'docker login'

Can you help me dockerize pix2pixHD?
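
That FROM line points at an image that was built locally on nejyeah's machine, which is why the pull is denied. A sketch of a replacement using a public base image (the exact tag is an assumption; pick one from Docker Hub that matches your CUDA setup):

FROM pytorch/pytorch:0.4_cuda9_cudnn7
RUN mkdir /app && pip install dominate
WORKDIR /app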

@nejyeah (Author) commented Nov 19, 2018

@fabio-C Sorry, I did not keep the Dockerfile or the Docker image.

@9of9 commented Dec 11, 2018

If you're using PyTorch 1.0.0, you'll also get a CUDA out-of-memory error. You'll want to find line 214 in pix2pixHD_model.py and comment out

        if torch.__version__.startswith('0.4'):
            with torch.no_grad():
                fake_image = self.netG.forward(input_concat)
        else:
            fake_image = self.netG.forward(input_concat)

And replace it with just

        with torch.no_grad():
            fake_image = self.netG.forward(input_concat)

Or your own, improved, PyTorch version-detecting code. with torch.no_grad() is correct for PyTorch 0.4, but it should also be used for later versions of PyTorch, which this code does not do.
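
One way to write that check so it also covers 1.0+ while keeping 0.3 compatibility (a sketch, not what the repo ships; it uses feature detection instead of version-string matching):

        if hasattr(torch, 'no_grad'):      # PyTorch >= 0.4, including 1.0+
            with torch.no_grad():
                fake_image = self.netG.forward(input_concat)
        else:                              # PyTorch 0.3.x relied on volatile inputs
            fake_image = self.netG.forward(input_concat)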

@royaljain

@9of9's solution worked for me (thanks!). I noted one interesting thing, though: if I pass --resize_or_crop none, then I don't get out of memory (although the output images don't make sense). OOM occurs only when --resize_or_crop is scale_width.
