1080Ti is out of memory for testing 1024P pretrained model #19

Open · nejyeah opened this issue Feb 8, 2018 · 20 comments

@nejyeah commented Feb 8, 2018

pytorch_pix2pixHD |     result = self.forward(*input, **kwargs)
pytorch_pix2pixHD |   File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
pytorch_pix2pixHD |     self.padding, self.dilation, self.groups)
pytorch_pix2pixHD | RuntimeError: cuda runtime error (2) : out of memory at /tmp/pip-z3dlenmr-build/aten/src/THC/generic/THCStorage.cu:58

So could you offer a 512p pretrained model for testing?

@tcwang0509 (Collaborator)

1080Ti should be able to run the inference perfectly fine; it should only take about 4G memory. Are you sure the GPU is not running something else at the same time?

@nejyeah (Author) commented Feb 8, 2018

I am sure there are no other jobs running at the same time.
PyTorch is built from a Docker image. Here are the Dockerfile and docker-compose file:

# Dockerfile
FROM pytorch-cuda8-cudnn6:gpu-py3
RUN mkdir /app \
    && pip install dominate
WORKDIR /app

# docker-compose.yml

version: '2'
services:
  pix2pixHD:
    build: .
    image: pytorch/pix2pixhd:gpu-py3
    container_name: pytorch_pix2pixHD
    volumes:
      - .:/app
    #environment:
    #  - CUDA_VISIBLE_DEVICES=0
    command:
      - bash
      - ./scripts/test_1024p.sh
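
For reference, building and starting this setup uses the standard Compose commands below; note that this assumes the NVIDIA Docker runtime is configured as the default runtime, since the v2 compose file above does not request a GPU itself:

docker-compose build
docker-compose up pix2pixHD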

Error information:

pytorch_pix2pixHD | ---------- Networks initialized -------------
pytorch_pix2pixHD | model [Pix2PixHDModel] was created
pytorch_pix2pixHD | THCudaCheck FAIL file=/tmp/pip-z3dlenmr-build/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
pytorch_pix2pixHD | /app/models/pix2pixHD_model.py:112: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
pytorch_pix2pixHD |   input_label = Variable(input_label, volatile=infer)
pytorch_pix2pixHD | process image... ['./datasets/cityscapes/test_label/frankfurt_000000_000576_gtFine_labelIds.png']
pytorch_pix2pixHD | Traceback (most recent call last):
pytorch_pix2pixHD |   File "test.py", line 29, in <module>
pytorch_pix2pixHD |     generated = model.inference(data['label'], data['inst'])
pytorch_pix2pixHD |   File "/app/models/pix2pixHD_model.py", line 188, in inference
pytorch_pix2pixHD |     fake_image = self.netG.forward(input_concat)
pytorch_pix2pixHD |   File "/app/models/networks.py", line 182, in forward
pytorch_pix2pixHD |     output_prev = model_upsample(model_downsample(input_i) + output_prev)
pytorch_pix2pixHD |   File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
pytorch_pix2pixHD |     result = self.forward(*input, **kwargs)
pytorch_pix2pixHD |   File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/container.py", line 75, in forward
pytorch_pix2pixHD |     input = module(input)
pytorch_pix2pixHD |   File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
pytorch_pix2pixHD |     result = self.forward(*input, **kwargs)
pytorch_pix2pixHD |   File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
pytorch_pix2pixHD |     self.padding, self.dilation, self.groups)
pytorch_pix2pixHD | RuntimeError: cuda runtime error (2) : out of memory at /tmp/pip-z3dlenmr-build/aten/src/THC/generic/THCStorage.cu:58

That's weird!

@arthur-qiu

I met a similar problem and solved it by adding the proper options. You may need to read the README carefully.

@xmengli commented Apr 7, 2018

@tcwang0509
Thanks for your excellent work!!

I ran the inference code bash ./scripts/test_1024p.sh on my server, but it shows an error. I specified the batchSize as 1.

---------- Networks initialized -------------
Pretrained network G has fewer layers; The following are not initialized:
['model', 'model1_1']
model [Pix2PixHDModel] was created
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1513363039688/work/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "test.py", line 29, in <module>
    generated = model.inference(data['label'], data['inst'])
  File "/home/xmli/pheng4/pix2pixHD/models/pix2pixHD_model.py", line 188, in inference
    fake_image = self.netG.forward(input_concat)
  File "/home/xmli/pheng4/pix2pixHD/models/networks.py", line 182, in forward
    output_prev = model_upsample(model_downsample(input_i) + output_prev)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 277, in forward
    self.padding, self.dilation, self.groups)
  File "/home/xmli/anaconda2/envs/python2/lib/python2.7/site-packages/torch/nn/functional.py", line 90, in conv2d
    return f(input, weight, bias)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1513363039688/work/torch/lib/THC/generic/THCStorage.cu:58

I ran on a TITAN Xp and used an empty GPU for the inference. My torch version is 0.3.0.

nvidia-smi
Sat Apr  7 19:19:50 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 0000:04:00.0      On |                  N/A |
| 28%   49C    P2    61W / 250W |    251MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 0000:05:00.0     Off |                  N/A |
| 50%   78C    P2   269W / 250W |  10280MiB / 12189MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 0000:08:00.0     Off |                  N/A |
| 23%   36C    P8    16W / 250W |      3MiB / 12189MiB |      0%      Default |

@xmengli commented Apr 7, 2018

@tcwang0509 @ArthurQiuu Could you provide any solutions to the problems? Thanks so much!!

@xmengli commented Apr 8, 2018

The problem was solved when I updated the torch version from 0.3.0 to 0.3.1.post2. My PyTorch info is posted below.
Thanks all!


$ conda list | grep pytorch
cuda80                    1.0                  h205658b_0    pytorch
pytorch                   0.3.1           py27_cuda8.0.61_cudnn7.0.5_2    pytorch
torchvision               0.2.0            py27hfb27419_1    pytorch
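
For anyone else stuck on 0.3.0, one way to pull that exact build, matching the pytorch channel and cuda80 package shown in the listing above (adjust the CUDA package to your setup):

$ conda install pytorch=0.3.1 cuda80 -c pytorch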

@borisfom (Contributor) commented May 9, 2018

I am running ToT PyTorch, and 1024p does not fit in 16G by default for inference (test.py). I have added an FP16 option (see my PR) to make it fit.
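
The underlying idea is plain half-precision inference; a minimal sketch of it (not the PR itself; netG and input_concat are the names used in this repo's inference path):

import torch

# Casting weights and inputs to FP16 roughly halves activation memory.
# A real implementation also has to keep numerically sensitive layers
# (e.g. normalization) stable in half precision.
netG = netG.half().cuda()
input_concat = input_concat.half().cuda()
with torch.no_grad():
    fake_image = netG(input_concat)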

@cchen156

I met the same problem when using a Titan X GPU to test the pre-trained 1024p model. Did anyone solve the out-of-memory problem?

@tcwang0509 Is it possible to provide the 512p pre-trained model for testing? Thank you!

@hahakid commented Jun 27, 2018

I met the same problem on a 1080 Ti. I ran the program on an empty GPU and it failed, but I could still get two pics.
So I read options.py and commented out --resize_or_crop none; it can work, but the generated images (1024×512) are not as good as expected. When using the default --resize_or_crop scale_width, I can get only one generated image (2048×1024), and it is much better.
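
For context, my reading of how those flags interact (a sketch from options/base_options.py at the time; the defaults quoted are assumptions from memory, so verify against your checkout):

# 'scale_width' resizes each input so its width equals --loadSize
# (default 1024, hence the 1024x512 Cityscapes outputs);
# 'none' keeps the full 2048x1024 frames, which is what exhausts GPU memory.
python test.py --name label2city_1024p --netG local --ngf 32 --resize_or_crop scale_width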

Therefore, I tried to train my own models using ./scripts/train_512p.sh and ran into the following problem:
create web directory ./checkpoints/label2city_512p/web...
Traceback (most recent call last):
  File "train.py", line 61, in <module>
    Variable(data['image']), Variable(data['feat']), infer=save_fake)
  File "/home/zfserver/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zfserver/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 112, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/zfserver/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/zfserver/ouyang/gan/pix2pixHD/models/pix2pixHD_model.py", line 154, in forward
    input_label, inst_map, real_image, feat_map = self.encode_input(label, inst, image, feat)
  File "/media/zfserver/ouyang/gan/pix2pixHD/models/pix2pixHD_model.py", line 122, in encode_input
    if self.opt.data_type==16:
AttributeError: 'Namespace' object has no attribute 'data_type'

Actually, all the other training scripts produce the same issue.
Any help?

The datasets are organized as follows:
train_img: ****leftImg8bit.png
train_inst: ****gtFine_instanceIds.png
train_label: ****gtFine_labelIds.png

@hahakid commented Jun 28, 2018

@tcwang0509 I tried different combinations of parameters in test_1024p.sh and found that --ngf strongly affects memory. I also watched memory consumption while running: training at 512p may use only about 4 GB, but testing eats much more. Reducing --ngf to 20 lets the test run, but the quality of the images is very strange. I tested on both a 1080 Ti and a Titan X.
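
A rough back-of-the-envelope for why --ngf matters so much: activation memory grows linearly with it, and at 1024p each full-resolution feature map is already huge. A toy estimate (batch 1, float32; illustrative numbers only, not a profile of the actual network):

h, w, ngf = 1024, 2048, 64
mib = h * w * ngf * 4 / 2**20   # 4 bytes per float32
print('%d MiB per full-resolution feature map' % mib)   # 512 MiB, and the network holds many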

@tcwang0509 (Collaborator)

@ouyangkid Are you using PyTorch 0.4? It seems the problem is that volatile is no longer supported, so inference costs a lot more memory than it should. Please pull the latest version and see if it works.
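
For background, the UserWarning in the log above ("volatile was removed and now has no effect") is exactly this. A minimal sketch of the old idiom versus the new one (netG and input_concat stand in for the repo's inference path):

import torch
from torch.autograd import Variable

# PyTorch 0.3.x: volatile=True disabled graph construction during inference.
input_label = Variable(input_label, volatile=True)

# PyTorch 0.4+: volatile is silently ignored, so the full autograd graph is
# kept alive during the forward pass unless it runs under no_grad().
with torch.no_grad():
    fake_image = netG(input_concat)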

@hahakid commented Jun 29, 2018

@tcwang0509 Yes, thanks for your response. It seems the next version will be 1.0, but it is not publicly available yet. I will wait and try again after the official version is published.

@marioft commented Jul 3, 2018

@ouyangkid I got the same error as you: "... AttributeError: 'Namespace' object has no attribute 'data_type'". Did you only change the --ngf parameter? I have already tried that and it did not work.
Thanks in advance.

@hahakid commented Jul 3, 2018

@marioft According to @tcwang0509, the problem comes from the versions of the different software. As I found, reducing the --ngf parameter is one way to decrease GPU memory consumption, but the outputs are weird.

I suggest you wait for the new version of PyTorch 1.0 / TensorRT. As you can see, NVIDIA currently has only one person supporting this project; I have also given up testing for now.
This is my environment:
CUDA 9.0, cuDNN 7.1.5, TensorRT 4.0, PyTorch 0.4

@marioft commented Jul 3, 2018

Thanks for your reply; I'll update the software then and hope it works. I'm working with CUDA 7.5, cuDNN 7.1.3, TensorRT 4.0.1, and PyTorch 0.4.0.

@Avyukth commented Oct 25, 2018

I ran the code with the default bash ./scripts/test_1024p.sh and it works fine with PyTorch 0.4. Then I replaced the train label with a custom image of the same dimensions as in the test case (1024x2048), and it throws the error below:
Traceback (most recent call last):
  File "test.py", line 61, in <module>
    generated = model.inference(data['label'], data['inst'])
  File "project/pix2pixHD/models/pix2pixHD_model.py", line 216, in inference
    fake_image = self.netG.forward(input_concat)
  File "project/pix2pixHD/models/networks.py", line 180, in forward
    output_prev = self.model(input_downsampled[-1])
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
    self.padding, self.dilation, self.groups)
  File "anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py", line 90, in conv2d
    return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR

Any insight? Thanks in advance.

@ghost commented Nov 17, 2018

Hi @nejyeah, I am trying to run pix2pixHD using a Docker container. I used your Dockerfile, but this line

FROM pytorch-cuda8-cudnn6:gpu-py3

raises an error:

pull access denied for pytorch-cuda8-cudnn6, repository does not exist or may require 'docker login'

Can you help me dockerize pix2pixHD?
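
That FROM line points at an image that was built locally on nejyeah's machine, which is why the pull is denied. A sketch of a replacement using a public base image (the exact tag is an assumption; pick one from Docker Hub that matches your CUDA setup):

FROM pytorch/pytorch:0.4_cuda9_cudnn7
RUN mkdir /app && pip install dominate
WORKDIR /app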

@nejyeah (Author) commented Nov 19, 2018

@fabio-C Sorry, I did not keep the Dockerfile or the Docker image.

@9of9 commented Dec 11, 2018

If you're using PyTorch 1.0.0, you'll also get a CUDA out-of-memory error. You'll want to find line 214 in pix2pixHD_model.py and comment out

        if torch.__version__.startswith('0.4'):
            with torch.no_grad():
                fake_image = self.netG.forward(input_concat)
        else:
            fake_image = self.netG.forward(input_concat)

And replace it with just

        with torch.no_grad():
            fake_image = self.netG.forward(input_concat)

Or your own, improved, PyTorch version-detecting code. with torch.no_grad() is correct for PyTorch 0.4, but it should also be used for later versions of PyTorch, which this code does not do.
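
One way to write that check so it also covers 1.0+ while keeping 0.3 compatibility (a sketch, not what the repo ships; it uses feature detection instead of version-string matching):

        if hasattr(torch, 'no_grad'):      # PyTorch >= 0.4, including 1.0+
            with torch.no_grad():
                fake_image = self.netG.forward(input_concat)
        else:                              # PyTorch 0.3.x relied on volatile inputs
            fake_image = self.netG.forward(input_concat)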

@royaljain

@9of9's solution worked for me (thanks!). I noted one interesting thing, though: if I pass --resize_or_crop none, then I don't get out of memory (although the output images don't make sense). OOM occurs only when --resize_or_crop is scale_width.
