
RuntimeError: CUDA error: out of memory with 16GB-memory GPU #5

Open
JinChengneng opened this issue Sep 16, 2020 · 3 comments
@JinChengneng

Hi there! I am really interested in your repository, and thank you for your work on latent-gan.

However, I am facing a problem while running the entire training process by executing python run.py -sf data/EGFR_training.smi.

The error messages are shown below.

Traceback (most recent call last):
  File "run.py", line 54, in <module>
    runner.run()
  File "run.py", line 32, in run
    decode_mols_save_path=self.decoded_smiles,n_epochs=self.n_epochs,sample_after_training=self.sample_size)
  File "/home/jinchengneng/latent-gan/runners/TrainModelRunner.py", line 69, in __init__
    self.G.cuda()
  File "/opt/anaconda3/envs/latent-gan/lib/python3.6/site-packages/torch/nn/modules/module.py", line 260, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/opt/anaconda3/envs/latent-gan/lib/python3.6/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/opt/anaconda3/envs/latent-gan/lib/python3.6/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/opt/anaconda3/envs/latent-gan/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/opt/anaconda3/envs/latent-gan/lib/python3.6/site-packages/torch/nn/modules/module.py", line 260, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory

I monitored the GPU via watch nvidia-smi, and GPU memory usage grows very large while the models are being loaded; see the screenshot below. Once the model started training, the program exited with the error RuntimeError: CUDA error: out of memory.
[Screenshot: nvidia-smi output showing GPU memory usage while loading the models]
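
For reference, nvidia-smi reports memory held by every process on the card; a minimal sketch using the standard torch.cuda calls for checking what this PyTorch process itself has allocated, which helps tell a leak in the code apart from another process's usage:

import torch

# Bytes occupied by live tensors created by this process
print(torch.cuda.memory_allocated() / 1024 ** 2, "MiB allocated")
# Bytes held by PyTorch's caching allocator (named memory_cached()
# on older PyTorch versions)
print(torch.cuda.memory_reserved() / 1024 ** 2, "MiB reserved")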

In my experience, 16 GB of GPU memory is enough for most programs. I would appreciate it if you could take a look and check whether there is a memory leak or anything else wrong.

@JinChengneng (Author)

UPDATE:

I solved this problem by adding an extra GPU memory limit. If you are facing the same out-of-memory problem, you can try adding the following code at the top of run.py. I hope it helps.

import tensorflow

# Cap TensorFlow at 80% of GPU memory so the PyTorch models can still
# be allocated alongside it (TensorFlow 1.x API).
tf_config = tensorflow.ConfigProto()
tf_config.gpu_options.per_process_gpu_memory_fraction = 0.8
session = tensorflow.Session(config=tf_config)
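
Note that ConfigProto/Session is the TensorFlow 1.x API. On TensorFlow 2.x, the closest equivalent (a sketch, assuming the goal is simply to stop TensorFlow from grabbing the whole GPU up front) is to enable memory growth:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving nearly all of it
# at startup, leaving room for the PyTorch models.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)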

@SeemonJ (Collaborator)

SeemonJ commented Mar 10, 2021

[Screenshot: memory usage while executing python run.py]

Hi,
Sorry for taking so long to get back to you. I am attaching a screenshot of my memory usage while executing python run.py, which uses the EGFR training set as the default input. I have committed a few software updates in the past few days, but none of them affect the actual system.

I can't seem to reproduce the issue you are having.
By this time it might no longer be relevant for you, but I'd love to hear about any further observations you may have made.

Best,
Simon

@muammar

muammar commented Feb 8, 2022

This worked for me. I also needed to select a specific GPU, which I could do from bash by prepending CUDA_VISIBLE_DEVICES=0 to the python invocation:

CUDA_VISIBLE_DEVICES=0 python3 run.py --flags
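
The same device selection also works from inside Python, provided the environment variable is set before torch (or tensorflow) is imported; a minimal sketch:

import os

# Must run before importing torch/tensorflow, otherwise CUDA has
# already enumerated all devices.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # now reports only GPU 0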
