
Eval error: CUDA_ERROR_OUT_OF_MEMORY #8

Open
spark157 opened this issue Jul 30, 2019 · 6 comments
Labels
bug Something isn't working

Comments

@spark157

spark157 commented Jul 30, 2019

I have trained the fluorescence task model with:

!tape with model=unirep tasks=fluorescence load_from='pretrained_models/unirep_weights.h5' freeze_embedding_weights=True steps_per_epoch=100 datafile='data/fluorescence/fluorescence_train.tfrecords'

[Note: I used a very small steps_per_epoch so it would train in a reasonable time and I could just get something working.]

Next I tried to evaluate the model using:

!tape-eval results/fluorescence_unirep_2019-07-30--17-22-15/ --datafile data/fluorescence/fluorescence_test.tfrecord

but after only a few iterations the GPU memory use just explodes:

Model Parameters: 18219415 
Loading task weights from results/fluorescence_unirep_2019-07-30--17-22-15/task_weights.h5
Saving outputs to results/fluorescence_unirep_2019-07-30--17-22-15/outputs.pkl
37it [00:20,  2.00it/s, Loss=0.75, MAE=0.57]2019-07-30 19:48:39.226378: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-30 19:48:39.226439: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592

I'm running on a Tesla T4 with 14GB of memory (Google Colab).

The memory explosion would appear to be in
test_metrics = test_graph.run_epoch(save_outputs=outfile)

Any suggestions on how to resolve?

Thanks.

Scott

@thomas-a-neil
Member

Hi Scott,

Thanks for reporting this bug. I've run into some memory issues when saving outputs as well (with the LSTM, which is also a recurrent model like UniRep). We'll look into it.

I just wanted to note that it looks like you're running out of system RAM (not GPU RAM), which I believe on Google Colab is 26 GB.

@thomas-a-neil added the bug label Jul 30, 2019
@spark157
Author

Hi,

I switched to the Transformer to see whether the memory issue is isolated to the UniRep model, but unfortunately I got the same error, although it made it much farther (795 iterations). [I'm not sure what the batch size is, so I don't know how much of the ~27k examples in the test set it got through.]

!tape-eval results/fluorescence_transformer_2019-07-31--12-20-58/ --datafile data/fluorescence/fluorescence_test.tfrecord

2019-07-31 13:55:07.712235: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
Transformer with Parameters:
	n_layers: 12
	n_heads: 8
	d_model: 512
	d_filter: 2048
	dropout: 0.1
Model Parameters: 38435840
Loading task weights from results/fluorescence_transformer_2019-07-31--12-20-58/task_weights.h5
Saving outputs to results/fluorescence_transformer_2019-07-31--12-20-58/outputs.pkl
795it [05:32,  2.44it/s, MAE=1.46, Loss=2.9]2019-07-31 14:00:59.735716: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-31 14:00:59.735787: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592
2019-07-31 14:00:59.735818: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 7730940928 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory

Thanks for the quick responses and looking into this.

Scott

@rmrao
Collaborator

rmrao commented Jul 31, 2019

So this is an issue because we're really using save_outputs in a way it was never intended to be used. What save_outputs does is save everything the model returns: it appends each batch's outputs to a list, then dumps everything into a pickle file at the end of the epoch. But by default our models return a lot of internal state (e.g. the outputs of the encoder), which save_outputs was never really designed to handle. Since it keeps everything in RAM instead of saving incrementally, you'll eventually hit this problem on a large enough dataset regardless of which model you run.
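Roughly, the pattern looks like this (a minimal sketch for illustration, not the actual rinokeras code; model_step and test_data are placeholder names):

```python
# Illustrative sketch of the current save_outputs behavior: every batch's full
# output dict is appended to a Python list, so host RAM grows with dataset size,
# and everything is only written out (pickled) at the end of the epoch.
import pickle

all_outputs = []
for batch in test_data:
    outputs = model_step(batch)   # includes internal state like encoder outputs
    all_outputs.append(outputs)   # nothing is freed until the epoch finishes

with open("outputs.pkl", "wb") as f:
    pickle.dump(all_outputs, f)
```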

What we should do is change save_outputs to use an h5py file or something and save outputs incrementally instead of keeping everything in RAM.
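Something along these lines, as a rough sketch (resizable h5py datasets appended to batch by batch; the names and the assumption of fixed per-example shapes are mine, and variable-length outputs would need padding or vlen dtypes):

```python
# Sketch of incremental saving with h5py: create a resizable dataset per output
# the first time it is seen, then grow it and write each batch immediately so
# nothing accumulates in host RAM.
import h5py

with h5py.File("outputs.h5", "w") as f:
    dsets = {}
    for batch in test_data:
        outputs = model_step(batch)  # dict of name -> np.ndarray, batch on axis 0
        for name, array in outputs.items():
            if name not in dsets:
                dsets[name] = f.create_dataset(
                    name,
                    shape=(0,) + array.shape[1:],
                    maxshape=(None,) + array.shape[1:],
                    dtype=array.dtype,
                    chunks=True,
                )
            d = dsets[name]
            d.resize(d.shape[0] + array.shape[0], axis=0)
            d[-array.shape[0]:] = array  # write this batch and move on
```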

@rmrao
Collaborator

rmrao commented Jul 31, 2019

When writing the paper I just added a hack to delete the keys of the output that we didn't need, which is why we were able to actually get results. We need a more robust solution if we're going to expose this feature to other people.

@spark157
Author

spark157 commented Jul 31, 2019

Would it be possible (or make sense) to pass an argument such as --save_outputs False when starting tape-eval? (Currently it is hardcoded as save_outputs: True in config_updates.)

I'm not sure how disabling save_outputs would affect test_metrics = test_graph.run_epoch(save_outputs=outfile), but it looks like save_outputs is optional in rinokeras, so maybe it could be set to False and handled as a separate case.
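For illustration, the kind of change I have in mind (a hypothetical sketch; I haven't checked the actual tape-eval code path):

```python
# Hypothetical quick fix: make save_outputs a CLI/config flag and only pass the
# output file through to run_epoch when saving is actually requested.
if save_outputs:
    test_metrics = test_graph.run_epoch(save_outputs=outfile)
else:
    test_metrics = test_graph.run_epoch(save_outputs=False)
```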

The above might be a quick fix in terms of getting it running (albeit without saving the outputs) until your better solution is implemented.

Scott

@rmrao
Collaborator

rmrao commented Aug 25, 2019

@CaptainCapsaicin I'm not sure what the progress on the h5py option is. It's tricky because we don't want to write a non-general solution into rinokeras, which is used by other students in our lab. I see that there's an open pull request and some conversation with David. Maybe we can take a closer look when I get back next week.

As for a quick fix, we could probably do something like that: tape-eval would print the metrics but wouldn't actually save any outputs. That's mostly fine for some tasks (secondary structure, remote homology), not ideal for others (contact prediction), and pretty bad for the rest (fluorescence, stability). The main issue is that the metrics tape-eval reports are computed on-GPU on a per-batch basis, which is a little tricky for some of the actual metrics we use.
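To illustrate what I mean (illustrative numbers and metric choice only; a rank correlation like Spearman's rho is an example of a metric that needs the full set of predictions, unlike MAE, which averages cleanly over batches):

```python
# Illustrative only: with equal-size batches, the mean of per-batch MAEs equals
# the full-dataset MAE, but the mean of per-batch Spearman correlations is
# generally NOT the full-dataset correlation, so such metrics need all outputs.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
preds = rng.normal(size=1000)
labels = preds + 0.5 * rng.normal(size=1000)

p_batches, l_batches = np.split(preds, 10), np.split(labels, 10)

batch_maes = [np.mean(np.abs(p - l)) for p, l in zip(p_batches, l_batches)]
print(np.mean(batch_maes), np.mean(np.abs(preds - labels)))       # identical

batch_rhos = [spearmanr(p, l).correlation for p, l in zip(p_batches, l_batches)]
print(np.mean(batch_rhos), spearmanr(preds, labels).correlation)  # not the same
```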
