
Eval error: CUDA_ERROR_OUT_OF_MEMORY #8

Open
spark157 opened this issue Jul 30, 2019 · 6 comments
Labels
bug Something isn't working

Comments

@spark157

spark157 commented Jul 30, 2019

I have trained the fluorescence task model with:

!tape with model=unirep tasks=fluorescence load_from='pretrained_models/unirep_weights.h5' freeze_embedding_weights=True steps_per_epoch=100 datafile='data/fluorescence/fluorescence_train.tfrecords'

[Note: I used a very small steps_per_epoch so it would train in a reasonable time and I could just get something working.]

Next I tried to evaluate the model using:

!tape-eval results/fluorescence_unirep_2019-07-30--17-22-15/ --datafile data/fluorescence/fluorescence_test.tfrecord

but after only a few iterations the GPU memory use just explodes:

Model Parameters: 18219415 
Loading task weights from results/fluorescence_unirep_2019-07-30--17-22-15/task_weights.h5
Saving outputs to results/fluorescence_unirep_2019-07-30--17-22-15/outputs.pkl
37it [00:20,  2.00it/s, Loss=0.75, MAE=0.57]2019-07-30 19:48:39.226378: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-30 19:48:39.226439: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592

I'm running on a Tesla T4 with 14GB of memory (Google Colab).

The memory explosion would appear to be in
test_metrics = test_graph.run_epoch(save_outputs=outfile)

Any suggestions on how to resolve?

Thanks.

Scott

@thomas-a-neil
Member

Hi Scott,

Thanks for reporting this bug. I've run into some memory issues when saving outputs as well (with the LSTM, which is also a recurrent model like UniRep). We'll look into it.

I just wanted to note that it looks like you're running out of system RAM (not GPU RAM), which I believe on Google Colab is 26 GB.

@thomas-a-neil added the bug label Jul 30, 2019
@spark157
Author

Hi,

I switched to the Transformer to see whether the memory issue is isolated to the UniRep model, but unfortunately I got the same error, although it made it much farther (795 iterations). [I'm not sure what the batch size is, so I don't know how much of the ~27k examples in the test set it got through.]

!tape-eval results/fluorescence_transformer_2019-07-31--12-20-58/ --datafile data/fluorescence/fluorescence_test.tfrecord

2019-07-31 13:55:07.712235: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
Transformer with Parameters:
	n_layers: 12
	n_heads: 8
	d_model: 512
	d_filter: 2048
	dropout: 0.1
Model Parameters: 38435840
Loading task weights from results/fluorescence_transformer_2019-07-31--12-20-58/task_weights.h5
Saving outputs to results/fluorescence_transformer_2019-07-31--12-20-58/outputs.pkl
795it [05:32,  2.44it/s, MAE=1.46, Loss=2.9]2019-07-31 14:00:59.735716: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-07-31 14:00:59.735787: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592
2019-07-31 14:00:59.735818: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 7730940928 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory

Thanks for the quick responses and looking into this.

Scott

@rmrao
Collaborator

rmrao commented Jul 31, 2019

So this is an issue because we're really using save_outputs in a way it was never intended to be used. What save_outputs does is save everything the model returns: it appends each batch's outputs to a list, then dumps everything into a pickle file at the end of the epoch. But by default our models return a lot of internal state (e.g. the outputs of the encoder), which save_outputs was never really designed to handle. Since it keeps everything in RAM instead of saving incrementally, you'll eventually hit this problem on a large enough dataset regardless of which model you run.
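Roughly, the pattern looks like this (a minimal sketch for illustration, not the actual rinokeras code; model_step and test_data are placeholder names):

```python
# Illustrative sketch of the current save_outputs behavior: every batch's full
# output dict is appended to a Python list, so host RAM grows with dataset size,
# and everything is only written out (pickled) at the end of the epoch.
import pickle

all_outputs = []
for batch in test_data:
    outputs = model_step(batch)   # includes internal state like encoder outputs
    all_outputs.append(outputs)   # nothing is freed until the epoch finishes

with open("outputs.pkl", "wb") as f:
    pickle.dump(all_outputs, f)
```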

What we should do is change save_outputs to use an h5py file or something and save outputs incrementally instead of keeping everything in RAM.
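Something along these lines, as a rough sketch (resizable h5py datasets appended to batch by batch; the names and the assumption of fixed per-example shapes are mine, and variable-length outputs would need padding or vlen dtypes):

```python
# Sketch of incremental saving with h5py: create a resizable dataset per output
# the first time it is seen, then grow it and write each batch immediately so
# nothing accumulates in host RAM.
import h5py

with h5py.File("outputs.h5", "w") as f:
    dsets = {}
    for batch in test_data:
        outputs = model_step(batch)  # dict of name -> np.ndarray, batch on axis 0
        for name, array in outputs.items():
            if name not in dsets:
                dsets[name] = f.create_dataset(
                    name,
                    shape=(0,) + array.shape[1:],
                    maxshape=(None,) + array.shape[1:],
                    dtype=array.dtype,
                    chunks=True,
                )
            d = dsets[name]
            d.resize(d.shape[0] + array.shape[0], axis=0)
            d[-array.shape[0]:] = array  # write this batch and move on
```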

@rmrao
Collaborator

rmrao commented Jul 31, 2019

When writing the paper I just added a hack to delete the keys of the output that we didn't need, which is why we were able to actually get results. We need a more robust solution if we're going to expose this feature to other people.

@spark157
Author

spark157 commented Jul 31, 2019

Would it be possible (or make sense) to pass an argument such as --save_outputs False when starting tape-eval? (Currently it is hardcoded as save_outputs: True in config_updates.)

I'm not sure how disabling save_outputs would affect test_metrics = test_graph.run_epoch(save_outputs=outfile), but it looks like save_outputs is optional in rinokeras, so maybe it could be set to False and handled as a separate case.
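For illustration, the kind of change I have in mind (a hypothetical sketch; I haven't checked the actual tape-eval code path):

```python
# Hypothetical quick fix: make save_outputs a CLI/config flag and only pass the
# output file through to run_epoch when saving is actually requested.
if save_outputs:
    test_metrics = test_graph.run_epoch(save_outputs=outfile)
else:
    test_metrics = test_graph.run_epoch(save_outputs=False)
```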

The above might be a quick fix in terms of getting it running (albeit without saving the outputs) until your better solution is implemented.

Scott

@rmrao
Collaborator

rmrao commented Aug 25, 2019

@CaptainCapsaicin I'm not sure what the progress on the h5py option is. It's tricky because we don't want to write a non-general solution into rinokeras, which is used by other students in our lab. I see that there's an open pull request and some conversation with David. Maybe we can take a closer look when I get back next week.

As for a quick fix, we could probably do something like that: tape-eval would print the metrics but wouldn't actually save any outputs. That's mostly fine for some tasks (secondary structure, remote homology), not ideal for others (contact prediction), and pretty bad for the rest (fluorescence, stability). The main issue is that the metrics tape-eval reports are computed on-GPU on a per-batch basis, which is a little tricky for some of the actual metrics we use.
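To illustrate what I mean (illustrative numbers and metric choice only; a rank correlation like Spearman's rho is an example of a metric that needs the full set of predictions, unlike MAE, which averages cleanly over batches):

```python
# Illustrative only: with equal-size batches, the mean of per-batch MAEs equals
# the full-dataset MAE, but the mean of per-batch Spearman correlations is
# generally NOT the full-dataset correlation, so such metrics need all outputs.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
preds = rng.normal(size=1000)
labels = preds + 0.5 * rng.normal(size=1000)

p_batches, l_batches = np.split(preds, 10), np.split(labels, 10)

batch_maes = [np.mean(np.abs(p - l)) for p, l in zip(p_batches, l_batches)]
print(np.mean(batch_maes), np.mean(np.abs(preds - labels)))       # identical

batch_rhos = [spearmanr(p, l).correlation for p, l in zip(p_batches, l_batches)]
print(np.mean(batch_rhos), spearmanr(preds, labels).correlation)  # not the same
```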
