OOM during training #1

Open
mingchen62 opened this issue Jan 8, 2018 · 7 comments

@mingchen62

Got an OOM error during training:

Environment:
tf: 1.4
GPU: Titan X
python 2.7
Ubuntu 16.04

Error:
2018-01-07 22:12:42.933166: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[34560,1]
Traceback (most recent call last):
File "train.py", line 61, in
main()
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "train.py", line 57, in main
model.train(config, train_set, val_set, lr_schedule)
File "/home/hope/im2latex-1/model/base.py", line 160, in train
lr_schedule)
File "/home/hope/im2latex-1/model/img2seq.py", line 173, in _run_epoch
feed_dict=fd)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[34560,1]
[[Node: attn_cell/rnn/while/rnn/att_mechanism/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](attn_cell/rnn/while/rnn/att_mechanism/Reshape, attn_cell/rnn/while/rnn/att_mechanism/MatMul/Enter)]]
[[Node: Mean/_85 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2674_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'attn_cell/rnn/while/rnn/att_mechanism/MatMul', defined at:
File "train.py", line 61, in
main()
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "train.py", line 56, in main
model.build_train(config)
File "/home/hope/im2latex-1/model/img2seq.py", line 41, in build_train
self._add_pred_op()
File "/home/hope/im2latex-1/model/img2seq.py", line 119, in _add_pred_op
self.dropout)
File "/home/hope/im2latex-1/model/decoder.py", line 60, in call
initial_state=attn_cell.initial_state())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 614, in dynamic_rnn
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 777, in _dynamic_rnn_loop
swap_memory=swap_memory)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2816, in while_loop
result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2640, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2590, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 762, in _time_step
(output, new_state) = call_cell()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 748, in
call_cell = lambda: cell(input_t, state)
File "/home/hope/im2latex-1/model/components/attention_cell.py", line 109, in call
new_output, new_state = self.step(inputs, state)
File "/home/hope/im2latex-1/model/components/attention_cell.py", line 79, in step
c = self._attention_mechanism.context(new_h)
File "/home/hope/im2latex-1/model/components/attention_mechanism.py", line 83, in context
e = tf.matmul(att_flat, att_beta)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1898, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 2437, in _mat_mul
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2960, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1473, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[34560,1]
[[Node: attn_cell/rnn/while/rnn/att_mechanism/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](attn_cell/rnn/while/rnn/att_mechanism/Reshape, attn_cell/rnn/while/rnn/att_mechanism/MatMul/Enter)]]
[[Node: Mean/_85 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2674_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

@mingchen62
Author

To add more info:

The output of "nvidia-smi" at the time of the OOM:
In total, all 48 GB of GPU memory (4 cards × ~12 GB) was used.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 00000000:05:00.0 On | N/A |
| 45% 73C P2 66W / 250W | 11762MiB / 12188MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) Off | 00000000:06:00.0 Off | N/A |
| 23% 33C P8 16W / 250W | 11588MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) Off | 00000000:09:00.0 Off | N/A |
| 23% 31C P8 16W / 250W | 11588MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN X (Pascal) Off | 00000000:0A:00.0 Off | N/A |
| 23% 20C P8 15W / 250W | 11588MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

@guillaumegenthial
Owner

Hi @mingchen62,
It seems like the input image was really large (after the convolutions and flattening it got shape 34560, which I assume includes the batch dimension, so just divide by the batch size to get the number of attention positions per image).
Did you use the Harvard dataset? There might be a problem there (check the shapes?).
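For example, a quick script like the one below would flag any unusually large training images. This is a rough sketch, not code from the repo: the image directory path and the PNG extension are assumptions, and the area threshold is just a reasonable cut-off for formula crops.

import os
from PIL import Image

IMG_DIR = "data/images_train"   # assumed location of the training images
MAX_AREA = 400 * 160            # assumed threshold for a "large" formula crop

for name in sorted(os.listdir(IMG_DIR)):
    if not name.lower().endswith(".png"):
        continue
    # Image.open only reads the header, so this stays fast even for many images
    with Image.open(os.path.join(IMG_DIR, name)) as img:
        w, h = img.size
    if w * h > MAX_AREA:
        print("{}: {}x{} = {} px".format(name, w, h, w * h))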
Otherwise, I remember having an OOM issue with a broken TensorFlow install caused by unwanted Ubuntu updates.
Depending on the GPU used, you may have to lower the batch size... but 48 GB seems more than enough.
Keep me updated if you find the fix / origin of your problem!
Cheers,
Guillaume

@mingchen62
Copy link
Author

Hi Guillaume,
Thanks for the great blog post and GitHub repo.
I tried a smaller batch size (4) and was able to run training.
(I do agree that 48 GB of memory should be enough for a decent batch size of 20.)
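For what it's worth, the first dimension of the failing tensor looks like batch_size * (number of attention regions), so the rough arithmetic below is consistent with some very large images in the batch. This is only a sketch: the batch size of 20 and the ~8x encoder downsampling factor are assumptions.

# Shape of the OOM tensor from the traceback: [34560, 1]
flat_dim = 34560
batch_size = 20                    # assumed batch size
regions = flat_dim // batch_size   # attention positions per image
print(regions)                     # 1728

# Assuming the encoder CNN downsamples by roughly 8x in each dimension,
# 1728 regions corresponds to about 1728 * 8 * 8 input pixels,
# i.e. an image much larger than a typical formula crop.
print(regions * 8 * 8)             # 110592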

I used the Harvard dataset (make build).

I will dig more into this and report back if I find anything.

@mingchen62
Author

I wonder if it has to do with the TF version.
@guillaumegenthial are you using a different TF version than 1.4.0-rc0?

I saw this warning:
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

@guillaumegenthial
Owner

I'm using tensorflow==1.4.1.
This warning is expected and shouldn't be a problem.
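For context, the warning usually just means that the gradient of a gather/embedding lookup, which TensorFlow represents as an IndexedSlices, gets converted to a dense tensor whose shape is not statically known. Below is a generic minimal example that triggers it; it is not this repo's code path, just an illustration (TF 1.x API):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 8])   # leading dimension unknown
ids = tf.constant([0, 2, 3])
loss = tf.reduce_sum(tf.gather(x, ids))

grad = tf.gradients(loss, [x])[0]    # an IndexedSlices with unknown dense shape
dense = tf.convert_to_tensor(grad)   # densifying it emits the same UserWarning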
Have you found anything weird about shapes?

@luo3300612

I solved this problem by deleting all images whose size is greater than 400*160 (about 250 images) from the training set.
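A rough sketch of doing that offline is below; the image directory path is an assumption, and you would probably also want to drop the matching entries from the train list file.

import os
from PIL import Image

IMG_DIR = "data/images_train"   # assumed path to the training images
MAX_AREA = 400 * 160

removed = 0
for name in os.listdir(IMG_DIR):
    if not name.lower().endswith(".png"):
        continue
    path = os.path.join(IMG_DIR, name)
    with Image.open(path) as img:
        w, h = img.size
    if w * h > MAX_AREA:
        os.remove(path)
        removed += 1
print("removed {} oversized images".format(removed))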

@tnkong

tnkong commented May 13, 2019

Thanks @luo3300612. I ran into the same OOM problem, at the same node "...rnn/att_mechanism/MatMul/...", and solved it by following your advice.
I modified the method _process_instance in model/utils/data_generator.py:

    ...
    img = imread(self._dir_images + "/" + img_path)  # read the image with scipy's imread
    img_shape = np.shape(img)
    area = img_shape[0] * img_shape[1]
    max_area = 400 * 160
    img = self._img_prepro(img)
    ...
    # skip instances whose raw image is larger than 400x160 pixels
    if area > max_area:
        skip = True

    return inst, skip
