OOM during training #1

Open
mingchen62 opened this issue Jan 8, 2018 · 7 comments

@mingchen62

Got an OOM error during training:

Environment:
tf: 1.4
GPU: Titan X
python 2.7
Ubuntu 16.04

Error:
2018-01-07 22:12:42.933166: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[34560,1]
Traceback (most recent call last):
File "train.py", line 61, in
main()
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "train.py", line 57, in main
model.train(config, train_set, val_set, lr_schedule)
File "/home/hope/im2latex-1/model/base.py", line 160, in train
lr_schedule)
File "/home/hope/im2latex-1/model/img2seq.py", line 173, in _run_epoch
feed_dict=fd)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[34560,1]
[[Node: attn_cell/rnn/while/rnn/att_mechanism/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](attn_cell/rnn/while/rnn/att_mechanism/Reshape, attn_cell/rnn/while/rnn/att_mechanism/MatMul/Enter)]]
[[Node: Mean/_85 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2674_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'attn_cell/rnn/while/rnn/att_mechanism/MatMul', defined at:
File "train.py", line 61, in
main()
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "train.py", line 56, in main
model.build_train(config)
File "/home/hope/im2latex-1/model/img2seq.py", line 41, in build_train
self._add_pred_op()
File "/home/hope/im2latex-1/model/img2seq.py", line 119, in _add_pred_op
self.dropout)
File "/home/hope/im2latex-1/model/decoder.py", line 60, in call
initial_state=attn_cell.initial_state())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 614, in dynamic_rnn
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 777, in _dynamic_rnn_loop
swap_memory=swap_memory)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2816, in while_loop
result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2640, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2590, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 762, in _time_step
(output, new_state) = call_cell()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 748, in
call_cell = lambda: cell(input_t, state)
File "/home/hope/im2latex-1/model/components/attention_cell.py", line 109, in call
new_output, new_state = self.step(inputs, state)
File "/home/hope/im2latex-1/model/components/attention_cell.py", line 79, in step
c = self._attention_mechanism.context(new_h)
File "/home/hope/im2latex-1/model/components/attention_mechanism.py", line 83, in context
e = tf.matmul(att_flat, att_beta)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1898, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 2437, in _mat_mul
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2960, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1473, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[34560,1]
[[Node: attn_cell/rnn/while/rnn/att_mechanism/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](attn_cell/rnn/while/rnn/att_mechanism/Reshape, attn_cell/rnn/while/rnn/att_mechanism/MatMul/Enter)]]
[[Node: Mean/_85 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2674_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

@mingchen62
Author

To add more info:

The output of "nvidia-smi" at the time of the OOM:
In total, all 48 GB of GPU memory (4 cards × ~12 GB) was used.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 00000000:05:00.0 On | N/A |
| 45% 73C P2 66W / 250W | 11762MiB / 12188MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) Off | 00000000:06:00.0 Off | N/A |
| 23% 33C P8 16W / 250W | 11588MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) Off | 00000000:09:00.0 Off | N/A |
| 23% 31C P8 16W / 250W | 11588MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN X (Pascal) Off | 00000000:0A:00.0 Off | N/A |
| 23% 20C P8 15W / 250W | 11588MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

@guillaumegenthial
Owner

Hi @mingchen62,
It seems like the input image was really large (after the convolutions and flattening it got shape 34560, which I assume includes the batch dimension, so just divide by the batch size to get the number of attention positions per image).
Did you use the Harvard dataset? There might be a problem there (check the shapes?).
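For example, a quick script like the one below would flag any unusually large training images. This is a rough sketch, not code from the repo: the image directory path and the PNG extension are assumptions, and the area threshold is just a reasonable cut-off for formula crops.

import os
from PIL import Image

IMG_DIR = "data/images_train"   # assumed location of the training images
MAX_AREA = 400 * 160            # assumed threshold for a "large" formula crop

for name in sorted(os.listdir(IMG_DIR)):
    if not name.lower().endswith(".png"):
        continue
    # Image.open only reads the header, so this stays fast even for many images
    with Image.open(os.path.join(IMG_DIR, name)) as img:
        w, h = img.size
    if w * h > MAX_AREA:
        print("{}: {}x{} = {} px".format(name, w, h, w * h))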
Otherwise, I remember having an OOM issue with a broken TensorFlow install caused by unwanted Ubuntu updates.
Depending on the GPU used, you may have to lower the batch size... but 48 GB seems more than enough.
Keep me updated if you find the fix / origin of your problem!
Cheers,
Guillaume

@mingchen62
Copy link
Author

Hi Guillaume,
Thanks for the great blog post and GitHub repo.
I tried a smaller batch size (4) and was able to run training.
(I do agree that 48 GB of memory should be enough for a decent batch size of 20.)
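For what it's worth, the first dimension of the failing tensor looks like batch_size * (number of attention regions), so the rough arithmetic below is consistent with some very large images in the batch. This is only a sketch: the batch size of 20 and the ~8x encoder downsampling factor are assumptions.

# Shape of the OOM tensor from the traceback: [34560, 1]
flat_dim = 34560
batch_size = 20                    # assumed batch size
regions = flat_dim // batch_size   # attention positions per image
print(regions)                     # 1728

# Assuming the encoder CNN downsamples by roughly 8x in each dimension,
# 1728 regions corresponds to about 1728 * 8 * 8 input pixels,
# i.e. an image much larger than a typical formula crop.
print(regions * 8 * 8)             # 110592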

I used the Harvard dataset (make build).

I will dig more into this and report back if I find anything.

@mingchen62
Author

I wonder if it has to do with the TF version.
@guillaumegenthial are you using a different TF version than 1.4.0-rc0?

I saw this warning:
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

@guillaumegenthial
Owner

I'm using tensorflow==1.4.1.
This warning is expected and shouldn't be a problem.
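For context, the warning usually just means that the gradient of a gather/embedding lookup, which TensorFlow represents as an IndexedSlices, gets converted to a dense tensor whose shape is not statically known. Below is a generic minimal example that triggers it; it is not this repo's code path, just an illustration (TF 1.x API):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 8])   # leading dimension unknown
ids = tf.constant([0, 2, 3])
loss = tf.reduce_sum(tf.gather(x, ids))

grad = tf.gradients(loss, [x])[0]    # an IndexedSlices with unknown dense shape
dense = tf.convert_to_tensor(grad)   # densifying it emits the same UserWarning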
Have you found anything weird about shapes?

@luo3300612

I solved this problem by deleting all images whose size is greater than 400*160 (about 250 images) from the training set.
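A rough sketch of doing that offline is below; the image directory path is an assumption, and you would probably also want to drop the matching entries from the train list file.

import os
from PIL import Image

IMG_DIR = "data/images_train"   # assumed path to the training images
MAX_AREA = 400 * 160

removed = 0
for name in os.listdir(IMG_DIR):
    if not name.lower().endswith(".png"):
        continue
    path = os.path.join(IMG_DIR, name)
    with Image.open(path) as img:
        w, h = img.size
    if w * h > MAX_AREA:
        os.remove(path)
        removed += 1
print("removed {} oversized images".format(removed))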

@tnkong

tnkong commented May 13, 2019

Thanks @luo3300612. I ran into the same OOM problem, at the same node "...rnn/att_mechanism/MatMul/...", and solved it by following your advice.
I modified the method _process_instance in model/utils/data_generator.py:

    ...
    img = imread(self._dir_images + "/" + img_path)  # read the image with scipy's imread
    img_shape = np.shape(img)
    area = img_shape[0] * img_shape[1]
    max_area = 400 * 160
    img = self._img_prepro(img)
    ...
    # skip instances whose raw image is larger than 400x160 pixels
    if area > max_area:
        skip = True

    return inst, skip
