-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OOM during training #1
Comments
To add more info: the output of "nvidia-smi" at the time of OOM: |
Hi @mingchen62, |
hi Guillaume I use Harvard data set (make build). Will dig more into this and report back if I find out anything. |
Wonder if it is to do with TF version. Saw this warning: |
I'm using tensorflow==1.4.1. |
I solve this problem by deleting all images whose size > 400*160(about 250 images) from training set. |
Thanks @luo3300612 , I met the same OOM problem, which happened in the same node "...rnn/att_mechanism/MatMul/...", i solved it by your advice.
|
got OOM when doing training:
Environment:
tf: 1.4
GPU: Titan X
python 2.7
Ubuntu 16.04
Error:
2018-01-07 22:12:42.933166: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[34560,1]
Traceback (most recent call last):
File "train.py", line 61, in
main()
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "train.py", line 57, in main
model.train(config, train_set, val_set, lr_schedule)
File "/home/hope/im2latex-1/model/base.py", line 160, in train
lr_schedule)
File "/home/hope/im2latex-1/model/img2seq.py", line 173, in _run_epoch
feed_dict=fd)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[34560,1]
[[Node: attn_cell/rnn/while/rnn/att_mechanism/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](attn_cell/rnn/while/rnn/att_mechanism/Reshape, attn_cell/rnn/while/rnn/att_mechanism/MatMul/Enter)]]
[[Node: Mean/_85 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2674_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Caused by op u'attn_cell/rnn/while/rnn/att_mechanism/MatMul', defined at:
File "train.py", line 61, in
main()
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "train.py", line 56, in main
model.build_train(config)
File "/home/hope/im2latex-1/model/img2seq.py", line 41, in build_train
self._add_pred_op()
File "/home/hope/im2latex-1/model/img2seq.py", line 119, in _add_pred_op
self.dropout)
File "/home/hope/im2latex-1/model/decoder.py", line 60, in call
initial_state=attn_cell.initial_state())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 614, in dynamic_rnn
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 777, in _dynamic_rnn_loop
swap_memory=swap_memory)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2816, in while_loop
result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2640, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2590, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 762, in _time_step
(output, new_state) = call_cell()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 748, in
call_cell = lambda: cell(input_t, state)
File "/home/hope/im2latex-1/model/components/attention_cell.py", line 109, in call
new_output, new_state = self.step(inputs, state)
File "/home/hope/im2latex-1/model/components/attention_cell.py", line 79, in step
c = self._attention_mechanism.context(new_h)
File "/home/hope/im2latex-1/model/components/attention_mechanism.py", line 83, in context
e = tf.matmul(att_flat, att_beta)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1898, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 2437, in _mat_mul
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2960, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1473, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[34560,1]
[[Node: attn_cell/rnn/while/rnn/att_mechanism/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](attn_cell/rnn/while/rnn/att_mechanism/Reshape, attn_cell/rnn/while/rnn/att_mechanism/MatMul/Enter)]]
[[Node: Mean/_85 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2674_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
The text was updated successfully, but these errors were encountered: