
[Training] RuntimeError: gradient_builder_base.h:123 onnxruntime::training::ArgDef onnxruntime::training::GradientBuilderBase::O(size_t, bool) const i < node_->OutputDefs().size() was false #22955

Open
jagadish-amd opened this issue Nov 27, 2024 · 1 comment
Labels
ep:ROCm questions/issues related to ROCm execution provider model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. training issues related to ONNX Runtime training; typically submitted using template

Comments

@jagadish-amd (Contributor):

Describe the issue

Runtime error before training starts.
Traceback (most recent call last):
File "/workspace/optimum/./examples/onnxruntime/training/language-modeling/run_clm.py", line 671, in
main()
File "/workspace/optimum/./examples/onnxruntime/training/language-modeling/run_clm.py", line 618, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/workspace/optimum/optimum/onnxruntime/trainer.py", line 408, in train
return inner_training_loop(
File "/workspace/optimum/optimum/onnxruntime/trainer.py", line 734, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/trainer.py", line 3485, in training_step
loss = self.compute_loss(model, inputs)
File "/workspace/optimum/optimum/onnxruntime/trainer.py", line 301, in compute_loss
return super().compute_loss(model_with_loss, inputs, return_outputs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/trainer.py", line 3532, in compute_loss
outputs = model(**inputs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 823, in forward
return model_forward(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 811, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_utils.py", line 388, in _forward
return ortmodule._torch_module.forward(*inputs, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_utils.py", line 368, in _forward
return torch_module_ort._execution_manager(torch_module_ort.is_training()).forward(*inputs, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 326, in forward
self._fallback_manager.handle_exception(
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_fallback.py", line 157, in handle_exception
raise exception
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 268, in forward
self._build_graph(graph_transformer_config)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_logger.py", line 161, in wrapper
result = func(graph_execution_manager, *args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 341, in _build_graph
super()._build_graph(graph_transformer_config)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 182, in _build_graph
self.graph_builder.build(config)
RuntimeError: /workspace/onnxruntime/orttraining/orttraining/core/graph/gradient_builder_base.h:123 onnxruntime::training::ArgDef onnxruntime::training::GradientBuilderBase::O(size_t, bool) const i < node_->OutputDefs().size() was false.

The assertion fails on the Dropout node that produces "/_original_module/transformer/h.0/attn/Dropout_output_0": the ONNX Runtime gradient builder expects the node to have two outputs, but it has only one.

The issue is not observed when attention dropout is disabled (config.attn_pdrop = 0 at https://github.com/huggingface/optimum/blob/main/examples/onnxruntime/training/language-modeling/run_clm.py#L435).
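The workaround above can be sketched in isolation. This assumes the model config is built before export; passing the modified config to the model constructor is implied, not shown in the original report:

```python
from transformers import GPT2Config

# Workaround sketch: zero out GPT-2's attention dropout so the attention
# Dropout nodes are not part of the exported training graph.
# GPT2Config's attn_pdrop defaults to a nonzero value (0.1).
config = GPT2Config()
config.attn_pdrop = 0.0

# The config would then be passed when instantiating the model, e.g.
# AutoModelForCausalLM.from_pretrained("gpt2", config=config).
```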

To reproduce

Steps to reproduce the issue:
1. Clone https://github.com/huggingface/optimum
2. Run:
python ./examples/onnxruntime/training/language-modeling/run_clm.py --model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --label_smoothing 0.1 --max_steps 150 --logging_steps 1 --logging_dir log --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --output_dir output --overwrite_output_dir --skip_memory_metrics --fp16 --do_train --do_eval

Urgency

No response

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

09d2ee6

PyTorch Version

2.3

Execution Provider

ROCm

Execution Provider Library Version

ROCm 6.2

@jagadish-amd jagadish-amd added the training issues related to ONNX Runtime training; typically submitted using template label Nov 27, 2024
@github-actions github-actions bot added ep:ROCm questions/issues related to ROCm execution provider model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. labels Nov 27, 2024
@jagadish-amd (Contributor, Author):

We are not sure whether the condition-check failure is due to a mismatch in the optimum/transformers training config or a bug in the ONNX Runtime training module. We would appreciate it if someone could help us here.

This issue occurs on ROCm, but since training has not yet started, it should be an EP-agnostic issue.

cc @jeffdaily @Rohan138
