CVS-175980: Disable pre-allocation of memory for dynamic models through onnxruntime_perf_test #896
Conversation
b29a692 to 91c444c (Compare)
adrianlizarraga left a comment:
This looks solid to me. Thank you for fixing it.
```cpp
bool has_dynamic_output = false;

for (size_t i = 0; i < session_.GetOutputCount(); ++i) {
  auto type_info = session_.GetOutputTypeInfo(i);
```
nit: some minor indentation issues. I only point this out because it will be flagged in the main ORT repo CI.
Thanks, will get this resolved.
Pull request overview
This PR fixes memory allocation issues for dynamic models when using device memory with the OpenVINO execution provider. Previously, the code pre-allocated output buffers by treating dynamic dimensions (-1) as 1, which caused crashes when users specified larger dimensions via the -f or reshape_input options.
Key Changes:
- Detects dynamic dimensions in model outputs before pre-allocating device memory (see the sketch after this list)
- Skips pre-allocation when dynamic outputs are present, allowing runtime dimension resolution
- Preserves existing pre-allocation behavior for static models
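For reference, here is a self-contained sketch of that detection step, mirroring the loop added in this PR; the helper name HasDynamicOutput is illustrative and not part of the patch:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>
#include <onnxruntime_cxx_api.h>

// Illustrative helper (not in the patch): returns true if any model output
// declares a dynamic (-1) dimension, in which case pre-allocation is skipped.
bool HasDynamicOutput(const Ort::Session& session) {
  for (size_t i = 0; i < session.GetOutputCount(); ++i) {
    Ort::TypeInfo type_info = session.GetOutputTypeInfo(i);
    auto tensor_info = type_info.GetTensorTypeAndShapeInfo();
    std::vector<int64_t> shape = tensor_info.GetShape();
    if (std::any_of(shape.begin(), shape.end(),
                    [](int64_t d) { return d == -1; })) {
      return true;
    }
  }
  return false;
}
```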
```cpp
    return Ort::Value(nullptr);
  };
} else {
  new_value = [](OrtAllocator* allocator, const std::vector<int64_t>& output_shape,
          Ort::ConstTensorTypeAndShapeInfo& tensor_info) {
    return Ort::Value::CreateTensor(allocator, output_shape.data(), output_shape.size(),
```
Copilot AI commented on Jan 8, 2026:
Inconsistent indentation: the lambda body at line 1052 should be indented consistently with other lambda bodies in the same context (compare with lines 1023-1025 and 1055-1059).
Suggested change:

```cpp
    return Ort::Value(nullptr);
  };
} else {
  new_value = [](OrtAllocator* allocator, const std::vector<int64_t>& output_shape,
                 Ort::ConstTensorTypeAndShapeInfo& tensor_info) {
    return Ort::Value::CreateTensor(allocator, output_shape.data(), output_shape.size(),
```
```cpp
auto transform_fcn = std::function<int64_t(int64_t)>();
auto new_value = std::function<Ort::Value(OrtAllocator*, const std::vector<int64_t>&, Ort::ConstTensorTypeAndShapeInfo&)>();
transform_fcn = [](int64_t input) { return input; };
```
The declaration of transform_fcn is just a few lines above. If the variable is unconditionally set to an identity function, I think it makes sense to instead remove the variable and skip the transform call below; it's not doing anything.
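For illustration only (the exact call site is not quoted in this thread), the pattern being described looks roughly like the hypothetical reconstruction below; with transform_fcn fixed to the identity, the transform call is a no-op:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical reconstruction of the usage around line 1067.
void ApplyTransform() {
  std::function<int64_t(int64_t)> transform_fcn = [](int64_t input) { return input; };
  std::vector<int64_t> output_shape = {-1, 10};
  // The identity mapping leaves output_shape unchanged, so both the variable
  // and this call could be deleted without changing behavior.
  std::transform(output_shape.begin(), output_shape.end(), output_shape.begin(),
                 transform_fcn);
}
```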
@javier-intel: transform_fcn is used at line 1067; however, there is no conditional setting it anymore.
@adrianlizarraga: can you comment on whether the transform function can be removed altogether?
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
```cpp
auto transform_fcn = std::function<int64_t(int64_t)>();
auto new_value = std::function<Ort::Value(OrtAllocator*, const std::vector<int64_t>&, Ort::ConstTensorTypeAndShapeInfo&)>();
transform_fcn = [](int64_t input) { return input; };
if (device_memory_name_.empty()) {
  transform_fcn = [](int64_t input) { return input; };
  new_value = [](OrtAllocator*, const std::vector<int64_t>&, Ort::ConstTensorTypeAndShapeInfo&) {
    return Ort::Value(nullptr);
  };
} else {
  Ort::MemoryInfo memory_info(nullptr);  // Default initialize, will be overwritten
  if (device_memory_name_ == CUDA) {
    memory_info = Ort::MemoryInfo(device_memory_name_.data(), OrtArenaAllocator, 0, OrtMemTypeDefault);
  } else {
    memory_info = Ort::MemoryInfo(device_memory_name_.data(), OrtArenaAllocator, 0, OrtMemTypeCPUOutput);
  }
  custom_allocator_ = Ort::Allocator(session_, memory_info);
  // Switch to custom
  allocator_ = Ort::UnownedAllocator(custom_allocator_);

  // free dimensions are treated as 1 if not overridden
  transform_fcn = [](int64_t input) { return (input == -1) ? -input : input; };
  new_value = [](OrtAllocator* allocator, const std::vector<int64_t>& output_shape, Ort::ConstTensorTypeAndShapeInfo& tensor_info) {
    return Ort::Value::CreateTensor(allocator, output_shape.data(), output_shape.size(), tensor_info.GetElementType());
  };
  // Do not pre-allocate if dynamic dimensions are present
  bool has_dynamic_output = false;

  for (size_t i = 0; i < session_.GetOutputCount(); ++i) {
    auto type_info = session_.GetOutputTypeInfo(i);
    auto tensor_info = type_info.GetTensorTypeAndShapeInfo();
    auto shape = tensor_info.GetShape();
    if (std::any_of(shape.begin(), shape.end(), [](int64_t d) { return d == -1; })) {
      has_dynamic_output = true;
      break;
    }
  }

  if (has_dynamic_output) {
    new_value = [](OrtAllocator*, const std::vector<int64_t>&, Ort::ConstTensorTypeAndShapeInfo&) {
      return Ort::Value(nullptr);
    };
  } else {
    new_value = [](OrtAllocator* allocator, const std::vector<int64_t>& output_shape,
                   Ort::ConstTensorTypeAndShapeInfo& tensor_info) {
      return Ort::Value::CreateTensor(allocator, output_shape.data(), output_shape.size(),
                                      tensor_info.GetElementType());
    };
  }
```
I'd replace everything in lines 1019-1069 with the following:
```cpp
if (!device_memory_name_.empty()) {
  Ort::MemoryInfo memory_info(nullptr);  // Default initialize, will be overwritten
  if (device_memory_name_ == CUDA) {
    memory_info = Ort::MemoryInfo(device_memory_name_.data(), OrtArenaAllocator, 0, OrtMemTypeDefault);
  } else {
    memory_info = Ort::MemoryInfo(device_memory_name_.data(), OrtArenaAllocator, 0, OrtMemTypeCPUOutput);
  }
  custom_allocator_ = Ort::Allocator(session_, memory_info);
  // Switch to custom
  allocator_ = Ort::UnownedAllocator(custom_allocator_);
}
for (size_t i = 0; i < output_names_raw_ptr.size(); i++) {
  Ort::TypeInfo type_info = session_.GetOutputTypeInfo(i);
  auto tensor_info = type_info.GetTensorTypeAndShapeInfo();
  std::vector<int64_t> output_shape = tensor_info.GetShape();
  auto is_dynamic = std::find(output_shape.begin(), output_shape.end(), -1) != output_shape.end();
  if (is_dynamic || device_memory_name_.empty()) {
    outputs_.emplace_back(Ort::Value(nullptr));
  } else {
    auto new_value = Ort::Value::CreateTensor(allocator_, output_shape.data(), output_shape.size(), tensor_info.GetElementType());
    outputs_.emplace_back(std::move(new_value));
  }
}
```
Thanks @javier-intel, I've updated the code to check each output and pre-allocate only the static ones as suggested.
fc9569c to e46e233 (Compare)
cd6516f to 35e276d (Compare)
Description
This PR disables pre-allocating output buffer memory on device for dynamic models. Now the device_memory_name|OpenVINO_RT_NPU option can be used with the free dimension override (-f) or the reshape_input option, as follows:
```
onnxruntime_perf_test -e openvino -m duration -t 30 -o 0 -C "session.disable_cpu_ep_fallback|1" -i "device_type|NPU device_memory_name|OpenVINO_RT_NPU reshape_input|input[]"
```

or

```
onnxruntime_perf_test -e openvino -m duration -t 30 -o 0 -C "session.disable_cpu_ep_fallback|1" -I -f "dynamic_dimension:100" -i "device_type|NPU device_memory_name|OpenVINO_RT_NPU"
```
Motivation and Context
This PR addresses the following open-source issue: microsoft#26217
When using device memory allocation (via the device_memory_name|OpenVINO_RT_NPU flag) with models containing dynamic dimensions, the current code pre-allocates output buffers by converting all -1 dimensions to 1. This prevents the free dimension override (-f) and the reshape_input option from working as required. For example, if users specify -f batch:5, the pre-allocated buffer (size 1) is too small for the actual inference (size 5), leading to memory crashes, as illustrated below.
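As a concrete illustration of the mismatch (the shapes here are hypothetical, not taken from the issue):

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

int main() {
  // Old behavior: a declared output shape of [-1, 10] is pre-allocated
  // with the -1 dimension treated as 1.
  std::vector<int64_t> preallocated = {1, 10};  // buffer for 10 elements
  // With -f batch:5, inference actually produces:
  std::vector<int64_t> runtime = {5, 10};       // 50 elements

  auto element_count = [](const std::vector<int64_t>& s) {
    return std::accumulate(s.begin(), s.end(), int64_t{1},
                           std::multiplies<int64_t>());
  };
  // 50 elements written through a 10-element buffer overruns device memory.
  return element_count(runtime) > element_count(preallocated) ? 1 : 0;
}
```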