Add enable_profiling in runoptions #26846
Conversation
Pull request overview
This PR adds per-run profiling capability to ONNX Runtime by introducing enable_profiling and profile_file_prefix options to RunOptions. This allows users to enable profiling for individual inference runs independently of session-level profiling, providing more granular control over performance analysis.
Key changes:
- Added `enable_profiling` and `profile_file_prefix` fields to the RunOptions structure
- Modified execution providers to accept an `enable_profiling` parameter in the `GetProfiler()` method
- Enhanced timestamp formatting to include milliseconds for more precise profiling file naming
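For orientation, here is a minimal sketch of enabling profiling for a single run through the internal C++ RunOptions (the public C/C++ API surface is not part of this PR; `session`, `feed_names`, `feeds`, `output_names`, and `fetches` are assumed to be set up elsewhere):

```cpp
// Minimal sketch, assuming the onnxruntime::RunOptions fields added by this PR;
// session creation and input/output plumbing are elided.
onnxruntime::RunOptions run_options;
run_options.enable_profiling = true;                 // new field in this PR
run_options.profile_file_prefix = "my_run_profile";  // new field in this PR
// After the run completes, my_run_profile_<timestamp>.json is written.
ORT_RETURN_IF_ERROR(session.Run(run_options, feed_names, feeds, output_names, &fetches));
```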
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| include/onnxruntime/core/framework/run_options.h | Added enable_profiling flag and profile_file_prefix configuration |
| onnxruntime/python/onnxruntime_pybind_state.cc | Exposed new profiling options to Python API |
| onnxruntime/core/session/inference_session.cc | Implemented run-level profiler creation, initialization, and lifecycle management |
| include/onnxruntime/core/framework/execution_provider.h | Updated GetProfiler signature to accept enable_profiling parameter |
| onnxruntime/core/providers/cuda/cuda_execution_provider.h/cc | Updated GetProfiler implementation for CUDA provider |
| onnxruntime/core/providers/vitisai/vitisai_execution_provider.h/cc | Updated GetProfiler implementation for VitisAI provider |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.h/cc | Implemented session vs run profiler separation using thread_local storage |
| onnxruntime/core/providers/webgpu/webgpu_context.h/cc | Added profiler registration/unregistration and multi-profiler event collection |
| onnxruntime/core/providers/webgpu/webgpu_profiler.cc | Updated to register/unregister with context and handle event collection |
| onnxruntime/core/common/profiler.h/cc | Added overloaded Start and EndTimeAndRecordEvent methods accepting explicit timestamps |
| onnxruntime/core/framework/utils.h/cc | Propagated run_profiler parameter through execution graph functions |
| onnxruntime/core/framework/sequential_executor.h/cc | Added run_profiler support in SessionScope and KernelScope for dual profiling |
```diff
-      run_options.only_execute_path_to_fetches);
+      run_options.only_execute_path_to_fetches,
+      nullptr,
+      run_profiler);
```
We have a number of things being passed from RunOptions here. Can we modify the signature so that a reference to RunOptions is passed instead?
Then we could instantiate the profiler higher in the stack, inside ExecuteGraph.
I can see that RunOptions is already being passed in one of the overloads; that seems sensible.
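A rough sketch of what the suggested signature could look like (names simplified and assumed; the existing ExecuteGraph overloads live in onnxruntime/core/framework/utils.h):

```cpp
// Hypothetical sketch of the reviewer's suggestion: pass RunOptions through
// instead of individual fields, and create the run-level profiler inside
// ExecuteGraph rather than at the call site.
common::Status ExecuteGraph(const SessionState& session_state,
                            FeedsFetchesManager& feeds_fetches_manager,
                            gsl::span<const OrtValue> feeds,
                            std::vector<OrtValue>& fetches,
                            ExecutionMode execution_mode,
                            const RunOptions& run_options,  // replaces the individual flags
                            const logging::Logger& logger);
```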
```diff
       concurrency::ThreadPool::StopProfiling(session_state_.GetThreadPool())},
   });
 
+  std::initializer_list<std::pair<std::string, std::string>> event_args = {
```
```diff
-  if (session_state_.Profiler().IsEnabled()) {
-    session_start_ = session_state.Profiler().Start();
+  bool session_profiling_enabled = session_state_.Profiler().IsEnabled();
+  bool run_profiling_enabled = run_profiler_ && run_profiler_->IsEnabled();
```
I am still not convinced that we should allow both profilers to run in parallel.
Do you have a use case for that? What would be the purpose of collecting the same data twice?
If someone wants continuous profiling, would it not be the same thing as running it with RunOptions?
This depends on how we want to handle the case when both run-level and session-level profiling are enabled.
For example, when a user calls Session::Run with both run-level and session-level profiling enabled, there will be two profilers active: a local run_profiler and the session_profiler_ owned by InferenceSession. The current implementation guarantees that two JSON files are generated, and that the events recorded in the run-level profiling output are a strict subset of those in the session-level profiling output.
In this scenario, each operator execution generates two identical profiling events: one is recorded by the session-level profiler, and the other is recorded by the run-level profiler.
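As a sketch of how a single event can land in both outputs (the explicit-timestamp overloads come from this PR's changes to onnxruntime/core/common/profiler.h; the exact signatures and names here are assumptions):

```cpp
// One kernel execution, recorded into both profilers with shared timestamps,
// so the run-level events are an exact subset of the session-level events.
TimePoint start = session_profiler.Start();
// ... execute the kernel ...
TimePoint end = std::chrono::high_resolution_clock::now();
session_profiler.EndTimeAndRecordEvent(profiling::NODE_EVENT, op_name, start, end);
if (run_profiler != nullptr && run_profiler->IsEnabled()) {
  // Overload taking explicit timestamps: both JSON files show identical events.
  run_profiler->EndTimeAndRecordEvent(profiling::NODE_EVENT, op_name, start, end);
}
```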
```diff
                               event_args);
       events.emplace_back(std::move(event));
 
+  // Distribute the event to all WebGPU EP profilers.
```
Let me lay out the cases:

case 1: session-level: ON, run-level: OFF
session1 = InferenceSession(enable_profiling = true /*session-level*/)
thread1: session1.Run(enable_profiling = false /*run-level*/)
There is one Profiler instance and one WebGpuProfiler instance (owned by the Profiler), so the number of profilers is one. The session-level profiler is always used, i.e. the session_profiler_ of InferenceSession.

case 2: session-level: OFF, run-level: ON
session1 = InferenceSession(enable_profiling = false /*session-level*/)
session1.Run(enable_profiling = true /*run-level*/)
There is one Profiler instance and one WebGpuProfiler instance (owned by the Profiler), so the number of profilers is one. The run-level profiler is always used, i.e. the local variable profiler.

case 3: session-level: ON, run-level: ON
session1 = InferenceSession(enable_profiling = true /*session-level*/)
session1.Run(enable_profiling = true /*run-level*/)
There are two Profiler instances and two WebGpuProfiler instances (each owned by its Profiler), so the number of profilers is two. The run-level profiler is always used.

case 4: session-level: ON, run-level: ON in two threads, OFF in one
session1 = InferenceSession(enable_profiling = true /*session-level*/)
thread1: session1.Run(enable_profiling = true /*run-level*/)
thread2: session1.Run(enable_profiling = true /*run-level*/)
thread3: session1.Run(enable_profiling = false /*run-level*/)
There are three Profiler instances and three WebGpuProfiler instances (each owned by its Profiler). Because the WebGPU EP does not support concurrent runs yet, the number of profilers seen in CollectProfilingData is two: one is the session-level profiler and the other is one of the two run-level EP profilers (which one is used is determined by the current thread during Run).
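A simplified sketch of the multi-profiler collection this implies (the `pending_gpu_events_` member and `RecordEvent` method are assumptions; the real logic lives in webgpu_context.cc):

```cpp
// Each pending GPU timing record is handed to every profiler registered for
// the current run, so all active outputs receive the same GPU events.
void WebGpuContext::CollectProfilingData(std::vector<WebGpuProfiler*>& profilers) {
  for (const auto& event : pending_gpu_events_) {  // pending_gpu_events_ is assumed
    for (WebGpuProfiler* profiler : profilers) {
      profiler->RecordEvent(event);  // RecordEvent is assumed
    }
  }
  pending_gpu_events_.clear();
}
```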
```diff
     profilers.push_back(session_profiler_);
   }
 
+  if (run_options.enable_profiling && tls_run_profiler_) {
```
There are two types of profiling: CPU time, collected by the Profiler, and GPU time, collected by EP profilers (e.g., the WebGPU profiler). I have no better way to get the correct WebGPU profiler for the current running thread in the case mentioned above (case 4: session-level: ON, run-level: two threads ON, one thread OFF), so I stored it in TLS.
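Roughly, the TLS slot looks like this (a sketch; the actual declaration in webgpu_execution_provider.h may differ):

```cpp
// Sketch: a thread-local pointer to the current run's WebGPU EP profiler.
// Each thread that enables run-level profiling sets this during Run, so
// concurrent runs never observe each other's profiler.
thread_local WebGpuProfiler* tls_run_profiler_ = nullptr;
```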
Oh, I just thought of a possible solution. Do you think we could make Profiler a member of RunOptions, instead of a local variable in InferenceSession::Run?
Before:
```cpp
Status WebGpuExecutionProvider::OnRunEnd(bool /* sync_stream */, const onnxruntime::RunOptions& run_options) {
  ...
  if (run_options.enable_profiling && tls_run_profiler_) {
    if (tls_run_profiler_->Enabled()) {
      profilers.push_back(tls_run_profiler_);
    }
    tls_run_profiler_ = nullptr;
  }
  if (!profilers.empty()) {
    context_.CollectProfilingData(profilers);
  }
}
```
After:
```cpp
Status WebGpuExecutionProvider::OnRunEnd(bool /* sync_stream */, const onnxruntime::RunOptions& run_options) {
  ...
  if (run_options.enable_profiling && run_options.run_profiler) {
    if (run_options.run_profiler->Enabled()) {
      profilers.push_back(run_options.run_profiler);
    }
  }
  if (!profilers.empty()) {
    context_.CollectProfilingData(profilers);
  }
}
```
```diff
+  // The actual filename will be: <profile_file_prefix>_<timestamp>.json
+  // Only used when enable_profiling is true.
+  std::string profile_file_prefix = "onnxruntime_run_profile";
```
We need the C and C++ APIs, and Python comes after that.
And we need tests.
yuslepukhin left a comment:
🕐
Description
Support run-level profiling
This PR adds support for profiling individual Run executions, similar to session-level profiling. Developers can enable run-level profiling by setting `enable_profiling` and `profile_file_prefix` in RunOptions. Once the run completes, a JSON profiling file is saved using `profile_file_prefix` + timestamp.
Key Changes
- A local run_profiler is created in InferenceSession::Run and destroyed after the run completes. Using a dedicated profiler per run ensures that profiling data is isolated and prevents interleaving or corruption across runs.
- Overloaded Start and EndTimeAndRecordEvent functions have been added. These allow the caller to provide timestamps instead of relying on std::chrono::high_resolution_clock::now(), avoiding potential timing inaccuracies.
- Added tls_run_profiler_ to support run-level profiling with the WebGPU Execution Provider (EP). This ensures that when multiple threads enable run-level profiling, each thread logs only to its own WebGPU profiler, keeping thread-specific data isolated.
- The timestamp in the JSON filename now uses HH:MM:SS.mm instead of HH:MM:SS to prevent conflicts when profiling multiple consecutive runs.
Motivation and Context
Previously, profiling was only available at the session level. Sometimes a developer wants to profile a specific run, hence this PR.
Some details
When profiling is enabled via RunOptions, it should ideally collect two types of events:
1. CPU events, used to calculate the CPU execution time of each operator.
2. GPU events, used to measure GPU kernel execution time.
Unlike session-level profiling, we need to ensure that event collection is correct in multi-threaded scenarios.
For 1, this can be supported easily (sequential_executor.cc). We use a thread-local storage (TLS) variable, RunLevelState (defined in profiler.h), to maintain run-level profiling state for each thread; a sketch follows.
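The name RunLevelState comes from the PR description; the fields in this sketch are assumptions:

```cpp
// Per-thread run-level profiling state, consulted by the sequential executor
// when it records per-operator CPU events.
struct RunLevelState {
  profiling::Profiler* run_profiler = nullptr;  // profiler for the active Run, if any
  bool enabled = false;                         // run-level profiling flag for this thread
};
thread_local RunLevelState run_level_state;
```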
For 2, each Execution Provider (EP) has its own profiler implementation, and each EP must ensure correct behavior under run-level profiling. This PR ensures that the WebGPU profiler works correctly with run-level profiling.
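For reference, the EP-facing change described in the review table might look like this (a sketch based on the summary above; the exact signature and default body are assumptions):

```cpp
// execution_provider.h: GetProfiler now takes the run-level flag so an EP can
// return a fresh per-run profiler instead of always reusing the session one.
virtual std::unique_ptr<profiling::EpProfiler> GetProfiler(bool enable_profiling) {
  ORT_UNUSED_PARAMETER(enable_profiling);
  return nullptr;  // default: the EP exposes no profiler
}
```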
Test Cases
Test 1: three threads, run-level profiling on two of them
t1: sess1.Run({ enable_profiling: true })
t2: sess1.Run({ enable_profiling: false })
t3: sess1.Run({ enable_profiling: true })
Two profiling files are generated: one for t1 and one for t3.
Test 2: session-level and run-level profiling both enabled
sess1 = OrtSession({ enable_profiling: true })
sess1.Run({ enable_profiling: true })