-
I am using a tiny model for video filtering. To improve throughput, I launch parallel inferences over a group of video frames. Since the CUDA provider appears to serialize concurrent calls to the same session, I create a pool of sessions for the same model. So far this approach works, but my concern is that I am wasting GPU memory by replicating each session's static state. Is there a way to create clones of a session? I am aware of the new 'CreateSessionWithPrepackedWeightsContainer' API call, but I do not know how to use it to share all the static data of a session. What is the most efficient way to run parallel inferences on the same model using the CUDA provider?
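For reference, here is a minimal sketch of the session-pool approach described above. The `SessionPool` class and `make_session` factory are hypothetical names; with ONNX Runtime the factory would construct an `onnxruntime.InferenceSession` with the CUDA provider, but a stand-in object is used here so the pattern itself is self-contained:

```python
import queue
import threading

class SessionPool:
    """Fixed-size pool of inference sessions for concurrent use.

    Each worker borrows a session, runs inference, and returns it,
    so no two threads ever call Run() on the same session at once.
    """
    def __init__(self, make_session, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(make_session())

    def run(self, fn):
        session = self._pool.get()    # blocks until a session is free
        try:
            return fn(session)
        finally:
            self._pool.put(session)   # always return the session to the pool

# Usage with a stand-in session factory; with onnxruntime this would be e.g.
#   make_session = lambda: ort.InferenceSession(
#       "model.onnx", providers=["CUDAExecutionProvider"])
pool = SessionPool(make_session=lambda: object(), size=4)

def infer(frame):
    # Placeholder for session.run(...); doubles the frame index here.
    return pool.run(lambda s: frame * 2)

results = [None] * 8
threads = [threading.Thread(target=lambda i=i: results.__setitem__(i, infer(i)))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results == [0, 2, 4, 6, 8, 10, 12, 14]
```

The drawback this sketch shares with the real setup is that every pooled session holds its own copy of the model's weights on the GPU, which is exactly the duplication the question asks how to avoid.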
-
If that's true, we should fix this. Do you have more details? Do you know which part of the code restricts the concurrency?
-
Just FYI: CreateSessionWithPrepackedWeightsContainer has nothing to do with CUDA.