-
I am using a tiny model for video filtering. To improve throughput, I launch parallel inferences over a group of video frames. Since the CUDA provider appears to serialize concurrent calls to the same session, I create a pool of sessions for the same model. So far this approach works, but my concern is that I am wasting GPU memory by replicating each session's static state. Is there a way to create clones of a session? I am aware of the new 'CreateSessionWithPrepackedWeightsContainer' API call, but I do not know how to use it to share all the static data of a session. What is the most efficient way to run parallel inferences on the same model using the CUDA provider?
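For reference, here is a minimal sketch of the session-pool approach described above. The `SessionPool` class and `make_session` factory are hypothetical names; with ONNX Runtime the factory would construct an `onnxruntime.InferenceSession` with the CUDA provider, but a stand-in object is used here so the pattern itself is self-contained:

```python
import queue
import threading

class SessionPool:
    """Fixed-size pool of inference sessions for concurrent use.

    Each worker borrows a session, runs inference, and returns it,
    so no two threads ever call Run() on the same session at once.
    """
    def __init__(self, make_session, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(make_session())

    def run(self, fn):
        session = self._pool.get()    # blocks until a session is free
        try:
            return fn(session)
        finally:
            self._pool.put(session)   # always return the session to the pool

# Usage with a stand-in session factory; with onnxruntime this would be e.g.
#   make_session = lambda: ort.InferenceSession(
#       "model.onnx", providers=["CUDAExecutionProvider"])
pool = SessionPool(make_session=lambda: object(), size=4)

def infer(frame):
    # Placeholder for session.run(...); doubles the frame index here.
    return pool.run(lambda s: frame * 2)

results = [None] * 8
threads = [threading.Thread(target=lambda i=i: results.__setitem__(i, infer(i)))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results == [0, 2, 4, 6, 8, 10, 12, 14]
```

The drawback this sketch shares with the real setup is that every pooled session holds its own copy of the model's weights on the GPU, which is exactly the duplication the question asks how to avoid.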
-
If that's true, we should fix this. Do you have more details? Do you know which part of the code restricts the concurrency?
-
Just FYI: CreateSessionWithPrepackedWeightsContainer has nothing to do with CUDA.