GPU memory allocator for multiple CUDA streams #12919
Joeyzhouqihui asked this question in Other Q&A (unanswered).
Reply:

Multi-stream support is in the works. cc @souptc
Original question:

Hi, sorry to bother you!
I am trying to deploy one of our company's models, which has dynamic connections, to a production environment. Because the model is dynamically activated, batching requests together for inference is not a good idea. Instead, I want to use multiple CUDA streams to handle several requests concurrently on one GPU (one stream per request).
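For example, what I have in mind is roughly the sketch below (Request and run_inference_async are placeholders for our model code, not a real API):

```cpp
#include <cuda_runtime.h>
#include <vector>

struct Request { /* inputs/outputs for one request */ };

// Hypothetical helper: enqueues all of one request's kernels and
// copies on `stream` without blocking the host.
void run_inference_async(const Request& req, cudaStream_t stream);

void serve_concurrently(const std::vector<Request>& requests) {
    std::vector<cudaStream_t> streams(requests.size());
    for (auto& s : streams) cudaStreamCreate(&s);

    // One stream per request: work from different requests can overlap
    // on the GPU instead of serializing on a single stream.
    for (size_t i = 0; i < requests.size(); ++i)
        run_inference_async(requests[i], streams[i]);

    for (auto& s : streams) {
        cudaStreamSynchronize(s);  // wait for this request to finish
        cudaStreamDestroy(s);
    }
}
```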
I have tried libtorch, since it supports multiple streams. However, I found that with libtorch, the memory allocated on a stream is cached by that stream and cannot be reused by other streams. (Suppose one GPU has 2 GB of memory, and stream A has cached 1 GB after handling request 1. When stream B wants to handle request 2, stream A first has to release its cached memory, and stream B then has to call cudaMalloc again, which is very slow.)
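Here is a minimal libtorch sketch of the behavior I mean (the tensor size is just for illustration; as far as I can tell, emptyCache releases unused cached blocks back to the driver rather than migrating them to another stream):

```cpp
#include <torch/torch.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDACachingAllocator.h>

int main() {
    c10::cuda::CUDAStream a = c10::cuda::getStreamFromPool();
    c10::cuda::CUDAStream b = c10::cuda::getStreamFromPool();

    {
        c10::cuda::CUDAStreamGuard guard(a);  // stream A is now current
        // ~1 GB tensor; the backing block is tagged with stream A.
        auto t = torch::empty({256, 1024, 1024},
                              torch::dtype(torch::kFloat).device(torch::kCUDA));
    }  // t is freed here, but its block stays parked in A's cache

    // The cached block is not handed over to stream B. To reuse the
    // memory on B, it first has to be released to the driver...
    c10::cuda::CUDACachingAllocator::emptyCache();

    {
        c10::cuda::CUDAStreamGuard guard(b);
        // ...so this allocation pays the cudaMalloc cost again.
        auto u = torch::empty({256, 1024, 1024},
                              torch::dtype(torch::kFloat).device(torch::kCUDA));
    }
    return 0;
}
```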
I am wondering whether the same thing happens with onnxruntime. Can different streams in onnxruntime reuse cached GPU memory?
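For reference, the usage pattern I would like is roughly the sketch below: several threads calling Run concurrently on one session with the CUDA execution provider, since Run is documented as thread-safe and, as far as I understand, all Run calls on one session share that session's GPU arena. The model path and input/output names are placeholders, and I am not sure how current releases map concurrent runs onto CUDA streams:

```cpp
#include <onnxruntime_cxx_api.h>
#include <array>
#include <thread>
#include <vector>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "multi_request");
    Ort::SessionOptions so;
    OrtCUDAProviderOptions cuda_options{};  // defaults: device 0
    so.AppendExecutionProvider_CUDA(cuda_options);
    Ort::Session session(env, "model.onnx", so);  // placeholder path

    // One synchronous Run call per request; Run is thread-safe, so the
    // calls below may execute concurrently against the same session.
    auto serve_one = [&](std::vector<float> input) {
        auto mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
        std::array<int64_t, 2> shape{1, (int64_t)input.size()};
        Ort::Value in = Ort::Value::CreateTensor<float>(
            mem, input.data(), input.size(), shape.data(), shape.size());
        const char* in_names[] = {"input"};    // placeholder name
        const char* out_names[] = {"output"};  // placeholder name
        auto out = session.Run(Ort::RunOptions{nullptr},
                               in_names, &in, 1, out_names, 1);
    };

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back(serve_one, std::vector<float>(8, 1.0f));
    for (auto& t : workers) t.join();
    return 0;
}
```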
I am looking forward to your reply! Thank you so much!