GPU memory allocator for multiple CUDA streams #12919
Joeyzhouqihui asked this question in Other Q&A (unanswered).
Reply:

Multi-stream support is in the works. cc @souptc
Original question:

Hi, sorry to bother you!
I am trying to deploy one of our company's models, which has dynamic connections, to a production environment. Because the model is dynamically activated, batching requests together for inference is not a good idea. Instead, I want to use multiple CUDA streams to handle several requests concurrently on one GPU (one stream per request).
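For example, what I have in mind is roughly the sketch below (Request and run_inference_async are placeholders for our model code, not a real API):

```cpp
#include <cuda_runtime.h>
#include <vector>

struct Request { /* inputs/outputs for one request */ };

// Hypothetical helper: enqueues all of one request's kernels and
// copies on `stream` without blocking the host.
void run_inference_async(const Request& req, cudaStream_t stream);

void serve_concurrently(const std::vector<Request>& requests) {
    std::vector<cudaStream_t> streams(requests.size());
    for (auto& s : streams) cudaStreamCreate(&s);

    // One stream per request: work from different requests can overlap
    // on the GPU instead of serializing on a single stream.
    for (size_t i = 0; i < requests.size(); ++i)
        run_inference_async(requests[i], streams[i]);

    for (auto& s : streams) {
        cudaStreamSynchronize(s);  // wait for this request to finish
        cudaStreamDestroy(s);
    }
}
```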
I have tried libtorch, since it supports multiple streams. However, I found that with libtorch, the memory allocated on a stream is cached by that stream and cannot be reused by other streams. (Suppose one GPU has 2 GB of memory, and stream A has cached 1 GB after handling request 1. When stream B wants to handle request 2, stream A first has to release its cached memory, and stream B then has to call cudaMalloc again, which is very slow.)
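Here is a minimal libtorch sketch of the behavior I mean (the tensor size is just for illustration; as far as I can tell, emptyCache releases unused cached blocks back to the driver rather than migrating them to another stream):

```cpp
#include <torch/torch.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDACachingAllocator.h>

int main() {
    c10::cuda::CUDAStream a = c10::cuda::getStreamFromPool();
    c10::cuda::CUDAStream b = c10::cuda::getStreamFromPool();

    {
        c10::cuda::CUDAStreamGuard guard(a);  // stream A is now current
        // ~1 GB tensor; the backing block is tagged with stream A.
        auto t = torch::empty({256, 1024, 1024},
                              torch::dtype(torch::kFloat).device(torch::kCUDA));
    }  // t is freed here, but its block stays parked in A's cache

    // The cached block is not handed over to stream B. To reuse the
    // memory on B, it first has to be released to the driver...
    c10::cuda::CUDACachingAllocator::emptyCache();

    {
        c10::cuda::CUDAStreamGuard guard(b);
        // ...so this allocation pays the cudaMalloc cost again.
        auto u = torch::empty({256, 1024, 1024},
                              torch::dtype(torch::kFloat).device(torch::kCUDA));
    }
    return 0;
}
```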
I am wondering whether the same thing happens with onnxruntime. Can different streams in onnxruntime reuse cached GPU memory?
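For reference, the usage pattern I would like is roughly the sketch below: several threads calling Run concurrently on one session with the CUDA execution provider, since Run is documented as thread-safe and, as far as I understand, all Run calls on one session share that session's GPU arena. The model path and input/output names are placeholders, and I am not sure how current releases map concurrent runs onto CUDA streams:

```cpp
#include <onnxruntime_cxx_api.h>
#include <array>
#include <thread>
#include <vector>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "multi_request");
    Ort::SessionOptions so;
    OrtCUDAProviderOptions cuda_options{};  // defaults: device 0
    so.AppendExecutionProvider_CUDA(cuda_options);
    Ort::Session session(env, "model.onnx", so);  // placeholder path

    // One synchronous Run call per request; Run is thread-safe, so the
    // calls below may execute concurrently against the same session.
    auto serve_one = [&](std::vector<float> input) {
        auto mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
        std::array<int64_t, 2> shape{1, (int64_t)input.size()};
        Ort::Value in = Ort::Value::CreateTensor<float>(
            mem, input.data(), input.size(), shape.data(), shape.size());
        const char* in_names[] = {"input"};    // placeholder name
        const char* out_names[] = {"output"};  // placeholder name
        auto out = session.Run(Ort::RunOptions{nullptr},
                               in_names, &in, 1, out_names, 1);
    };

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back(serve_one, std::vector<float>(8, 1.0f));
    for (auto& t : workers) t.join();
    return 0;
}
```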
I am looking forward to your reply! Thank you so much!