-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable Cuda in Graphics Implementation for TensorRT backend #100
base: main
Are you sure you want to change the base?
Conversation
@nv-kmcgill53 Please review this as discussed |
CMakeLists.txt
Outdated
@@ -269,6 +269,7 @@ target_link_libraries( | |||
triton-tensorrt-backend | |||
PRIVATE | |||
CUDA::cudart | |||
CUDA::cuda_driver |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nv-kmcgill53 , @mc-nv, @tanmayv25 - any issues with this dependency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the context behind adding this dependency?
From documentation is see:
CUDA Driver Library
The CUDA Driver library (cuda) are used by applications that use calls such ascuMemAlloc
, andcuMemFree
.
Targets Created:
CUDA::cuda_driver
Aren't this dependency is requisite of TensorRT itself?
Thought by default our product expect driver to be installed and if GPU capability given then available for usage including driver targets and binaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding this dependency should be fine. Ashish is linking correctly according to the cuda documentation. As it states
Context management can be done through the driver API, but is not exposed in the runtime API
So they will need to link against the driver instead of just linking against the cuda runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So they will need to link against the driver instead of just linking against the cuda runtime.
I'm not agree with this statement, current linkage doesn't explain why user want to add it explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cmake documentation isn't exhaustive when it mentions cuMemAlloc
and cuMemFree
. The user in this case is using the Driver API to set/pass the cuda context around in the backend, rather than letting the core
take care of this. This is the reason for adding the CUDA::cuda_driver
lib to the linking path. This PR necessarily makes use of functions in the driver where the trt_backend didn't before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Triton TensorRT Backend is unable to work without CUDA, Triton Inference Server and TensorRT installation.
Current change, per my understanding, uses only cudaSetDevice
(CUDA::cudart) and cudaGetErrorString
(CUDA runtime API) and those dependencies are satisfied. There why I don't see any reason to link against CUDA::cuda_driver
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The additional dependency is required for cuCtxPushCurrent() / cuCtxPopCurrent() @ https://github.com/triton-inference-server/tensorrt_backend/pull/100/files#diff-3137e95db7c97b9ddd71fa1b600e3dd646025c5a872ad94d4d09f04313fe3fcdR66
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason we need this dependency is because we are using a special context call Cuda in Graphics (CiG) context which has to work with the cuda driver dll for its operations.
@ashishk98 - can you install and run pre-commit checks locally?
|
@nnshah1 fixed pre-commit |
src/tensorrt_model.h
Outdated
//! for gaming use case. Creating a shared contexts reduces context switching | ||
//! overhead and leads to better performance of model execution along side | ||
//! Graphics workload. | ||
CUcontext GetCiGContext() { return cig_ctx_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ashishk98 question: is this specific to CIG - or could be applied to any application provided cuda context?
CMakeLists.txt
Outdated
target_compile_definitions( | ||
triton-tensorrt-backend | ||
PRIVATE TRITON_ENABLE_CIG | ||
) | ||
target_link_libraries( | ||
triton-tensorrt-backend | ||
PRIVATE | ||
CUDA::cuda_driver | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These setting could be achieved with generator expression, isn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a generator expression?
src/tensorrt_model.cc
Outdated
triton::common::TritonJson::Value value; | ||
std::string ptr_value; | ||
if (parameters.Find("CIG_CONTEXT_PTR", &value)) { | ||
RETURN_IF_ERROR(value.MemberAsString("string_value", &ptr_value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ashishk98 instead of directly converting here as a special case, I would prefer to use something similar to what is done in the trt-llm backend:
In this case there is a template method to convert from a parameter to a value - I think the code will be a little clearer to follow.
Also - can we convert to and from a 64 bit integer?
so something like:
model_state->GetParameter<uint64>("CUDA_CONTEXT");
Also it strikes me that although we use value.MemberAsString()
we could also use value.MemberAsUint("string_value",&ptr_value)
So two things - 1) add a templated GetParameter() method and 2) we can use MemberAsUint for the uint64 template. 3) officially transfer uint64 values and convert them to and from context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a GetParameter call for std::string instead of UINT64. This is because when we add the parameter to model config it is directly converted into a hex string instead of a numeric string. Hence while parsing the pointer, MemberAsUint fails because it gets a hex string to parse.
src/model_state.cc
Outdated
.c_str()); | ||
#ifdef TRITON_ENABLE_CIG | ||
// Set device if CiG is disabled | ||
if (!isCiGEnabled()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tanmayv25, @ashishk98 - is there a way to have a single scoped object
ScopedCudaDeviceContext
That internally checks if there is an application_context
and if there is an application context uses push / pop - if not uses cudaSetDevice
?
We don't currently use them in the same locations - but am wondering if that would be possible - I think it would be cleaner logically - where basically an 'application_context' takes the place of the 'device' but otherwise the logic remains the same.
ScopedObject(Device);
ScopedObject(Context);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can take a look at this in the next iteration
src/model_state.cc
Outdated
@@ -175,7 +175,13 @@ ModelState::ModelState(TRITONBACKEND_Model* triton_model) | |||
ModelState::~ModelState() | |||
{ | |||
for (auto& device_engine : device_engines_) { | |||
cudaSetDevice(device_engine.first.first); | |||
#ifdef TRITON_ENABLE_CIG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know I had asked for looking at macros to enable, but I would like to avoid this kind of guard - if we can use a single method and then have two different implementations of that method / object would prefer that to having the macros embedded in the functions / methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
I would like to see:
|
if (!model_state_->isCiGEnabled()) | ||
#endif // TRITON_ENABLE_CIG | ||
{ | ||
cudaSetDevice(DeviceId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind to share the reasoning of avoiding the set device calls? Wouldn't that cause the issue of model not being placed / executed on selected device (based on model config)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The intended use of cuda context sharing is targeted only of single GPU (RTX end-user) systems. I wanted to avoid complications with this use case
- When we call cudaSetDevice() the cuda runtime resets the to using the default cuda context for the thread
How model instances on multiple GPUs will be handled? AFAIK a cuda context is per device. If we have more than one GPU devices, then we should pass cuda context handles for each GPU device. Or am I missing something here? |
@ashishk98 I believe we still need to raise an error if someone tries to use pre-built cuda context with multi-GPU environment, right? |
Can the stakeholders provide another round of reviews on this PR? We'd like to get these changes into a release asset this week. |
Add cuda context sharing support for TensorRT backend to reduce context switching overhead when graphics workload is running in parallel