Enable Cuda in Graphics Implementation for TensorRT backend #100

Open
wants to merge 13 commits into base: main

Conversation

ashishk98

Add CUDA context sharing support to the TensorRT backend to reduce context-switching overhead when a graphics workload is running in parallel.

@ashishk98
Author

ashishk98 commented Sep 5, 2024

@nv-kmcgill53 Please review this as discussed

CMakeLists.txt Outdated
@@ -269,6 +269,7 @@ target_link_libraries(
triton-tensorrt-backend
PRIVATE
CUDA::cudart
CUDA::cuda_driver

@nv-kmcgill53, @mc-nv, @tanmayv25 - any issues with this dependency?

Contributor

@mc-nv mc-nv Sep 18, 2024

What is the context behind adding this dependency?
From the documentation I see:

CUDA Driver Library
The CUDA Driver library (cuda) are used by applications that use calls such as cuMemAlloc, and cuMemFree.
Targets Created:
CUDA::cuda_driver

Isn't this dependency already a requisite of TensorRT itself?
I thought that by default our product expects the driver to be installed, and that if GPU capability is given, the driver targets and binaries are available for use.

Contributor

Adding this dependency should be fine. Ashish is linking correctly according to the CUDA documentation, which states:

Context management can be done through the driver API, but is not exposed in the runtime API

So they will need to link against the driver instead of just linking against the cuda runtime.

Contributor

So they will need to link against the driver instead of just linking against the cuda runtime.

I don't agree with this statement; the current linkage doesn't explain why the user wants to add it explicitly.

Contributor

@nv-kmcgill53 nv-kmcgill53 Sep 18, 2024

The CMake documentation isn't exhaustive when it mentions cuMemAlloc and cuMemFree. In this case the user is using the Driver API to set and pass the CUDA context around in the backend, rather than letting the core take care of this. This is the reason for adding the CUDA::cuda_driver lib to the linking path. This PR necessarily makes use of driver functions where the trt_backend didn't before.

Contributor

The Triton TensorRT backend cannot work without CUDA, Triton Inference Server, and TensorRT installed.
The current change, per my understanding, uses only cudaSetDevice (CUDA::cudart) and cudaGetErrorString (CUDA runtime API), and those dependencies are already satisfied. That's why I don't see any reason to link against CUDA::cuda_driver.

Author

We need this dependency because we are using a special context called a CUDA in Graphics (CiG) context, which has to work with the CUDA driver DLL for its operations.

@nnshah1

nnshah1 commented Sep 18, 2024

@ashishk98 - can you install and run pre-commit checks locally?

cd repo; pre-commit install; pre-commit run --all

@ashishk98
Author

@nnshah1 fixed pre-commit

@ashishk98
Author

@nnshah1 @mc-nv I have added a new CMake option, TRITON_ENABLE_CIG, which conditionally enables the CiG code path and conditionally links the cuda_driver component of CUDA.

//! for gaming use case. Creating a shared contexts reduces context switching
//! overhead and leads to better performance of model execution along side
//! Graphics workload.
CUcontext GetCiGContext() { return cig_ctx_; }

@ashishk98 question: is this specific to CiG, or could it be applied to any application-provided CUDA context?

CMakeLists.txt Outdated
Comment on lines 277 to 285
target_compile_definitions(
triton-tensorrt-backend
PRIVATE TRITON_ENABLE_CIG
)
target_link_libraries(
triton-tensorrt-backend
PRIVATE
CUDA::cuda_driver
)
Contributor

These settings could be achieved with a generator expression, couldn't they?

Author

What is a generator expression?

triton::common::TritonJson::Value value;
std::string ptr_value;
if (parameters.Find("CIG_CONTEXT_PTR", &value)) {
RETURN_IF_ERROR(value.MemberAsString("string_value", &ptr_value));

@ashishk98 instead of directly converting here as a special case, I would prefer to use something similar to what is done in the trt-llm backend:

https://github.com/triton-inference-server/tensorrtllm_backend/blob/8ffb174c0fe88e677eeed7928348e20be548f3f6/inflight_batcher_llm/src/model_state.cc#L204

In this case there is a template method to convert from a parameter to a value - I think the code will be a little clearer to follow.

Also - can we convert to and from a 64-bit integer?

so something like:

model_state->GetParameter<uint64>("CUDA_CONTEXT");

Also, it strikes me that although we use value.MemberAsString(), we could instead use value.MemberAsUint("string_value", &ptr_value) (https://github.com/triton-inference-server/common/blob/578491fc3944f77d16a6a38e3d7691c485c47ba0/include/triton/common/triton_json.h#L927).

So three things: 1) add a templated GetParameter() method, 2) use MemberAsUint for the uint64 template, and 3) officially transfer uint64 values and convert them to and from the context.
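
As a rough illustration of the templated accessor suggested here, the sketch below uses a plain std::map as a stand-in for the model config parameters. The `Parameters` alias and this `GetParameter` signature are assumptions made for the example; they are not the TritonJson or tensorrtllm_backend API.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Stand-in for the model config "parameters" object: key -> string_value.
// (Assumption for this sketch; the backend reads these through TritonJson.)
using Parameters = std::map<std::string, std::string>;

// Only the specializations below are defined, so an unsupported type
// fails at link time instead of silently doing the wrong conversion.
template <typename T>
T GetParameter(const Parameters& params, const std::string& name);

// String parameters are returned as stored; a missing key throws.
template <>
std::string GetParameter<std::string>(
    const Parameters& params, const std::string& name)
{
  const auto it = params.find(name);
  if (it == params.end()) {
    throw std::invalid_argument("missing parameter: " + name);
  }
  return it->second;
}

// 64-bit parameters are parsed from decimal text.
// std::stoull throws on text that does not start with a number.
template <>
std::uint64_t GetParameter<std::uint64_t>(
    const Parameters& params, const std::string& name)
{
  return std::stoull(GetParameter<std::string>(params, name));
}
```

With this shape, call sites read as `GetParameter<std::uint64_t>(params, "CUDA_CONTEXT")`, and the conversion logic lives in one place.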

Author

I have added a GetParameter call for std::string instead of UINT64. This is because when we add the parameter to the model config, it is converted directly into a hex string rather than a numeric string. Hence, when parsing the pointer, MemberAsUint fails because it is given a hex string to parse.
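
To make the parsing issue concrete, here is a minimal sketch of reading a hex string into a 64-bit value with the standard library. `ParseHexU64` is a hypothetical helper written for this example, not a function in the backend, and the accepted input format (optional `0x` prefix, hex digits only) is an assumption about what the model config contains.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse text such as "0x7f3a5c001200" or
// "7f3a5c001200" as a base-16 number.
// Returns false for empty input, trailing junk, or overflow.
bool ParseHexU64(const std::string& text, std::uint64_t* out)
{
  // std::stoull skips leading whitespace and accepts a sign; reject both
  // up front by requiring the first character to be a hex digit.
  if (text.empty() ||
      !std::isxdigit(static_cast<unsigned char>(text[0]))) {
    return false;
  }
  try {
    std::size_t consumed = 0;
    const unsigned long long value = std::stoull(text, &consumed, 16);
    if (consumed != text.size()) {
      return false;  // something after the number was not a hex digit
    }
    *out = static_cast<std::uint64_t>(value);
    return true;
  }
  catch (const std::invalid_argument&) {
    return false;
  }
  catch (const std::out_of_range&) {
    return false;
  }
}
```

The parsed value could then be cast to the context handle, for example `reinterpret_cast<CUcontext>(static_cast<std::uintptr_t>(value))`, assuming the string really holds a pointer that is valid in the current process.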

.c_str());
#ifdef TRITON_ENABLE_CIG
// Set device if CiG is disabled
if (!isCiGEnabled())

@nnshah1 nnshah1 Sep 25, 2024

@tanmayv25, @ashishk98 - is there a way to have a single scoped object, ScopedCudaDeviceContext, that internally checks whether there is an application_context and, if there is, uses push / pop, and otherwise uses cudaSetDevice?

We don't currently use them in the same locations, but I am wondering whether that would be possible. I think it would be cleaner logically: an 'application_context' basically takes the place of the 'device', but otherwise the logic remains the same.

    ScopedObject(Device);
    ScopedObject(Context);
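
A minimal sketch of such a scoped object is shown below, under stated assumptions. So that it compiles without CUDA, the `Fake*` functions stand in for cudaSetDevice, cuCtxPushCurrent, and cuCtxPopCurrent and only record what was called; `ContextHandle` stands in for CUcontext. The class layout is illustrative, not the implementation in this PR.

```cpp
#include <string>
#include <vector>

// Stand-in for CUcontext (an opaque pointer in the driver API).
using ContextHandle = void*;

// Call log used only to make the ordering of calls observable.
inline std::vector<std::string>& CallLog()
{
  static std::vector<std::string> log;
  return log;
}

// Stand-ins for cudaSetDevice / cuCtxPushCurrent / cuCtxPopCurrent.
inline void FakeSetDevice(int device)
{
  CallLog().push_back("set_device " + std::to_string(device));
}
inline void FakePushContext(ContextHandle) { CallLog().push_back("push"); }
inline void FakePopContext() { CallLog().push_back("pop"); }

// Hypothetical RAII guard: if an application context is supplied it is
// pushed on construction and popped on destruction; otherwise the guard
// falls back to selecting the device and has nothing to undo.
class ScopedCudaDeviceContext {
 public:
  ScopedCudaDeviceContext(int device, ContextHandle app_ctx)
      : pushed_(app_ctx != nullptr)
  {
    if (pushed_) {
      FakePushContext(app_ctx);
    } else {
      FakeSetDevice(device);
    }
  }

  ~ScopedCudaDeviceContext()
  {
    if (pushed_) {
      FakePopContext();
    }
  }

  // Non-copyable so a push is matched by exactly one pop.
  ScopedCudaDeviceContext(const ScopedCudaDeviceContext&) = delete;
  ScopedCudaDeviceContext& operator=(const ScopedCudaDeviceContext&) = delete;

 private:
  bool pushed_;
};
```

Call sites would then declare one guard at the top of a scope and stay the same whether or not an application context is configured.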

Author

We can take a look at this in the next iteration

@@ -175,7 +175,13 @@ ModelState::ModelState(TRITONBACKEND_Model* triton_model)
ModelState::~ModelState()
{
for (auto& device_engine : device_engines_) {
cudaSetDevice(device_engine.first.first);
#ifdef TRITON_ENABLE_CIG

I know I had asked about using macros to enable this, but I would like to avoid this kind of guard. If we can use a single method and have two different implementations of that method / object, I would prefer that to having the macros embedded in the functions / methods.

Author

Fixed

@nnshah1

nnshah1 commented Sep 25, 2024

@tanmayv25 , @ashishk98

I would like to see:

  1. Can we generalize this to: "TRITON_ENABLE_APPLICATION_CONTEXT" instead of CIG - as it looks to be general for any cuda context?

  2. Can we use a more generic GetParameter<uint64> templated method similar to the tensorrtllm backend? In the future we may consider adding this to the backend APIs, as it seems like a common pattern. A small optimization here could be to use MemberAsUint() instead of MemberAsString() in the template.

  3. Can we create a method / function that would selectively use setdevice or push/pop but otherwise be the same logic? We can then enable the alternate constructor / check in one place instead of multiple.

if (!model_state_->isCiGEnabled())
#endif // TRITON_ENABLE_CIG
{
cudaSetDevice(DeviceId());
Contributor

Would you mind sharing the reasoning for avoiding the set-device calls? Wouldn't that cause the model not to be placed / executed on the selected device (based on the model config)?

Author

  1. The intended use of CUDA context sharing is targeted only at single-GPU (RTX end-user) systems. I wanted to avoid complications with this use case.
  2. When we call cudaSetDevice(), the CUDA runtime resets to using the default CUDA context for the thread.

@tanmayv25
Contributor

tanmayv25 commented Sep 25, 2024

How will model instances on multiple GPUs be handled? AFAIK a CUDA context is per device. If we have more than one GPU device, then we should pass a CUDA context handle for each GPU device. Or am I missing something here?

@tanmayv25
Contributor

@ashishk98 I believe we still need to raise an error if someone tries to use a pre-built CUDA context in a multi-GPU environment, right?
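
One possible shape for that check is sketched below. `CheckSharedContextConfig` is a hypothetical function, and `device_count` is passed in as a plain argument here; in real code it would presumably come from cudaGetDeviceCount.

```cpp
#include <string>

// Hypothetical validation: a pre-built (application-provided) context is
// tied to one device, so reject it when more than one GPU is visible.
// Returns an empty string when the configuration is acceptable,
// otherwise a message describing the problem.
std::string CheckSharedContextConfig(bool has_shared_context, int device_count)
{
  if (has_shared_context && device_count > 1) {
    return "a pre-built CUDA context is only supported on single-GPU "
           "systems, but " +
           std::to_string(device_count) + " devices were found";
  }
  return "";
}
```

Running the check once during model initialization would surface the misconfiguration before any instance is created.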

@fpetrini15
Contributor

Can the stakeholders provide another round of reviews on this PR? We'd like to get these changes into a release asset this week.
