Enable Cuda in Graphics Implementation for TensorRT backend #100

Open
wants to merge 13 commits into
base: main
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -269,6 +269,7 @@ target_link_libraries(
   triton-tensorrt-backend
   PRIVATE
     CUDA::cudart
+    CUDA::cuda_driver

@nv-kmcgill53, @mc-nv, @tanmayv25 - any issues with this dependency?

Contributor

@mc-nv mc-nv Sep 18, 2024

What is the context behind adding this dependency?
From the documentation I see:

CUDA Driver Library
The CUDA Driver library (cuda) are used by applications that use calls such as cuMemAlloc, and cuMemFree.
Targets Created:
CUDA::cuda_driver

Isn't this dependency already a requirement of TensorRT itself?
I thought that by default our product expects the driver to be installed and, when GPU capability is given, available for use, including the driver targets and binaries.

Contributor

Adding this dependency should be fine. Ashish is linking correctly according to the CUDA documentation, which states:

Context management can be done through the driver API, but is not exposed in the runtime API

So they will need to link against the driver instead of just linking against the cuda runtime.

Contributor

So they will need to link against the driver instead of just linking against the cuda runtime.

I don't agree with this statement; the current linkage doesn't explain why the user wants to add it explicitly.

Contributor

@nv-kmcgill53 nv-kmcgill53 Sep 18, 2024

The CMake documentation isn't exhaustive when it mentions cuMemAlloc and cuMemFree. The user in this case is using the Driver API to set/pass the CUDA context around in the backend, rather than letting the core take care of this. This is the reason for adding the CUDA::cuda_driver lib to the linking path. This PR necessarily makes use of functions in the driver where the trt_backend didn't before.

Contributor

The Triton TensorRT backend cannot work without CUDA, Triton Inference Server, and a TensorRT installation.
The current change, per my understanding, uses only cudaSetDevice (CUDA::cudart) and cudaGetErrorString (CUDA runtime API), and those dependencies are already satisfied. That's why I don't see any reason to link against CUDA::cuda_driver.

Author

We need this dependency because we are using a special context called a CUDA in Graphics (CiG) context, which has to work with the CUDA driver DLL for its operations.

)


28 changes: 19 additions & 9 deletions src/instance_state.cc
@@ -257,7 +257,10 @@ ModelInstanceState::ModelInstanceState(

 ModelInstanceState::~ModelInstanceState()
 {
-  cudaSetDevice(DeviceId());
+  // Set device if CiG is disabled
+  if (!model_state_->isCiGEnabled()) {
+    cudaSetDevice(DeviceId());
+  }
   for (auto& io_binding_infos : io_binding_infos_) {
     for (auto& io_binding_info : io_binding_infos) {
       if (!io_binding_info.IsDynamicShapeOutput() &&
@@ -424,7 +427,10 @@ ModelInstanceState::Run(
   payload_.reset(new Payload(next_set_, requests, request_count));
   SET_TIMESTAMP(payload_->compute_start_ns_);

-  cudaSetDevice(DeviceId());
+  // Set device if CiG is disabled
+  if (!model_state_->isCiGEnabled()) {
+    cudaSetDevice(DeviceId());
Contributor

Would you mind sharing the reasoning for avoiding the set-device calls? Wouldn't that risk the model not being placed or executed on the device selected in the model config?

Author

  1. CUDA context sharing is intended only for single-GPU (RTX end-user) systems. I wanted to avoid complications with this use case.
  2. When we call cudaSetDevice(), the CUDA runtime resets the thread to using the default CUDA context.

+  }
 #ifdef TRITON_ENABLE_STATS
   {
     SET_TIMESTAMP(payload_->compute_start_ns_);
@@ -1551,13 +1557,17 @@ ModelInstanceState::EvaluateTensorRTContext(
 TRITONSERVER_Error*
 ModelInstanceState::InitStreamsAndEvents()
 {
-  // Set the device before preparing the context.
-  auto cuerr = cudaSetDevice(DeviceId());
-  if (cuerr != cudaSuccess) {
-    return TRITONSERVER_ErrorNew(
-        TRITONSERVER_ERROR_INTERNAL, (std::string("unable to set device for ") +
-                                      Name() + ": " + cudaGetErrorString(cuerr))
-                                         .c_str());
+  // Set device if CiG is disabled
+  if (!model_state_->isCiGEnabled()) {
+    // Set the device before preparing the context.
+    auto cuerr = cudaSetDevice(DeviceId());
+    if (cuerr != cudaSuccess) {
+      return TRITONSERVER_ErrorNew(
+          TRITONSERVER_ERROR_INTERNAL,
+          (std::string("unable to set device for ") + Name() + ": " +
+           cudaGetErrorString(cuerr))
+              .c_str());
+    }
   }

   // Create CUDA streams associated with the instance
57 changes: 34 additions & 23 deletions src/model_state.cc
@@ -175,7 +175,10 @@ ModelState::ModelState(TRITONBACKEND_Model* triton_model)
 ModelState::~ModelState()
 {
   for (auto& device_engine : device_engines_) {
-    cudaSetDevice(device_engine.first.first);
+    // Set device if CiG is disabled
+    if (!isCiGEnabled()) {
+      cudaSetDevice(device_engine.first.first);
+    }
     auto& runtime = device_engine.second.first;
     auto& engine = device_engine.second.second;
     // Need to reset explicitly to ensure proper destruction order
@@ -209,15 +212,17 @@ ModelState::CreateEngine(
   // We share the engine (for models that don't have dynamic shapes) and
   // runtime across instances that have access to the same GPU/NVDLA.
   if (eit->second.second == nullptr) {
-    auto cuerr = cudaSetDevice(gpu_device);
-    if (cuerr != cudaSuccess) {
-      return TRITONSERVER_ErrorNew(
-          TRITONSERVER_ERROR_INTERNAL,
-          (std::string("unable to set device for ") + Name() + ": " +
-           cudaGetErrorString(cuerr))
-              .c_str());
+    // Set device if CiG is disabled
+    if (!isCiGEnabled()) {
+      auto cuerr = cudaSetDevice(gpu_device);
+      if (cuerr != cudaSuccess) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_INTERNAL,
+            (std::string("unable to set device for ") + Name() + ": " +
+             cudaGetErrorString(cuerr))
+                .c_str());
+      }
     }
-
     const bool new_runtime = (eit->second.first == nullptr);
     RETURN_IF_ERROR(LoadPlan(
         model_path, dla_core_id, &eit->second.first, &eit->second.second,
@@ -321,13 +326,16 @@ ModelState::AutoCompleteConfig()
            " to auto-complete config for " + Name())
               .c_str()));

-  cuerr = cudaSetDevice(device_id);
-  if (cuerr != cudaSuccess) {
-    return TRITONSERVER_ErrorNew(
-        TRITONSERVER_ERROR_INTERNAL,
-        (std::string("unable to set CUDA device to GPU ") +
-         std::to_string(device_id) + " : " + cudaGetErrorString(cuerr))
-            .c_str());
+  // Set device if CiG is disabled
+  if (!isCiGEnabled()) {
+    cuerr = cudaSetDevice(device_id);
+    if (cuerr != cudaSuccess) {
+      return TRITONSERVER_ErrorNew(
+          TRITONSERVER_ERROR_INTERNAL,
+          (std::string("unable to set CUDA device to GPU ") +
+           std::to_string(device_id) + " : " + cudaGetErrorString(cuerr))
+              .c_str());
+    }
   }

   std::string artifact_name;
@@ -373,13 +381,16 @@ ModelState::AutoCompleteConfig()

   RETURN_IF_ERROR(AutoCompleteConfigHelper(model_path));

-  cuerr = cudaSetDevice(current_device);
-  if (cuerr != cudaSuccess) {
-    return TRITONSERVER_ErrorNew(
-        TRITONSERVER_ERROR_INTERNAL,
-        (std::string("unable to revert CUDA device to GPU ") +
-         std::to_string(current_device) + " : " + cudaGetErrorString(cuerr))
-            .c_str());
+  // Set device if CiG is disabled
+  if (!isCiGEnabled()) {
+    cuerr = cudaSetDevice(current_device);
+    if (cuerr != cudaSuccess) {
+      return TRITONSERVER_ErrorNew(
+          TRITONSERVER_ERROR_INTERNAL,
+          (std::string("unable to revert CUDA device to GPU ") +
+           std::to_string(current_device) + " : " + cudaGetErrorString(cuerr))
+              .c_str());
+    }
   }

   if (TRITONSERVER_LogIsEnabled(TRITONSERVER_LOG_VERBOSE)) {
10 changes: 9 additions & 1 deletion src/tensorrt.cc
@@ -318,6 +318,7 @@ TRITONBACKEND_ModelInstanceInitialize(TRITONBACKEND_ModelInstance* instance)
     DeviceMemoryTracker::TrackThreadMemoryUsage(lusage.get());
   }

+  ScopedRuntimeCiGContext cig_scope(model_state);

   // With each instance we create a ModelInstanceState object and
   // associate it with the TRITONBACKEND_ModelInstance.
@@ -349,10 +350,15 @@ TRITONBACKEND_ModelInstanceFinalize(TRITONBACKEND_ModelInstance* instance)
   RETURN_IF_ERROR(TRITONBACKEND_ModelInstanceState(instance, &vstate));
   ModelInstanceState* instance_state =
       reinterpret_cast<ModelInstanceState*>(vstate);

   LOG_MESSAGE(
       TRITONSERVER_LOG_INFO,
       "TRITONBACKEND_ModelInstanceFinalize: delete instance state");
+  if (!instance_state)
+  {
+    return nullptr;
+  }
+  ScopedRuntimeCiGContext cig_scope(instance_state->StateForModel());

   delete instance_state;

@@ -377,6 +383,8 @@ TRITONBACKEND_ModelInstanceExecute(
       instance, reinterpret_cast<void**>(&instance_state)));
   ModelState* model_state = instance_state->StateForModel();

+  ScopedRuntimeCiGContext cig_scope(instance_state->StateForModel());
+
   // For TensorRT backend, the executing instance may not closely tie to
   // TRITONBACKEND_ModelInstance, the instance will be assigned based on
   // execution policy.
18 changes: 16 additions & 2 deletions src/tensorrt_model.cc
@@ -25,6 +25,7 @@
 // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

 #include "tensorrt_model.h"
+#include <sstream>

 namespace triton { namespace backend { namespace tensorrt {

@@ -53,7 +54,7 @@ TensorRTModel::TensorRTModel(TRITONBACKEND_Model* triton_model)
     : BackendModel(triton_model), priority_(Priority::DEFAULT),
       use_cuda_graphs_(false), gather_kernel_buffer_threshold_(0),
       separate_output_stream_(false), eager_batching_(false),
-      busy_wait_events_(false)
+      busy_wait_events_(false), cig_ctx_(nullptr)
 {
   ParseModelConfig();
 }
@@ -89,7 +90,20 @@ TensorRTModel::ParseModelConfig()
           cuda.MemberAsBool("output_copy_stream", &separate_output_stream_));
     }
   }
-
+  triton::common::TritonJson::Value parameters;
+  if (model_config_.Find("parameters", &parameters)) {
+    triton::common::TritonJson::Value value;
+    std::string ptr_value;
+    if (parameters.Find("CIG_CONTEXT_PTR", &value)) {
+      RETURN_IF_ERROR(value.MemberAsString("string_value", &ptr_value));

@ashishk98 instead of directly converting here as a special case, I would prefer to use something similar to what is done in the trt-llm backend:

https://github.com/triton-inference-server/tensorrtllm_backend/blob/8ffb174c0fe88e677eeed7928348e20be548f3f6/inflight_batcher_llm/src/model_state.cc#L204

In this case there is a template method to convert from a parameter to a value - I think the code will be a little clearer to follow.

Also - can we convert to and from a 64 bit integer?

so something like:

model_state->GetParameter<uint64>("CUDA_CONTEXT");

It also strikes me that although we use value.MemberAsString(), we could instead use value.MemberAsUint("string_value",&ptr_value) (https://github.com/triton-inference-server/common/blob/578491fc3944f77d16a6a38e3d7691c485c47ba0/include/triton/common/triton_json.h#L927).

So three things: 1) add a templated GetParameter() method, 2) use MemberAsUint for the uint64 template, and 3) officially transfer uint64 values and convert them to and from the context.
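
A minimal sketch of what such a templated accessor might look like, assuming the parameters have already been flattened into a plain string map. `ParameterMap` and this `GetParameter` signature are hypothetical stand-ins for illustration, not the trt-llm backend's actual API; real code would read from `TritonJson::Value`. Parsing with base 0 is one way to accept both decimal and `0x`-prefixed text:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the "parameters" section of a model config:
// every value arrives as text, as it does under "string_value".
using ParameterMap = std::map<std::string, std::string>;

// Only declared here; each supported type gets its own specialization.
template <typename T>
bool GetParameter(const ParameterMap& params, const std::string& name, T* out);

template <>
bool GetParameter<std::string>(
    const ParameterMap& params, const std::string& name, std::string* out)
{
  auto it = params.find(name);
  if (it == params.end()) {
    return false;
  }
  *out = it->second;
  return true;
}

template <>
bool GetParameter<std::uint64_t>(
    const ParameterMap& params, const std::string& name, std::uint64_t* out)
{
  auto it = params.find(name);
  if (it == params.end()) {
    return false;
  }
  try {
    std::size_t consumed = 0;
    // Base 0 auto-detects a 0x (hex) or leading-0 (octal) prefix.
    const unsigned long long value = std::stoull(it->second, &consumed, 0);
    if (consumed != it->second.size()) {
      return false;  // trailing characters that are not part of the number
    }
    *out = static_cast<std::uint64_t>(value);
    return true;
  }
  catch (const std::exception&) {
    return false;  // empty, non-numeric, or out-of-range text
  }
}
```

With this shape, `GetParameter<std::uint64_t>` would accept `"0x7f00deadbeef"` as well as `"42"`, and report failure for a missing key or for text with trailing garbage.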

Author

I have added a GetParameter call for std::string instead of UINT64. This is because when we add the parameter to the model config, it is converted directly into a hex string rather than a numeric string. Hence, while parsing the pointer, MemberAsUint fails because it gets a hex string to parse.
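
The mismatch described above can be reproduced outside Triton. The sketch below is an illustration only: `PointerToString`, `StringToPointer`, and `ParseDecimal` are hypothetical helpers, and `ParseDecimal` merely stands in for any decimal-only parser (it is not `MemberAsUint`). Streaming a pointer produces implementation-defined text, typically hexadecimal, which a decimal parser rejects but stream extraction into `void*` reads back:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Client side (hypothetical): turn an opaque handle into text for the config.
std::string PointerToString(const void* ptr)
{
  std::ostringstream ss;
  ss << ptr;  // implementation-defined format, typically hex like "0x7f00deadbeef"
  return ss.str();
}

// Backend side: recover the handle from the text produced above.
bool StringToPointer(const std::string& text, void** out)
{
  std::istringstream ss(text);
  void* ptr = nullptr;
  ss >> ptr;
  if (ss.fail()) {
    return false;
  }
  *out = ptr;
  return true;
}

// Stand-in for a decimal-only parser; no overflow check, illustration only.
bool ParseDecimal(const std::string& text, std::uint64_t* out)
{
  if (text.empty()) {
    return false;
  }
  std::uint64_t value = 0;
  for (char c : text) {
    if (c < '0' || c > '9') {
      return false;  // 'x' and hex digits a-f end up here
    }
    value = value * 10 + static_cast<std::uint64_t>(c - '0');
  }
  *out = value;
  return true;
}
```

The round trip relies on both sides using the same standard library; the textual form of a pointer is not portable across implementations, so producer and consumer would need to agree on it.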

+      std::stringstream ss;
+      ss << ptr_value;
+      void* ctx_ptr;
+      ss >> ctx_ptr;
+      cig_ctx_ = static_cast<CUcontext>(ctx_ptr);
+      LOG_MESSAGE(TRITONSERVER_LOG_VERBOSE, "CiG Context pointer is set");
+    }
+  }
   return nullptr;  // Success
 }

52 changes: 52 additions & 0 deletions src/tensorrt_model.h
@@ -26,6 +26,7 @@
 #pragma once

 #include "triton/backend/backend_model.h"
+#include <cuda.h>

 namespace triton { namespace backend { namespace tensorrt {

@@ -53,6 +54,39 @@ class TensorRTModel : public BackendModel {
   bool EagerBatching() { return eager_batching_; }
   bool BusyWaitEvents() { return busy_wait_events_; }

+
+  //! The following functions relate to CiG (CUDA in Graphics) context sharing for
+  //! the gaming use case. Creating a shared context reduces context-switching overhead
+  //! and leads to better performance of model execution alongside graphics workloads.
+  CUcontext GetCiGContext() { return cig_ctx_; }

@ashishk98 question: is this specific to CiG, or could it be applied to any application-provided CUDA context?

+  bool isCiGEnabled() { return cig_ctx_ != nullptr; }
+
+  inline TRITONSERVER_Error* PushCiGContext()
+  {
+    if (CUDA_SUCCESS != cuCtxPushCurrent(cig_ctx_)) {
+      return TRITONSERVER_ErrorNew(
+          TRITONSERVER_ERROR_INTERNAL,
+          (std::string("unable to push CiG context for ") + Name()).c_str());
+    }
+    return nullptr;
+  }
+
+  inline TRITONSERVER_Error* PopCiGContext()
+  {
+    CUcontext oldCtx{};
+    if (CUDA_SUCCESS != cuCtxPopCurrent(&oldCtx)) {
+      return TRITONSERVER_ErrorNew(
+          TRITONSERVER_ERROR_INTERNAL,
+          (std::string("unable to pop CiG context for ") + Name()).c_str());
+    }
+    if (oldCtx != cig_ctx_) {
+      return TRITONSERVER_ErrorNew(
+          TRITONSERVER_ERROR_INTERNAL,
+          (std::string("popping the wrong CiG context for ") + Name()).c_str());
+    }
+    return nullptr;
+  }
+
  protected:
   common::TritonJson::Value graph_specs_;
   Priority priority_;
@@ -61,6 +95,24 @@ class TensorRTModel : public BackendModel {
   bool separate_output_stream_;
   bool eager_batching_;
   bool busy_wait_events_;
+  CUcontext cig_ctx_;
 };
+
+struct ScopedRuntimeCiGContext {
+  ScopedRuntimeCiGContext(TensorRTModel* model_state)
+      : model_state_(model_state)
+  {
+    if (model_state_->isCiGEnabled()) {
+      THROW_IF_BACKEND_MODEL_ERROR(model_state_->PushCiGContext());
+    }
+  }
+  ~ScopedRuntimeCiGContext()
+  {
+    if (model_state_->isCiGEnabled()) {
+      THROW_IF_BACKEND_MODEL_ERROR(model_state_->PopCiGContext());
+    }
+  }
+  TensorRTModel* model_state_;
+};

 }}}  // namespace triton::backend::tensorrt
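
One thing to watch in `ScopedRuntimeCiGContext`: destructors are implicitly `noexcept` since C++11, so if `THROW_IF_BACKEND_MODEL_ERROR` throws inside `~ScopedRuntimeCiGContext()`, the program would call `std::terminate`. A possible alternative, sketched here against a fake context stack rather than the real driver API, records the failure instead of throwing. `FakeContext`, `PushContext`, `PopContext`, and `ScopedContext` are hypothetical names; in the backend the push and pop would be `cuCtxPushCurrent` and `cuCtxPopCurrent`, and the message would presumably go through `LOG_MESSAGE`:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a CUcontext handle and the per-thread context stack.
using FakeContext = int;

static std::vector<FakeContext> g_ctx_stack;
static std::vector<std::string> g_log;

bool PushContext(FakeContext ctx)
{
  g_ctx_stack.push_back(ctx);
  return true;
}

// Pops the top context and reports whether it was the one we expected.
bool PopContext(FakeContext expected)
{
  if (g_ctx_stack.empty()) {
    return false;
  }
  const FakeContext top = g_ctx_stack.back();
  g_ctx_stack.pop_back();
  return top == expected;
}

class ScopedContext {
 public:
  // Pushes only when sharing is enabled; remembers whether the push happened.
  ScopedContext(FakeContext ctx, bool enabled)
      : ctx_(ctx), active_(enabled && PushContext(ctx))
  {
  }

  // Never throws: a failed or mismatched pop is logged instead.
  ~ScopedContext() noexcept
  {
    if (active_ && !PopContext(ctx_)) {
      g_log.push_back("failed to pop shared context");
    }
  }

  ScopedContext(const ScopedContext&) = delete;
  ScopedContext& operator=(const ScopedContext&) = delete;

  bool Active() const { return active_; }

 private:
  FakeContext ctx_;
  bool active_;
};
```

Tracking `active_` also means the destructor pops only if the constructor actually pushed, rather than re-querying the enabled flag at scope exit.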