
Commit d146e81

Author: pytorchbot
Commit message: 2024-11-22 nightly release (1b40bcc)
1 parent 672d371 · commit d146e81

File tree: 6 files changed (+423 -62 lines changed)


docsrc/tutorials/serving_torch_tensorrt_with_triton.rst (+83 -62)
@@ -22,42 +22,55 @@ Step 1: Optimize your model with Torch-TensorRT
 Most Torch-TensorRT users will be familiar with this step. For the purpose of
 this demonstration, we will be using a ResNet50 model from Torchhub.

-Let’s first pull the `NGC PyTorch Docker container <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch>`__. You may need to create
+We will be working in the ``//examples/triton`` directory, which contains the scripts used in this tutorial.
+
+First pull the `NGC PyTorch Docker container <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch>`__. You may need to create
 an account and get the API key from `here <https://ngc.nvidia.com/setup/>`__.
 Sign up and login with your key (follow the instructions
 `here <https://ngc.nvidia.com/setup/api-key>`__ after signing up).

 ::

-   # <xx.xx> is the yy:mm for the publishing tag for NVIDIA's Pytorch
-   # container; eg. 22.04
+   # YY.MM is the year and month of the publishing tag for NVIDIA's PyTorch
+   # container; e.g. 24.08
+   # NOTE: Use the same publishing tag for both the PyTorch and Triton containers

-   docker run -it --gpus all -v ${PWD}:/scratch_space nvcr.io/nvidia/pytorch:<xx.xx>-py3
+   docker run -it --gpus all -v ${PWD}:/scratch_space nvcr.io/nvidia/pytorch:YY.MM-py3
    cd /scratch_space

-Once inside the container, we can proceed to download a ResNet model from
-Torchhub and optimize it with Torch-TensorRT.
+Within the container we can export the model into the correct directory in our Triton model repository. The export script uses the **Dynamo** frontend for Torch-TensorRT to compile the PyTorch model to TensorRT, then saves it in **TorchScript**, a serialization format supported by Triton.

 ::

-   import torch
-   import torch_tensorrt
-   torch.hub._validate_not_a_forked_repo=lambda a,b,c: True
+   import torch
+   import torch_tensorrt
+
+   torch.hub._validate_not_a_forked_repo = lambda a, b, c: True
+
+   # load model
+   model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda")
+
+   # Compile with Torch-TensorRT (FP16 enabled)
+   trt_model = torch_tensorrt.compile(model,
+       inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
+       enabled_precisions={torch_tensorrt.dtype.f16},
+   )
+
+   # Trace the compiled module so it can be serialized as TorchScript
+   ts_trt_model = torch.jit.trace(trt_model, torch.rand(1, 3, 224, 224).to("cuda"))
+
+   # Save the model into the Triton model repository
+   torch.jit.save(ts_trt_model, "/triton_example/model_repository/resnet50/1/model.pt")

-   # load model
-   model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda")
+You can run the script with the following command (from ``//examples/triton``):

-   # Compile with Torch TensorRT;
-   trt_model = torch_tensorrt.compile(model,
-       inputs= [torch_tensorrt.Input((1, 3, 224, 224))],
-       enabled_precisions= { torch.half} # Run with FP32
-   )
+::

-   # Save the model
-   torch.jit.save(trt_model, "model.pt")
+   docker run --gpus all -it --rm -v ${PWD}:/triton_example nvcr.io/nvidia/pytorch:YY.MM-py3 python /triton_example/export.py

-After copying the model, exit the container. The next step in the process
-is to set up a Triton Inference Server.
+This will save the serialized TorchScript version of the ResNet model in the correct directory in the model repository.

 Step 2: Set Up Triton Inference Server
 --------------------------------------
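As an aside on the export step shown in the hunk above: before starting Triton, the saved TorchScript module can be reloaded and run on a dummy batch as a quick sanity check. This is a minimal illustrative sketch, not part of the tutorial; it assumes the ``/triton_example/model_repository/resnet50/1/model.pt`` path used in the diff and a CUDA-capable GPU.

::

   import torch
   import torch_tensorrt  # importing registers the Torch-TensorRT runtime ops needed to deserialize the module

   # Reload the serialized module produced by export.py and run a dummy batch
   ts_model = torch.jit.load("/triton_example/model_repository/resnet50/1/model.pt").to("cuda")
   with torch.no_grad():
       out = ts_model(torch.rand(1, 3, 224, 224).to("cuda"))

   # A ResNet50 classifier should produce a [1, 1000] logit tensor
   print(out.shape)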
@@ -90,25 +103,23 @@ For the model we prepared in step 1, the following configuration can be used:

 ::

-    name: "resnet50"
-    platform: "pytorch_libtorch"
-    max_batch_size : 0
-    input [
-      {
-        name: "input__0"
-        data_type: TYPE_FP32
-        dims: [ 3, 224, 224 ]
-        reshape { shape: [ 1, 3, 224, 224 ] }
-      }
-    ]
-    output [
-      {
-        name: "output__0"
-        data_type: TYPE_FP32
-        dims: [ 1, 1000 ,1, 1]
-        reshape { shape: [ 1, 1000 ] }
-      }
-    ]
+   name: "resnet50"
+   backend: "pytorch"
+   max_batch_size : 0
+   input [
+     {
+       name: "x"
+       data_type: TYPE_FP32
+       dims: [ 1, 3, 224, 224 ]
+     }
+   ]
+   output [
+     {
+       name: "output0"
+       data_type: TYPE_FP32
+       dims: [1, 1000]
+     }
+   ]

 The ``config.pbtxt`` file is used to describe the exact model configuration
 with details like the names and shapes of the input and output layer(s),
@@ -124,14 +135,14 @@ with the docker command below. Refer `this page <https://catalog.ngc.nvidia.com/

    # Make sure that the TensorRT version in the Triton container
    # and TensorRT version in the environment used to optimize the model
-   # are the same.
+   # are the same. Roughly, matching publishing tags ship the same TensorRT version.

-   docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /full/path/to/the_model_repository/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models
+   docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v ${PWD}:/triton_example nvcr.io/nvidia/tritonserver:YY.MM-py3 tritonserver --model-repository=/triton_example/model_repository

 This should spin up a Triton Inference server. Next step, building a simple
 http client to query the server.

-Step 3: Building a Triton Client to Query the Server
+Step 3: Building a Triton Client to Query the Servers
 ----------------------------------------------------

 Before proceeding, make sure to have a sample image on hand. If you don't
@@ -159,22 +170,24 @@ resize and normalize the query image.

 ::

-   import numpy as np
-   from torchvision import transforms
-   from PIL import Image
-   import tritonclient.http as httpclient
-   from tritonclient.utils import triton_to_np_dtype
-
-   # preprocessing function
-   def rn50_preprocess(img_path="img1.jpg"):
-       img = Image.open(img_path)
-       preprocess = transforms.Compose([
-           transforms.Resize(256),
-           transforms.CenterCrop(224),
-           transforms.ToTensor(),
-           transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
-       ])
-       return preprocess(img).numpy()
+   import numpy as np
+   from torchvision import transforms
+   from PIL import Image
+   import tritonclient.http as httpclient
+   from tritonclient.utils import triton_to_np_dtype
+
+   # preprocessing function
+   def rn50_preprocess(img_path="/triton_example/img1.jpg"):
+       img = Image.open(img_path)
+       preprocess = transforms.Compose(
+           [
+               transforms.Resize(256),
+               transforms.CenterCrop(224),
+               transforms.ToTensor(),
+               transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
+           ]
+       )
+       return preprocess(img).unsqueeze(0).numpy()

    transformed_img = rn50_preprocess()

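Note that the updated ``rn50_preprocess`` returns the array with an explicit batch dimension (via ``unsqueeze(0)``), matching the ``[1, 3, 224, 224]`` shape declared for the ``x`` input in the new ``config.pbtxt``. A minimal illustrative check, assuming the function and the ``/triton_example/img1.jpg`` default path from the diff above:

::

   import numpy as np

   arr = rn50_preprocess()

   # Shape and dtype must match the "x" input declared in config.pbtxt (TYPE_FP32, dims [1, 3, 224, 224])
   assert arr.shape == (1, 3, 224, 224)
   assert arr.dtype == np.float32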
@@ -186,22 +199,22 @@ with the Triton Inference Server.
    # Setting up client
    client = httpclient.InferenceServerClient(url="localhost:8000")

-Secondly, we specify the names of the input and output layer(s) of our model.
+Secondly, we specify the names of the input and output layer(s) of our model. These can be obtained during export and should already be specified in your ``config.pbtxt``.

 ::

-   inputs = httpclient.InferInput("input__0", transformed_img.shape, datatype="FP32")
+   inputs = httpclient.InferInput("x", transformed_img.shape, datatype="FP32")
    inputs.set_data_from_numpy(transformed_img, binary_data=True)

-   outputs = httpclient.InferRequestedOutput("output__0", binary_data=True, class_count=1000)
+   outputs = httpclient.InferRequestedOutput("output0", binary_data=True, class_count=1000)

 Lastly, we send an inference request to the Triton Inference Server.

 ::

    # Querying the server
    results = client.infer(model_name="resnet50", inputs=[inputs], outputs=[outputs])
-   inference_output = results.as_numpy('output__0')
+   inference_output = results.as_numpy('output0')
    print(inference_output[:5])

 The output should look like below:
@@ -214,3 +227,11 @@ The output should look like below:
 The output format here is ``<confidence_score>:<classification_index>``.
 To learn how to map these to the label names and more, refer to Triton Inference Server's
 `documentation <https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_classification.md>`__.
+
+You can try out this client quickly using:
+
+::
+
+   # Remember to use the same publishing tag for all steps (e.g. 24.08)
+
+   docker run -it --net=host -v ${PWD}:/triton_example nvcr.io/nvidia/tritonserver:YY.MM-py3-sdk bash -c "pip install torchvision && python /triton_example/client.py"
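Because the client requests ``class_count=1000``, Triton returns the output as classification strings in the ``<confidence_score>:<classification_index>`` format mentioned above. Below is a minimal illustrative way to unpack the top entries, assuming the ``inference_output`` array from ``client.py`` in this diff (if labels are configured, a third ``:<label>`` field is appended and ignored here):

::

   # Each entry is a byte string such as b"12.4742:90"
   for entry in inference_output[:5]:
       fields = entry.decode("utf-8").split(":")
       score, class_idx = fields[0], fields[1]
       print(f"class {class_idx}: confidence {score}")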
