
Commit 21e7927

docs: Correct README Instructions (#164)
This change corrects the instructions for how to use PyTorch 2 with the backend.
1 parent fa0dd59 commit 21e7927

README.md

Lines changed: 212 additions & 8 deletions
### PyTorch 2.0 Models

PyTorch 2.0 features are available.
However, Triton's PyTorch backend requires a serialized representation of the model in the form of a `model.pt` file.
This serialized representation can be generated with PyTorch's
[`torch.save()`](https://docs.pytorch.org/tutorials/beginner/saving_loading_models.html#id1)
function.

The model repository should look like:

```bash
model_repository/
`-- model_directory
    |-- 1
    |   `-- model.pt
    `-- config.pbtxt
```

Where `model.pt` is the serialized representation of the model.
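
A minimal sketch of producing this file, assuming the entire module (rather than only its `state_dict`) is what gets serialized, and using a hypothetical `MyModel` class:

```python
import torch
import torch.nn as nn

# Hypothetical model used only for illustration.
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 4)

    def forward(self, x):
        return self.linear(x)

model = MyModel().eval()

# Serialize the whole module; copy the result to
# model_repository/model_directory/1/model.pt.
torch.save(model, "model.pt")
```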

### TorchScript Models

The `model.pt` is the TorchScript model file.
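
A TorchScript `model.pt` is typically produced by tracing or scripting the model and then saving it with `torch.jit.save`; the sketch below assumes a hypothetical `MyModel` module and a representative example input shape.

```python
import torch
import torch.nn as nn

# Hypothetical model used only for illustration.
class MyModel(nn.Module):
    def forward(self, x):
        return torch.relu(x)

model = MyModel().eval()

# Trace with a representative input and save the TorchScript program;
# copy the result to model_repository/model_directory/1/model.pt.
traced = torch.jit.trace(model, torch.randn(1, 3, 224, 224))
torch.jit.save(traced, "model.pt")
```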


## Configuration

Triton exposes some flags to control the execution mode of TorchScript models through the `Parameters` section of the model's `config.pbtxt` file.

### Configuration Options

* `default_model_name`:
Instructs the Triton PyTorch backend to load the model from a file of the given name.

The model config specifying the option would look like:

```proto
default_model_name: "another_file_name.pt"
```

### Parameters

* `DISABLE_OPTIMIZED_EXECUTION`:
Boolean flag to disable the optimized execution of TorchScript models.
By default, optimized execution is enabled.

The initial calls to a loaded TorchScript model take a significant amount of time.
Due to this longer model warmup
([pytorch #57894](https://github.com/pytorch/pytorch/issues/57894)),
Triton also allows execution of models without these optimizations.
In some models, optimized execution does not benefit performance
([pytorch #19978](https://github.com/pytorch/pytorch/issues/19978))
and in other cases impacts performance negatively
([pytorch #53824](https://github.com/pytorch/pytorch/issues/53824)).

The section of the model config file specifying this parameter will look like:

```proto
parameters: {
  key: "DISABLE_OPTIMIZED_EXECUTION"
  value: { string_value: "true" }
}
```

* `INFERENCE_MODE`:

Boolean flag to enable the Inference Mode execution of TorchScript models.
By default, inference mode is enabled.

[InferenceMode](https://pytorch.org/cppdocs/notes/inference_mode.html) is a new RAII guard analogous to `NoGradMode` to be used when you are certain your operations will have no interactions with autograd.
Compared to `NoGradMode`, code run under this mode gets better performance by disabling autograd-related work such as view tracking and version counter bumps.

Please note that in some models, InferenceMode might not benefit performance and in fewer cases might impact performance negatively.

To enable inference mode, use the configuration example below:

```proto
parameters: {
  key: "INFERENCE_MODE"
  value: { string_value: "true" }
}
```

* `DISABLE_CUDNN`:

Boolean flag to disable the cuDNN library.
By default, cuDNN is enabled.

[cuDNN](https://developer.nvidia.com/cudnn) is a GPU-accelerated library of primitives for deep neural networks.
It provides highly tuned implementations for standard routines.

Typically, models run with cuDNN enabled execute faster.
However, there are some exceptions where using cuDNN can be slower, cause higher memory usage, or result in errors.

To disable cuDNN, use the configuration example below:

```proto
parameters: {
  key: "DISABLE_CUDNN"
  value: { string_value: "true" }
}
```

* `ENABLE_WEIGHT_SHARING`:

Boolean flag to enable model instances on the same device to share weights.
This optimization should not be used with stateful models.
If not specified, weight sharing is disabled.

To enable weight sharing, use the configuration example below:

```proto
parameters: {
  key: "ENABLE_WEIGHT_SHARING"
  value: { string_value: "true" }
}
```

* `ENABLE_CACHE_CLEANING`:

Boolean flag to enable CUDA cache cleaning after each model execution.
If not specified, cache cleaning is disabled.
This flag has no effect if the model is on CPU.

Setting this flag to true will likely negatively impact performance due to the additional CUDA cache cleaning operation after each model execution.
Therefore, you should only use this flag if you serve multiple models with Triton and encounter CUDA out-of-memory issues during model executions.

To enable cleaning of the CUDA cache after every execution, use the configuration example below:

```proto
parameters: {
  key: "ENABLE_CACHE_CLEANING"
  value: { string_value: "true" }
}
```

* `INTER_OP_THREAD_COUNT`:

PyTorch allows using multiple CPU threads during TorchScript model inference.
One or more inference threads execute a model’s forward pass on the given inputs.
Each inference thread invokes a JIT interpreter that executes the ops of a model inline, one by one.

This parameter sets the size of this thread pool.
The default value of this setting is the number of CPU cores.

> [!TIP]
> Refer to
> [CPU Threading and TorchScript Inference](https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html)
> for guidance on setting this parameter properly.

To set the inter-op thread count, use the configuration example below:

```proto
parameters: {
  key: "INTER_OP_THREAD_COUNT"
  value: { string_value: "1" }
}
```

> [!NOTE]
> This parameter is set globally for the PyTorch backend.
> The value from the first model config file that specifies this parameter will be used.
> Subsequent values from other model config files, if different, will be ignored.

* `INTRA_OP_THREAD_COUNT`:

In addition to the inter-op parallelism, PyTorch can also utilize multiple threads within the ops (intra-op parallelism).
This can be useful in many cases, including element-wise ops on large tensors, convolutions, GEMMs, embedding lookups, and others.

The default value for this setting is the number of CPU cores.

> [!TIP]
> Refer to
> [CPU Threading and TorchScript Inference](https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html)
> for guidance on setting this parameter properly.

To set the intra-op thread count, use the configuration example below:

```proto
parameters: {
  key: "INTRA_OP_THREAD_COUNT"
  value: { string_value: "1" }
}
```
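
For context, these two settings correspond to PyTorch's own inter-op and intra-op threading controls; the standalone sketch below is an assumption about that mapping, not backend code.

```python
import torch

# Set thread pool sizes up front, before any parallel work has started.
torch.set_num_interop_threads(1)  # inter-op pool (parallel forward passes / JIT forks)
torch.set_num_threads(1)          # intra-op pool (parallelism inside individual ops)

print(torch.get_num_interop_threads(), torch.get_num_threads())
```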

* **Additional Optimizations**:

Two additional boolean parameters are available to disable certain Torch optimizations that can sometimes cause latency regressions in models with complex execution modes and dynamic shapes.
If not specified, both are enabled by default.

`ENABLE_JIT_EXECUTOR`

`ENABLE_JIT_PROFILING`
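
As with the parameters above, these can be set through the `parameters` section of `config.pbtxt`; a minimal sketch, assuming the same string-valued boolean convention, that disables both:

```proto
parameters: {
  key: "ENABLE_JIT_EXECUTOR"
  value: { string_value: "false" }
}
parameters: {
  key: "ENABLE_JIT_PROFILING"
  value: { string_value: "false" }
}
```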

### Model Instance Group Kind

The PyTorch backend supports the following kinds of
[Model Instance Groups](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups),
with input tensors placed as follows:

* `KIND_GPU`:

Inputs are prepared on the GPU device associated with the model instance.

* `KIND_CPU`:

Inputs are prepared on the CPU.

* `KIND_MODEL`:

Inputs are prepared on the CPU.
When loading the model, the backend does not choose the GPU device for the model;
instead, it respects the device(s) specified in the model and uses them as they are during inference.

This is useful when the model internally utilizes multiple GPUs, as demonstrated in
[this example model](https://github.com/triton-inference-server/server/blob/main/qa/L0_libtorch_instance_group_kind_model/gen_models.py)
(a minimal sketch of such a multi-device model follows the configuration example below).

> [!IMPORTANT]
> If a device is not specified in the model, the backend uses the first available GPU device.

To set the model instance group, use the configuration example below:

```proto
instance_group {
  count: 2
  kind: KIND_GPU
}
```
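
The following is the multi-device sketch referenced above; it is only an illustration and assumes a TorchScript model and at least two visible CUDA devices.

```python
import torch
import torch.nn as nn

# Hypothetical model used only for illustration; each submodule is pinned to a
# specific GPU, and under KIND_MODEL the backend respects these placements.
class SplitDeviceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(16, 16).to("cuda:0")
        self.part2 = nn.Linear(16, 4).to("cuda:1")

    def forward(self, x):
        # Under KIND_MODEL the input arrives on CPU; move it explicitly.
        h = self.part1(x.to("cuda:0"))
        return self.part2(h.to("cuda:1"))

scripted = torch.jit.script(SplitDeviceModel())
# Copy the result to model_repository/<model_directory>/1/model.pt.
torch.jit.save(scripted, "model.pt")
```

With such a model, the `instance_group` would typically specify `kind: KIND_MODEL` so that the backend leaves these device placements unchanged.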
### Customization

The following PyTorch settings may be customized by setting parameters on the
