The section of the model config file specifying this parameter will look like:

```proto
parameters: {
key: "DISABLE_OPTIMIZED_EXECUTION"
value: { string_value: "true" }
}
```

* `INFERENCE_MODE`:

Boolean flag to enable the Inference Mode execution of TorchScript models.
By default, the inference mode is enabled.

[InferenceMode](https://pytorch.org/cppdocs/notes/inference_mode.html) is a new RAII guard analogous to `NoGradMode` to be used when you are certain your operations will have no interactions with autograd.
Compared to `NoGradMode`, code run under this mode gets better performance by disabling view tracking and version counter bumps.

Please note that for some models InferenceMode might not improve performance, and in a few cases it might even impact performance negatively.

To enable inference mode, use the configuration example below:

```proto
parameters: {
key: "INFERENCE_MODE"
value: { string_value: "true" }
}
```

* `DISABLE_CUDNN`:

Boolean flag to disable the cuDNN library.
By default, cuDNN is enabled.

[cuDNN](https://developer.nvidia.com/cudnn) is a GPU-accelerated library of primitives for deep neural networks.
It provides highly tuned implementations for standard routines.

Typically, models run faster with cuDNN enabled.
However, there are some exceptions where using cuDNN can be slower, cause higher memory usage, or result in errors.

To disable cuDNN, use the configuration example below:

```proto
parameters: {
key: "DISABLE_CUDNN"
value: { string_value: "true" }
}
```

* `ENABLE_WEIGHT_SHARING`:

Boolean flag to enable model instances on the same device to share weights.
This optimization should not be used with stateful models.
If not specified, weight sharing is disabled.

To enable weight sharing, use the configuration example below:

```proto
parameters: {
key: "ENABLE_WEIGHT_SHARING"
value: { string_value: "true" }
}
```

* `ENABLE_CACHE_CLEANING`:

Boolean flag to enable CUDA cache cleaning after each model execution.
If not specified, cache cleaning is disabled.
This flag has no effect if the model is on CPU.

Setting this flag to true will likely impact performance negatively due to the additional CUDA cache cleaning operation after each model execution.
Therefore, you should only use this flag if you serve multiple models with Triton and encounter CUDA out-of-memory issues during model executions.

To enable cleaning of the CUDA cache after every execution, use the configuration example below:

```proto
parameters: {
key: "ENABLE_CACHE_CLEANING"
value: { string_value: "true" }
}
```

* `INTER_OP_THREAD_COUNT`:

PyTorch allows using multiple CPU threads during TorchScript model inference.
One or more inference threads execute a model’s forward pass on the given inputs.
Each inference thread invokes a JIT interpreter that executes the ops of a model inline, one by one.

This parameter sets the size of this thread pool.
The default value of this setting is the number of CPU cores.
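
Following the same pattern as the other parameters in this file, a configuration sketch for this setting would look like the example below (the thread count of 4 is only illustrative):

```proto
parameters: {
key: "INTER_OP_THREAD_COUNT"
value: { string_value: "4" }
}
```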

* `INTRA_OP_THREAD_COUNT`:

In addition to the inter-op parallelism described above, PyTorch can also use multiple threads within individual ops (intra-op parallelism).
This parameter sets the size of the thread pool used for that intra-op parallelism.

To set the intra-op thread count, use the configuration example below:

```proto
parameters: {
key: "INTRA_OP_THREAD_COUNT"
value: { string_value: "1" }
}
```

* **Additional Optimizations**:

Two additional boolean parameters are available to disable certain Torch optimizations that can sometimes cause latency regressions in models with complex execution modes and dynamic shapes.
If not specified, all are enabled by default.

`ENABLE_JIT_EXECUTOR`

`ENABLE_JIT_PROFILING`
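
These parameters follow the same `parameters` pattern used throughout this file; as a sketch, disabling JIT profiling, for example, would look like the snippet below (setting the value to `"false"` is assumed to turn the optimization off, matching the boolean convention of the other flags):

```proto
parameters: {
key: "ENABLE_JIT_PROFILING"
value: { string_value: "false" }
}
```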

### Model Instance Group Kind

The PyTorch backend supports the following kinds of
0 commit comments