
Commit

Merge branch 'gh-pages' into goodnotes_blog_s2e
MaanavD authored Nov 18, 2024
2 parents a4c1e38 + a9d566c commit 470995e
Showing 25 changed files with 558 additions and 244 deletions.
2 changes: 2 additions & 0 deletions docs/build/eps.md
@@ -144,6 +144,8 @@ See more information on the TensorRT Execution Provider [here](../execution-prov

Dockerfile instructions are available [here](https://github.com/microsoft/onnxruntime/tree/main/dockerfiles#tensorrt)

**Note:** Building with `--use_tensorrt_oss_parser` against TensorRT 8.x requires the additional flag `--cmake_extra_defines onnxruntime_USE_FULL_PROTOBUF=ON`.

---

## NVIDIA Jetson TX1/TX2/Nano/Xavier/Orin
12 changes: 7 additions & 5 deletions docs/execution-providers/CUDA-ExecutionProvider.md
@@ -33,15 +33,16 @@ ONNX Runtime Training is aligned with PyTorch CUDA versions; refer to the Optimi

Because of [Nvidia CUDA Minor Version Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/#minor-version-compatibility), ONNX Runtime built with CUDA 11.8 is compatible with any CUDA 11.x version, and ONNX Runtime built with CUDA 12.x is compatible with any CUDA 12.x version.

ONNX Runtime built with cuDNN 8.x is not compatible with cuDNN 9.x, and vice versa. You can choose the package based on CUDA and cuDNN major versions that match your runtime environment (e.g., PyTorch 2.3 uses cuDNN 8.x, while PyTorch 2.4 or later uses cuDNN 9.x).

Note: Starting with ONNX Runtime 1.19, **CUDA 12.x** is the default version for [ONNX Runtime GPU packages](https://pypi.org/project/onnxruntime-gpu/) published on PyPI.
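
As a quick sanity check after installing a package whose CUDA/cuDNN versions match your environment, the minimal sketch below (the model path is a placeholder) confirms that the CUDA execution provider is available and actually selected by a session:

```python
import onnxruntime as ort

# List the execution providers compiled into the installed package.
print(ort.get_available_providers())  # expect 'CUDAExecutionProvider' in the list

# Create a session that prefers CUDA and falls back to CPU.
# "model.onnx" is a placeholder path used for illustration.
sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # shows which providers the session actually uses
```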

### CUDA 12.x

| ONNX Runtime | CUDA | cuDNN | Notes |
|---------------|--------|-------|----------------------------------------------------------------------|
| 1.20.x | 12.x | 9.x | Available in PyPI. Compatible with PyTorch >= 2.4.0 for CUDA 12.x. |
| 1.19.x | 12.x | 9.x | Available in PyPI. Compatible with PyTorch >= 2.4.0 for CUDA 12.x. |
| 1.18.1 | 12.x | 9.x | cuDNN 9 is required. No Java package. |
| 1.18.0 | 12.x | 8.x | Java package is added. |
| 1.17.x | 12.x | 8.x | Only C++/C# Nuget and Python packages are released. No Java package. |
@@ -50,8 +51,9 @@ Note: starting ORT 1.19, **CUDA 12.x** becomes default version when distributing

| ONNX Runtime | CUDA | cuDNN | Notes |
|----------------------|--------|-----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| 1.20.x | 11.8 | 8.x | Not available in PyPI. See [Install ORT](../install) for details. Compatible with PyTorch <= 2.3.1 for CUDA 11.8. |
| 1.19.x | 11.8 | 8.x | Not available in PyPI. See [Install ORT](../install) for details. Compatible with PyTorch <= 2.3.1 for CUDA 11.8. |
| 1.18.x | 11.8 | 8.x | Available in PyPI. |
| 1.17<br>1.16<br>1.15 | 11.8 | 8.2.4 (Linux)<br/>8.5.0.96 (Windows) | Tested with CUDA versions from 11.6 up to 11.8, and cuDNN from 8.2 up to 8.9 |
| 1.14<br>1.13 | 11.6 | 8.2.4 (Linux)<br/>8.5.0.96 (Windows) | libcudart 11.4.43<br/>libcufft 10.5.2.100<br/>libcurand 10.2.5.120<br/>libcublasLt 11.6.5.2<br/>libcublas 11.6.5.2<br/>libcudnn 8.2.4 |
| 1.12<br>1.11 | 11.4 | 8.2.4 (Linux)<br/>8.2.2.26 (Windows) | libcudart 11.4.43<br/>libcufft 10.5.2.100<br/>libcurand 10.2.5.120<br/>libcublasLt 11.6.5.2<br/>libcublas 11.6.5.2<br/>libcudnn 8.2.4 |
2 changes: 1 addition & 1 deletion docs/execution-providers/DirectML-ExecutionProvider.md
@@ -16,7 +16,7 @@ DirectML is a high-performance, hardware-accelerated DirectX 12 library for mach

When used standalone, the DirectML API is a low-level DirectX 12 library and is suitable for high-performance, low-latency applications such as frameworks, games, and other real-time applications. The seamless interoperability of DirectML with Direct3D 12, as well as its low overhead and conformance across hardware, makes DirectML ideal for accelerating machine learning when both high performance is desired and the reliability and predictability of results across hardware is critical.

**The DirectML Execution Provider currently uses DirectML version 1.15.2** and supports up to ONNX opset 20 ([ONNX v1.15](https://github.com/onnx/onnx/releases/tag/v1.15.0)), with the exception of GridSample 20 (5D) and DeformConv, which are not yet supported. Evaluating models which require a higher opset version is unsupported and will yield poor performance. *Note: DirectML ONNX opset support may differ from that of ONNX Runtime, which can be found [here](https://onnxruntime.ai/docs/reference/compatibility.html#onnx-opset-support).*
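
As a minimal usage sketch (the model path is a placeholder; a DirectML-enabled ONNX Runtime package is assumed), a session can be pointed at the DirectML EP and fall back to CPU:

```python
import onnxruntime as ort

# "model.onnx" is a placeholder; requires a DirectML-enabled ONNX Runtime package.
sess = ort.InferenceSession(
    "model.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # confirms whether the DirectML EP was selected
```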

## Contents
{: .no_toc }
99 changes: 83 additions & 16 deletions docs/execution-providers/QNN-ExecutionProvider.md
@@ -124,8 +124,13 @@ Alternatively to setting profiling_level at compile time, profiling can be enabl

|`"enable_htp_fp16_precision"`|Description [Example](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/c_cxx/QNN_EP/mobilenetv2_classification)|
|---|---|
|'0'|Disabled. A float32 model is inferenced with fp32 precision.|
|'1'|Default. The float32 model is inferenced with fp16 precision.|

|`"offload_graph_io_quantization"`|Description|
|---|---|
|'0'|Default. Disabled. QNN EP handles quantization and dequantization of graph I/O.|
|'1'|Enabled. Offload quantization and dequantization of graph I/O to the CPU EP.|
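
For reference, the options above are passed as QNN EP provider options when creating a session. In this sketch the model file name and the `backend_path` value are assumptions used only for illustration:

```python
import onnxruntime as ort

# Provider options named in the tables above. The model file name and the
# backend_path value are assumptions used only for illustration.
provider_options = [{
    "backend_path": "QnnHtp.dll",            # assumed HTP backend library name
    "enable_htp_fp16_precision": "1",        # run the float32 model with fp16 precision
    "offload_graph_io_quantization": "1",    # offload graph I/O (de)quantization to the CPU EP
}]

sess = ort.InferenceSession(
    "model.qdq.onnx",
    providers=["QNNExecutionProvider"],
    provider_options=provider_options,
)
```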

## Supported ONNX operators

@@ -459,20 +464,20 @@ If user creates the QNN context binary .bin file weight sharing from QNN toolcha

### Inference with QNN resource sharing workflow
The OnnxRuntime inference session needs resource sharing enabled (set the session option ep.share_ep_contexts to 1) to use the dumped QNN context model with weight sharing enabled, as sketched in the example after the steps below.
- Create an OnnxRuntime inference session with ep.share_ep_contexts=1 and load the model1.onnx_ctx.onnx model.
  - The session loads the model1.onnx_ctx.onnx model.
  - The shared place is empty.
  - EPContext node1 in model1.onnx_ctx.onnx specifies that it uses Qnn_graph1.
  - QNN EP loads qnn_ctx.bin and deserializes the binary to get the QNN graphs (Qnn_graph1, Qnn_graph2).
  - Qnn_graph1 is used for this OnnxRuntime session.
  - Qnn_graph2 is put into the shared place.
- Create an OnnxRuntime inference session with ep.share_ep_contexts=1 and load the model2.onnx_ctx.onnx model.
  - The session loads the model2.onnx_ctx.onnx model.
  - EPContext node2 in model2.onnx_ctx.onnx specifies that it uses Qnn_graph2.
  - The shared place has Qnn_graph2.
  - QNN EP skips loading qnn_ctx.bin since it gets what it needs from the shared place.
  - Qnn_graph2 from the shared place is used for this session.
- To avoid issues while exiting execution, destroy the 2nd session first, then the 1st session.
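
A minimal Python sketch of this flow, assuming the two context models above have already been generated; the file names are the placeholders used in the steps, and only the `ep.share_ep_contexts` session option comes from this section:

```python
import onnxruntime as ort

def make_shared_session(model_path):
    # Enable EP resource sharing so QNN graphs deserialized from the shared
    # context binary can be reused by later sessions.
    so = ort.SessionOptions()
    so.add_session_config_entry("ep.share_ep_contexts", "1")
    return ort.InferenceSession(model_path, sess_options=so,
                                providers=["QNNExecutionProvider"])

session1 = make_shared_session("model1.onnx_ctx.onnx")  # loads qnn_ctx.bin, shares Qnn_graph2
session2 = make_shared_session("model2.onnx_ctx.onnx")  # reuses Qnn_graph2 from the shared place

# ... run inference with both sessions ...

# Destroy the 2nd session first, then the 1st session, as described above.
del session2
del session1
```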

[Code example](https://github.com/microsoft/onnxruntime/blob/291a5352b27ded5714e5748b381f2efb88f28fb9/onnxruntime/test/providers/qnn/qnn_ep_context_test.cc#L979-L992).

@@ -502,3 +507,65 @@ sess = ort.InferenceSession(model_path, providers=['QNNExecutionProvider'], prov
## Error handling
### HTP SubSystem Restart - [SSR](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_backend.html#subsystem-restart-ssr-)
QNN EP returns StatusCode::ENGINE_ERROR for QNN HTP SSR issues. The upper-level framework/application should recreate the Onnxruntime session if this error is detected during a session run.
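
A sketch of this recovery pattern in Python, assuming a placeholder model path and using a broad `except` because the exact Python exception type surfaced for ENGINE_ERROR is not pinned down here:

```python
import onnxruntime as ort

def run_with_recovery(model_path, feeds):
    """Run inference once, recreating the session and retrying if the run fails (e.g., HTP SSR)."""
    sess = ort.InferenceSession(model_path, providers=["QNNExecutionProvider"])
    try:
        return sess.run(None, feeds)
    except Exception as err:  # broad catch; ENGINE_ERROR surfaces as a runtime exception
        print(f"Session run failed ({err}); recreating the session and retrying once")
        sess = ort.InferenceSession(model_path, providers=["QNNExecutionProvider"])
        return sess.run(None, feeds)
```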


## Add new operator support in QNN EP
To enable support for a new operator in QNN EP, areas to visit:
- Does the QDQ script support this op? [code example](https://github.com/microsoft/onnxruntime/pull/14867/files#diff-b1ea073c326fef46054382117c256f106d39bd7c34539d44c6e6d9e9eacc059c)
- Does the Onnxruntime QDQ node unit support this op? [code example](https://github.com/microsoft/onnxruntime/pull/14867/files#diff-ce0281aaf63e03ecadd592240e41f18742bf8eb095b3725c0e55e589c890946f)
- Is it a layout-sensitive operator?
  - Is it registered in the LayoutTransformer? [code example](https://github.com/microsoft/onnxruntime/blob/6d464748ba7fed2275ecba3a7406298cabc93438/onnxruntime/core/optimizer/transpose_optimizer/transpose_optimizer.cc#L2168)
  - Is the NHWC op schema registered? Example error message: `<lambda_acc29b18d21b7c13448c4952cd957a60>::operator ()] Model face_det_qdq failed to load: Fatal error: com.ms.internal.nhwc:BatchNormalization(9) is not a registered function/op`. [Example PR](https://github.com/microsoft/onnxruntime/pull/15278)

### Example PRs to enable new operators:
- Non-layout-sensitive operator: [Enable Hardsigmoid for QNN EP using SDK direct support](https://github.com/microsoft/onnxruntime/pull/20956)

- Layout-sensitive operator: [Add InstanceNormalization operator to QNN EP](https://github.com/microsoft/onnxruntime/pull/14867)


## Mixed precision support
The following figure demonstrates an example of a mixed-precision model.
<p align="center"><img width="100%" src="../../images/quantization_mixed_precision_1.png" alt="mixed precision model"/></p>
A mixed precision QDQ model consists of regions with different activation/weight quantization data types. The boundary between regions converts between activation quantization data types (e.g., uint8 to uint16) using a DQ to Q sequence.

The ability to specify regions with different quantization data types enables exploring the tradeoffs between accuracy and latency. A higher integer precision may improve accuracy at the expense of latency, so selectively promoting certain regions to a higher precision can aid in achieving a desirable balance in key metrics.

The following figure shows a model with a region that has been promoted to 16-bit from the default 8-bit activation type.
<p align="center"><img width="60%" src="../../images/quantization_mixed_precision_2.png" alt="mixed precision layers"/></p>

This model is quantized to uint8 precision, but tensor "Op4_out" is quantized to 16-bit. This can be achieved by specifying the following initial tensor quantization overrides:

```python
from onnxruntime.quantization import QuantType
# Note: the import path for get_qnn_qdq_config may vary by onnxruntime version.
from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config

# Op4_out could be an inaccurate tensor that should be upgraded to 16-bit.
initial_overrides = {"Op4_out": [{"quant_type": QuantType.QUInt16}]}

# float_model_path (path to the float32 model) and data_reader (a calibration
# data reader) are defined elsewhere.
qnn_config = get_qnn_qdq_config(
    float_model_path,
    data_reader,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QUInt8,
    init_overrides=initial_overrides,  # These initial overrides will be "fixed"
)
```

The above snippet generates the following "fixed" overrides (retrieved via `qnn_config.extra_options["TensorQuantOverrides"]`):

```
overrides = {
    "Op2_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op4"}}}],
    "Op3_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op5"}}}],
    "Op4_out": [{"quant_type": QUInt16}],
    "Op5_out": [{"quant_type": QUInt16, "convert": {"quant_type": QUInt8, "recv_nodes": {"Op6"}}}]
}
```

After the override, the model works like this:

- Op2’s output is consumed by Op4, Op7, and Op8. Op4 consumes the converted u16 type, while Op7 and Op8 consume the original u8 type.
- Op3’s output is converted from u8 to u16. Op5 consumes the converted u16 type.
- Op4’s output is just u16 (not converted).
- Op5’s output is converted from u16 to u8. Op6 consumes the u8 type.

12 changes: 9 additions & 3 deletions docs/execution-providers/TensorRT-ExecutionProvider.md
@@ -20,18 +20,20 @@ The TensorRT execution provider in the ONNX Runtime makes use of NVIDIA's [Tenso
{:toc}

## Install
Please select the GPU (CUDA/TensorRT) version of ONNX Runtime: https://onnxruntime.ai/docs/install. Pre-built packages and Docker images are available for JetPack in the [Jetson Zoo](https://elinux.org/Jetson_Zoo#ONNX_Runtime).

## Build from source
See [Build instructions](../build/eps.md#tensorrt).

## Requirements

Note: Starting with ONNX Runtime 1.19, **CUDA 12** is the default version for ONNX Runtime GPU packages.

| ONNX Runtime | TensorRT | CUDA |
| :----------- | :------- | :------------- |
| main | 10.5 | **12.x**, 11.8 |
| 1.20 | 10.5 | **12.x**, 11.8 |
| 1.19 | 10.2 | **12.x**, 11.8 |
| 1.18 | 10.0 | 11.8, 12.x |
| 1.17 | 8.6 | 11.8, 12.x |
| 1.16 | 8.6 | 11.8 |
@@ -822,3 +824,7 @@ This example shows how to run the Faster R-CNN model on TensorRT execution provi

Please see [this Notebook](https://github.com/microsoft/onnxruntime/blob/main/docs/python/notebooks/onnx-inference-byoc-gpu-cpu-aks.ipynb) for an example of running a model on GPU using ONNX Runtime through Azure Machine Learning Services.
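
As a minimal usage sketch (the model path is a placeholder), a session can prefer the TensorRT EP and fall back to CUDA and CPU:

```python
import onnxruntime as ort

# "model.onnx" is a placeholder; ONNX Runtime tries the providers in the order given.
sess = ort.InferenceSession(
    "model.onnx",
    providers=[
        "TensorrtExecutionProvider",  # try TensorRT first
        "CUDAExecutionProvider",      # fall back to CUDA
        "CPUExecutionProvider",       # final fallback
    ],
)
print(sess.get_providers())
```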

## Known Issues
- The TensorRT 8.6 built-in parser and the TensorRT OSS parser behave differently. Namely, the built-in parser cannot recognize some custom plugin ops that the OSS parser can. See [EfficientNMS_TRT missing attribute class_agnostic w/ TensorRT 8.6](https://github.com/microsoft/onnxruntime/issues/16121).
19 changes: 19 additions & 0 deletions docs/genai/api/c.md
@@ -414,6 +414,25 @@ A pointer to the token sequence
```c
OGA_EXPORT const int32_t* OGA_API_CALL OgaGenerator_GetSequenceData(const OgaGenerator* generator, size_t index);
```

### Set Runtime Option

An API to set runtime options; more parameters will be added to this generic API to support additional runtime options. For example, to terminate the current session, call SetRuntimeOption with the key "terminate_session" and the value "1": `OgaGenerator_SetRuntimeOption(generator, "terminate_session", "1")`

More details on the current runtime options can be found [here](https://github.com/microsoft/onnxruntime-genai/blob/main/documents/Runtime_option.md).

#### Parameters

* Input: generator The generator on which the Runtime option needs to be set
* Input: key The key for setting the runtime option
* Input: value The value for the key provided

#### Returns
`void`

```c
OGA_EXPORT void OGA_API_CALL OgaGenerator_SetRuntimeOption(OgaGenerator* generator, const char* key, const char* value);
```
## Enums and structs
