ONNX Runtime v1.9.0
Announcements
- GCC version < 7 is no longer supported
- CMAKE_SYSTEM_PROCESSOR needs to be set when cross-compiling on Linux because PyTorch cpuinfo was introduced as a dependency for ARM big.LITTLE support. Set it to the output of `uname -m` on your target device.
General
- ONNX 1.10 support
- opset 15
- ONNX IR 8 (SparseTensor type, model-local FunctionProtos, and Optional type; not yet fully supported in this release)
- Improved documentation of C/C++ APIs
- IBM Power support
- WinML - DLL dependency fix supports learning models on Windows 8.1
- Support for sub-building onnxruntime-extensions and statically linking it into the onnxruntime binary for custom builds
- Added the `--use_extensions` build option to run models with custom operators implemented in onnxruntime-extensions (see the sketch below)
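A minimal Python sketch of that flow, assuming the onnxruntime-extensions package is installed (the model filename is a placeholder):

```python
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

# Point the session at the shared library containing the custom operators.
so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())

# The session can now resolve ops implemented in onnxruntime-extensions.
sess = ort.InferenceSession("model_with_custom_ops.onnx", so,
                            providers=["CPUExecutionProvider"])
```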
APIs
- Registration of a custom allocator for sharing between multiple sessions (see the RegisterAllocator and UnregisterAllocator APIs in onnxruntime_c_api.h, and the Python sketch after this list)
- SessionOptionsAppendExecutionProvider_TensorRT API is deprecated; use SessionOptionsAppendExecutionProvider_TensorRT_V2
- New APIs: SessionOptionsAppendExecutionProvider_TensorRT_V2, CreateTensorRTProviderOptions, UpdateTensorRTProviderOptions, GetTensorRTProviderOptionsAsString, ReleaseTensorRTProviderOptions, EnableOrtCustomOps, RegisterAllocator, UnregisterAllocator, IsSparseTensor, CreateSparseTensorAsOrtValue, FillSparseTensorCoo, FillSparseTensorCsr, FillSparseTensorBlockSparse, CreateSparseTensorWithValuesAsOrtValue, UseCooIndices, UseCsrIndices, UseBlockSparseIndices, GetSparseTensorFormat, GetSparseTensorValuesTypeAndShape, GetSparseTensorValues, GetSparseTensorIndicesTypeShape, GetSparseTensorIndices
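The allocator-sharing mechanism is also reachable from Python; a minimal sketch, assuming the documented environment-allocator workflow (model paths are placeholders):

```python
import onnxruntime as ort

# Create one CPU arena allocator in the environment so it can be shared
# across sessions (this mirrors RegisterAllocator in onnxruntime_c_api.h).
ort.create_and_register_allocator(
    ort.OrtMemoryInfo("Cpu", ort.OrtAllocatorType.ORT_ARENA_ALLOCATOR,
                      0, ort.OrtMemType.DEFAULT),
    None,  # use the default arena configuration
)

# Each session opts in to the environment (shared) allocators.
so = ort.SessionOptions()
so.add_session_config_entry("session.use_env_allocators", "1")
sess_a = ort.InferenceSession("model_a.onnx", so, providers=["CPUExecutionProvider"])
sess_b = ort.InferenceSession("model_b.onnx", so, providers=["CPUExecutionProvider"])
```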
Performance and quantization
- Performance improvement on ARM
- Added S8S8 (signed int8, signed int8) matmul kernel. This avoids extending uint8 to int16, improving performance on ARM64 devices without dot-product instructions
- Expanded GEMM udot kernel to 8x8 accumulator
- Added sgemm and qgemm optimized kernels for ARM64EC
- Operator improvements
- Improved performance for quantized operators: DynamicQuantizeLSTM, QLinearAvgPool
- Added new quantized operator QGemm for quantizing Gemm directly
- Fused HardSigmoid and Conv
- Quantization tool - subgraph support (see the Python sketch after this list)
- Transformers tool improvements
- Fused Attention for BART encoder and Megatron GPT-2
- Integrated mixed precision ONNX conversion and parity test for GPT-2
- Updated graph fusion for embed layer normalization for BERT
- Improved symbolic shape inference for operators: Attention, EmbedLayerNormalization, Einsum and Reciprocal
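As an example of driving the quantization tool from Python, a minimal dynamic-quantization sketch (model paths are placeholders; with this release the tool can also descend into subgraphs such as If/Loop bodies):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the model's weights to signed int8; subgraphs are now traversed too.
quantize_dynamic(
    model_input="model_fp32.onnx",   # placeholder input path
    model_output="model_int8.onnx",  # placeholder output path
    weight_type=QuantType.QInt8,
)
```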
Packages
- Official ORT GPU packages (except Python) now include both CUDA and TensorRT Execution Providers.
- Python packages will be updated next release. Note that EPs should be explicitly registered to ensure the correct provider is used (see the Python sketch after this list).
- GPU packages are built with CUDA 11.4 and should be compatible with 11.x on systems with the minimum required driver version. See: CUDA minor version compatibility
- PyPI
- ORT + DirectML Python packages now available: onnxruntime-directml
- GPU package can be used on both CPU-only and GPU machines
- NuGet
- C#: Added support for using netstandard2.0 as a target framework
- Windows symbol (PDB) files are no longer included in the NuGet package, reducing the size of the binary NuGet package by 85%. To download them, please see the artifacts below on GitHub.
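A minimal sketch of explicit EP registration in Python against the combined GPU package (the model path is a placeholder):

```python
import onnxruntime as ort

# Providers are tried in priority order; listing CPU last keeps the GPU
# package usable on CPU-only machines.
sess = ort.InferenceSession(
    "model.onnx",
    providers=[
        "TensorRTExecutionProvider",  # needs TensorRT libraries on the system
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)

# With the onnxruntime-directml package, request "DmlExecutionProvider" instead.
```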
Execution Providers
- CUDA EP
- TensorRT EP
- Added support for TensorRT 8.0 (x64 Windows/Linux, ARM Jetson), which includes new TensorRT explicit-quantization features (ONNX Q/DQ support); see the Python sketch after this list
- General fixes and quality improvements
- OpenVINO EP
- Added support for OpenVINO 2021.4
- DirectML EP
- Bug fix for Identity with non-float inputs, which affected the DynamicQuantizeLinear ONNX backend test
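In Python, TensorRT EP options corresponding to the new *_TensorRT_V2 option APIs can be supplied as a provider-options dictionary; a minimal sketch, assuming the documented keys trt_max_workspace_size and trt_fp16_enable (the model path is a placeholder):

```python
import onnxruntime as ort

# Provider-specific options ride along as a (name, options) tuple.
trt_options = {
    "trt_max_workspace_size": 2147483648,  # 2 GB for engine building
    "trt_fp16_enable": True,               # allow FP16 kernels where supported
}
sess = ort.InferenceSession(
    "model.onnx",
    providers=[
        ("TensorRTExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
```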
ORT Web
- WebAssembly
- SIMD (Single Instruction, Multiple Data) support
- Option to load WebAssembly from a worker thread to avoid blocking the main UI thread
- wasm file path override
- WebGL
- Simpler workflow for WebGL kernel implementation
- Improved performance with Conv kernel enhancement
ORT Mobile
- Added more example mobile apps
- CoreML and NNAPI EP enhancements
- Reduced peak memory usage when initializing session with ORT format model as bytes
- Enhanced partitioning to improve performance when using NNAPI and CoreML
- Reduced the number of NNAPI/CoreML partitions required
- Added the ability to force CPU execution of the post-processing in SSD models
- Improves performance by avoiding an expensive device copy to/from the NPU for the cheap post-processing section of the model
- Changed the iOS package to use xcframework
- Supports running on the arm64 iPhone simulator on Macs with Apple silicon
ORT Training
- Expanded the supported input formats to include dictionaries and lists
- Enable user defined autograd functions
- Support for fallback to PyTorch for execution
- Added support for deterministic compute to enable reproducibility with ORTModule
- Added DebugOptions and LogLevel to the ORTModule API to improve debuggability (see the Python sketch after this list)
- Improvements and additions to kernels/gradients: Concat, Split, MatMul, ReluGrad, PadOp, Tile, BatchNormInternal
- Support for ROCm 4.3.1 on AMD GPU
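A minimal ORTModule sketch illustrating the DebugOptions/LogLevel addition, assuming the onnxruntime-training package (the toy model and onnx_prefix are placeholders):

```python
import torch
from onnxruntime.training.ortmodule import DebugOptions, LogLevel, ORTModule

# Wrap any PyTorch module; unsupported constructs can fall back to PyTorch.
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
model = ORTModule(model, DebugOptions(log_level=LogLevel.INFO,
                                      save_onnx=True, onnx_prefix="demo"))

# Forward/backward run through ORT; inputs may also be dictionaries or
# lists of tensors where the wrapped module's forward() accepts them.
loss = model(torch.randn(16, 4)).sum()
loss.backward()
```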
Contributions
Contributors to ONNX Runtime include members across teams at Microsoft, along with our community members:
edgchen1, gwang-msft, tianleiwu, fs-eire, hariharans29, skottmckay, baijumeswani, RyanUnderhill, iK1D, souptc, nkreeger, liqunfu, pengwa, SherlockNoMad, wangyems, chilo-ms, thiagocrepaldi, KeDengMS, suffiank, oliviajain, chenfucn, satyajandhyala, yuslepukhin, pranavsharma, tracysh, yufenglee, hanbitmyths, ytaous, YUNQIUGUO, zhanghuanrong, stevenlix, jywu-msft, chandru-r, duli2012, smk2007, wschin, MaajidKhan, tiagoshibata, xadupre, RandySheriffH, ashbhandare, georgen117, Tixxx, harshithapv, Craigacp, BowenBao, askhade, zhangxiang1993, gramalingam, weixingzhang, natke, tlh20, codemzs, ryanlai2, raviskolli, pranav-prakash, faxu, adtsai, fdwr, wenbingl, jcwchen, neginraoof, cschreib-ibex