ONNX Runtime v1.16.0
General
- Support for serialization of models >=2GB
APIs
- New session option to disable the default CPU EP fallback: session.disable_cpu_ep_fallback (see the Python sketch after this list)
- Java
- Support for fp16 and bf16 tensors as inputs and outputs, along with utilities to convert between these and fp32 data. On JDK 20 and newer, the fp16 conversion methods use the JDK's Float.float16ToFloat and Float.floatToFloat16 methods, which can be hardware-accelerated and vectorized on some platforms.
- Support for external initializers so that large models can be instantiated without filesystem access
- C#
- Expose the OrtValue API as the new preferred API for running inference in C#. This reduces garbage collection pressure and exposes direct native memory access via Slice-like interfaces.
- Make Float16 and BFloat16 full-featured fp16 types that support conversion to/from fp32 and expose floating-point properties (e.g., IsNaN, IsInfinity)
- C++
- Make Float16_t and BFloat16_t full-featured fp16 types that support conversion to/from fp32 and expose floating-point properties (e.g., IsNaN, IsInfinity)
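As an illustration of the new CPU EP fallback switch, here is a minimal Python sketch. The config key comes from these notes; the model path and provider choice are placeholders.

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
# "1" disables the implicit CPU EP fallback, so session creation fails
# instead of silently assigning unsupported nodes to the CPU EP.
sess_options.add_session_config_entry("session.disable_cpu_ep_fallback", "1")

# "model.onnx" and the CUDA EP are placeholders for illustration.
session = ort.InferenceSession(
    "model.onnx",
    sess_options=sess_options,
    providers=["CUDAExecutionProvider"],
)
```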
Performance
- Improve LLM quantization accuracy with SmoothQuant
- Support 4-bit quantization on CPU
- Optimize BeamScore to improve BeamSearch performance
- Add FlashAttention v2 support for Attention, MultiHeadAttention and PackedMultiHeadAttention ops
Execution Providers
- CUDA EP
- Initial fp8 support (QDQ, Cast, MatMul)
- Relax CUDA Graph constraints to allow more models to utilize it (see the Python sketch at the end of this section)
- Allow CUDA allocator to be registered with ONNX Runtime externally
- Fixed a build issue with CUDA 12.2 (#16713)
- TensorRT EP
- CUDA Graph support
- Support user-provided CUDA compute stream
- Misc bug fixes and improvements
- OpenVINO EP
- Support OpenVINO 2023.1
- QNN EP
- Enable context binary cache to reduce initialization time
- Support QNN 2.12
- Support for Resize with asymmetric coordinate transformation mode on the HTP backend
- Ops support: Equal, Less, LessOrEqual, Greater, GreaterOrEqual, LayerNorm, Asin, Sign, DepthToSpace, SpaceToDepth
- Support 1D Conv/ConvTranspose
- Misc bug fixes and improvements
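A minimal Python sketch of turning on the CUDA Graph path via the CUDA EP's enable_cuda_graph provider option; the model path, shapes, and I/O names below are placeholder assumptions.

```python
import numpy as np
import onnxruntime as ort

providers = [("CUDAExecutionProvider", {"enable_cuda_graph": "1"})]
session = ort.InferenceSession("model.onnx", providers=providers)

# CUDA Graph capture requires stable input/output addresses, so bind
# preallocated GPU OrtValues rather than passing numpy arrays per run.
x = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 3, 224, 224), dtype=np.float32), "cuda", 0)
y = ort.OrtValue.ortvalue_from_shape_and_type((1, 1000), np.float32, "cuda", 0)

binding = session.io_binding()
binding.bind_ortvalue_input("input", x)
binding.bind_ortvalue_output("output", y)

session.run_with_iobinding(binding)  # first run captures the CUDA graph
session.run_with_iobinding(binding)  # subsequent runs replay it
```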
Mobile
- Initial support for Azure EP
- Dynamic shape support for CoreML
- Improve React Native performance with JSI
- Mobile support for CLIPImageProcessor pre-processing and CLIP scenario
- Swift Package Manager support for ONNX Runtime inference and ONNX Runtime extensions via onnxruntime-swift-package-manager
Web
- WebGPU op coverage improvements (SAM, T5, Whisper)
- WebNN op coverage improvements (SAM, Stable Diffusion)
- Stability/usability improvements for WebGPU
Large model training
- ORTModule + OpenAI Triton integration is now available (see the ORTModule sketch at the end of this section)
- Label sparsity compute optimization support is complete and enabled by default starting with release 1.16
- New experimental embedding sparsity optimizations available (disabled by default).
- Improves training performance of RoBERTa in Transformers by 20-30%
- Other compute optimizations enabled, such as upstream support for Gather/Slice/Reshape
- Optimizations for LLaMAv2 (~10% acceleration) and OpenAI Whisper
- Improvements to the logging and metrics system (initialization overhead, memory usage, statistics convergence tool, etc.)
- PythonOp enhancements: bool and tuple[bool] constants, materialize grads, empty inputs, save in context, customized shape inference, use fully qualified name for export.
- SCELossInternal/SCELossGradInternal CUDA kernels can handle more elements than std::numeric_limits<int32_t>::max().
- Improvements to LayerNorm fusion
- A model cache for the exported ONNX model is introduced to avoid repeatedly exporting a model that has not changed across runs.
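Most of the optimizations above are consumed through the ORTModule wrapper; a minimal sketch follows (the model, data, and hyperparameters are placeholders).

```python
import torch
from onnxruntime.training import ORTModule

# Placeholder model; any torch.nn.Module can be wrapped.
model = torch.nn.Sequential(
    torch.nn.Linear(784, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)
model = ORTModule(model)  # forward/backward now run through ORT

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
inputs = torch.randn(32, 784)
labels = torch.randint(0, 10, (32,))

loss = torch.nn.functional.cross_entropy(model(inputs), labels)
loss.backward()
optimizer.step()
```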
On-Device Training
- iOS support is available starting with this release
- Minimal build now available for On-Device Training. Basic binary size ~1.5 MB
- ORT-Extensions custom op support enabled through onnxblock for on-device training scenarios
ORT Extensions
This ORT release is accompanied by updates to onnxruntime-extensions. Features include:
- New Python API gen_processing_models to export ONNX data-processing models from Hugging Face tokenizers such as LLaMA, CLIP, XLM-RoBERTa, Falcon, BERT, etc. (see the Python sketch after this list)
- New TrieTokenizer operator for RWKV-like LLM models, and other tokenizer operator enhancements.
- New operators for Azure EP compatibility: AzureAudioToText, AzureTextToText, AzureTritonInvoker for Python and NuGet packages.
- Processing operators have been migrated to the new Lite Custom Op API
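A minimal sketch of the gen_processing_models API, assuming the transformers package is available; the tokenizer choice and the tuple return shape are assumptions to check against the onnxruntime-extensions docs.

```python
import onnx
import onnxruntime as ort
from onnxruntime_extensions import gen_processing_models, get_library_path
from transformers import AutoTokenizer

# BERT is used for illustration; other supported tokenizers work similarly.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# pre_kwargs={} requests a pre-processing (tokenization) model with default
# settings; the first element of the returned tuple is that model (assumed).
pre_model = gen_processing_models(tokenizer, pre_kwargs={})[0]
onnx.save(pre_model, "bert_tokenizer.onnx")

# The exported graph uses custom ops, so register the extensions library.
so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())
session = ort.InferenceSession("bert_tokenizer.onnx", so)
```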
Known Issues
- The ORT CPU Python package requires the execution provider to be explicitly provided when creating an InferenceSession. See #17631. A fix is in progress and will be released in a patch; a workaround sketch follows.
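Until the patch ships, the workaround is to name the provider explicitly when creating the session (the model path is a placeholder):

```python
import onnxruntime as ort

# Explicitly passing providers avoids the error described in #17631.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
```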
Contributions
Contributors to ONNX Runtime include members across teams at Microsoft, along with our community members:
fs-eire, edgchen1, snnn, pengwa, mszhanyi, PeixuanZuo, tianleiwu, adrianlizarraga, baijumeswani, cloudhan, satyajandhyala, yuslepukhin, RandyShuai, RandySheriffH, skottmckay, Honry, dependabot[bot], HectorSVC, jchen351, chilo-ms, YUNQIUGUO, justinchuby, PatriceVignola, guschmue, yf711, Craigacp, smk2007, RyanUnderhill, jslhcl, wschin, kunal-vaishnavi, mindest, xadupre, fdwr, hariharans29, AdamLouly, wejoncy, chenfucn, pranavsharma, yufenglee, zhijxu-MS, jeffdaily, natke, jeffbloo, liqunfu, wangyems, er3x3, nums11, yihonglyu, sumitsays, zhanghuanrong, askhade, wenbingl, jingyanwangms, ashari4, gramalingam, georgen117, sfatimar, BowenBao, hanbitmyths, stevenlix, jywu-msft