ONNX Runtime v1.16.0
General
- Support for serialization of models >=2GB
APIs
- New session option to disable the default CPU EP fallback: session.disable_cpu_ep_fallback (see the Python sketch after this list)
- Java
- Support for fp16 and bf16 tensors as inputs and outputs, along with utilities to convert between these and fp32 data. On JDK 20 and newer, the fp16 conversion methods use the JDK's Float.float16ToFloat and Float.floatToFloat16 methods, which can be hardware-accelerated and vectorized on some platforms.
- Support for external initializers so that large models can be instantiated without filesystem access
- C#
- Expose the OrtValue API as the new preferred API for running inference in C#. This reduces garbage collection pressure and exposes direct native memory access via Slice-like interfaces.
- Make Float16 and BFloat16 full-featured fp16 types that support conversion to/from fp32 and expose floating-point properties (e.g., IsNaN, IsInfinity)
- C++
- Make Float16_t and BFloat16_t full-featured fp16 types that support conversion to/from fp32 and expose floating-point properties (e.g., IsNaN, IsInfinity)
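As an illustration of the new CPU EP fallback switch, here is a minimal Python sketch. The config key comes from these notes; the model path and provider choice are placeholders.

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
# "1" disables the implicit CPU EP fallback, so session creation fails
# instead of silently assigning unsupported nodes to the CPU EP.
sess_options.add_session_config_entry("session.disable_cpu_ep_fallback", "1")

# "model.onnx" and the CUDA EP are placeholders for illustration.
session = ort.InferenceSession(
    "model.onnx",
    sess_options=sess_options,
    providers=["CUDAExecutionProvider"],
)
```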
Performance
- Improve LLM quantization accuracy with SmoothQuant
- Support 4-bit quantization on CPU
- Optimize BeamScore to improve BeamSearch performance
- Add FlashAttention v2 support for Attention, MultiHeadAttention and PackedMultiHeadAttention ops
Execution Providers
- CUDA EP
- Initial fp8 support (QDQ, Cast, MatMul)
- Relax CUDA Graph constraints to allow more models to utilize it (see the Python sketch at the end of this section)
- Allow CUDA allocator to be registered with ONNX Runtime externally
- Fixed a build issue with CUDA 12.2 (#16713)
- TensorRT EP
- CUDA Graph support
- Support user-provided CUDA compute stream
- Misc bug fixes and improvements
- OpenVINO EP
- Support OpenVINO 2023.1
- QNN EP
- Enable context binary cache to reduce initialization time
- Support QNN 2.12
- Support for Resize with asymmetric coordinate transformation mode on the HTP backend
- Ops support: Equal, Less, LessOrEqual, Greater, GreaterOrEqual, LayerNorm, Asin, Sign, DepthToSpace, SpaceToDepth
- Support 1D Conv/ConvTranspose
- Misc bug fixes and improvements
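A minimal Python sketch of turning on the CUDA Graph path via the CUDA EP's enable_cuda_graph provider option; the model path, shapes, and I/O names below are placeholder assumptions.

```python
import numpy as np
import onnxruntime as ort

providers = [("CUDAExecutionProvider", {"enable_cuda_graph": "1"})]
session = ort.InferenceSession("model.onnx", providers=providers)

# CUDA Graph capture requires stable input/output addresses, so bind
# preallocated GPU OrtValues rather than passing numpy arrays per run.
x = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 3, 224, 224), dtype=np.float32), "cuda", 0)
y = ort.OrtValue.ortvalue_from_shape_and_type((1, 1000), np.float32, "cuda", 0)

binding = session.io_binding()
binding.bind_ortvalue_input("input", x)
binding.bind_ortvalue_output("output", y)

session.run_with_iobinding(binding)  # first run captures the CUDA graph
session.run_with_iobinding(binding)  # subsequent runs replay it
```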
Mobile
- Initial support for Azure EP
- Dynamic shape support for CoreML
- Improve React Native performance with JSI
- Mobile support for CLIPImageProcessor pre-processing and CLIP scenario
- Swift Package Manager support for ONNX Runtime inference and ONNX Runtime extensions via onnxruntime-swift-package-manager
Web
- WebGPU op coverage improvements (SAM, T5, Whisper)
- WebNN op coverage improvements (SAM, Stable Diffusion)
- Stability/usability improvements for WebGPU
Large model training
- ORTModule + OpenAI Triton integration is now available (see the ORTModule sketch at the end of this section)
- Label sparsity compute optimization support is complete and enabled by default starting with release 1.16
- New experimental embedding sparsity optimizations available (disabled by default).
- Improves training performance of RoBERTa in Transformers by 20-30%
- Other compute optimizations enabled, such as upstream support for Gather/Slice/Reshape
- Optimizations for LLaMAv2 (~10% acceleration) and OpenAI Whisper
- Improvements to the logging and metrics system (initialization overhead, memory usage, statistics convergence tool, etc.)
- PythonOp enhancements: bool and tuple[bool] constants, materialize grads, empty inputs, save in context, customized shape inference, use fully qualified name for export.
- SCELossInternal/SCELossGradInternal CUDA kernels can handle more elements than std::numeric_limits<int32_t>::max().
- Improvements to LayerNorm fusion
- A model cache for the exported ONNX model is introduced to avoid repeatedly exporting a model that has not changed across runs.
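Most of the optimizations above are consumed through the ORTModule wrapper; a minimal sketch follows (the model, data, and hyperparameters are placeholders).

```python
import torch
from onnxruntime.training import ORTModule

# Placeholder model; any torch.nn.Module can be wrapped.
model = torch.nn.Sequential(
    torch.nn.Linear(784, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)
model = ORTModule(model)  # forward/backward now run through ORT

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
inputs = torch.randn(32, 784)
labels = torch.randint(0, 10, (32,))

loss = torch.nn.functional.cross_entropy(model(inputs), labels)
loss.backward()
optimizer.step()
```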
On-Device Training
- iOS support is available starting with this release
- Minimal build now available for On-Device Training. Basic binary size ~1.5 MB
- ORT-Extensions custom op support enabled through onnxblock for on-device training scenarios
ORT Extensions
This ORT release is accompanied by updates to onnxruntime-extensions. Features include:
- New Python API gen_processing_models to export ONNX data-processing models from Hugging Face tokenizers such as LLaMA, CLIP, XLM-RoBERTa, Falcon, BERT, etc. (see the Python sketch after this list)
- New TrieTokenizer operator for RWKV-like LLM models, and other tokenizer operator enhancements.
- New operators for Azure EP compatibility: AzureAudioToText, AzureTextToText, AzureTritonInvoker for Python and NuGet packages.
- Processing operators have been migrated to the new Lite Custom Op API
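A minimal sketch of the gen_processing_models API, assuming the transformers package is available; the tokenizer choice and the tuple return shape are assumptions to check against the onnxruntime-extensions docs.

```python
import onnx
import onnxruntime as ort
from onnxruntime_extensions import gen_processing_models, get_library_path
from transformers import AutoTokenizer

# BERT is used for illustration; other supported tokenizers work similarly.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# pre_kwargs={} requests a pre-processing (tokenization) model with default
# settings; the first element of the returned tuple is that model (assumed).
pre_model = gen_processing_models(tokenizer, pre_kwargs={})[0]
onnx.save(pre_model, "bert_tokenizer.onnx")

# The exported graph uses custom ops, so register the extensions library.
so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())
session = ort.InferenceSession("bert_tokenizer.onnx", so)
```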
Known Issues
- The ORT CPU Python package requires the execution provider to be explicitly provided when creating an InferenceSession. See #17631. A fix is in progress and will be released in a patch; a workaround sketch follows.
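Until the patch ships, the workaround is to name the provider explicitly when creating the session (the model path is a placeholder):

```python
import onnxruntime as ort

# Explicitly passing providers avoids the error described in #17631.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
```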
Contributions
Contributors to ONNX Runtime include members across teams at Microsoft, along with our community members:
fs-eire, edgchen1, snnn, pengwa, mszhanyi, PeixuanZuo, tianleiwu, adrianlizarraga, baijumeswani, cloudhan, satyajandhyala, yuslepukhin, RandyShuai, RandySheriffH, skottmckay, Honry, dependabot[bot], HectorSVC, jchen351, chilo-ms, YUNQIUGUO, justinchuby, PatriceVignola, guschmue, yf711, Craigacp, smk2007, RyanUnderhill, jslhcl, wschin, kunal-vaishnavi, mindest, xadupre, fdwr, hariharans29, AdamLouly, wejoncy, chenfucn, pranavsharma, yufenglee, zhijxu-MS, jeffdaily, natke, jeffbloo, liqunfu, wangyems, er3x3, nums11, yihonglyu, sumitsays, zhanghuanrong, askhade, wenbingl, jingyanwangms, ashari4, gramalingam, georgen117, sfatimar, BowenBao, hanbitmyths, stevenlix, jywu-msft