ONNX Runtime v1.14.0
Announcements
- Building ORT from source will require CMake version >=3.24 (previously >=3.18).
General
- ONNX 1.13 support (opset 18)
- Threading
- New custom operator APIs
- Multi-stream Execution Provider refactoring
- Improves GPU utilization by running parallel inference requests on different GPU streams. Updated for CUDA, TensorRT, and ROCm execution providers
- Improves memory efficiency by enabling GPU memory reuse across different streams
- Enables Execution Provider developers to customize their stream implementations by providing a "Stream" interface in the ExecutionProvider API
- [Preview] Rust API for ORT - not part of the release branch but available to build from main.
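The multi-stream refactoring above assigns parallel inference requests to different device streams so they can run concurrently. A minimal Python sketch of the idea, assuming a hypothetical round-robin `StreamDispatcher` (ORT's real mechanism is the C++ "Stream" interface in the ExecutionProvider API, not a Python class):

```python
from itertools import cycle

class StreamDispatcher:
    """Round-robin dispatcher: spreads inference requests across N streams.

    Illustrative only -- names and structure here are hypothetical,
    not ORT's actual API.
    """

    def __init__(self, num_streams):
        self.streams = [f"stream-{i}" for i in range(num_streams)]
        self._next = cycle(self.streams)

    def dispatch(self, request_id):
        # Each request is bound to the next stream in round-robin order,
        # so concurrent requests land on different streams.
        return (request_id, next(self._next))

dispatcher = StreamDispatcher(num_streams=2)
assignments = [dispatcher.dispatch(i) for i in range(4)]
```

Because memory can now be reused across streams, spreading requests this way improves both GPU utilization and memory efficiency.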
Performance
- Support for quantization with AMX on Sapphire Rapids processors
- CUDA EP performance improvements:
- Improved performance of transformer models and decoding methods: beam search, greedy search, and top-p sampling
- Stable Diffusion model optimizations
- Changed the cudnn_conv_use_max_workspace default value to 1
- Performance improvements to GRU and Slice operators
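For context on the decoding methods listed above, top-p (nucleus) sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches p, then samples within that set. A generic pure-Python sketch of the algorithm (ORT's actual implementation is an optimized CUDA kernel, not this code):

```python
import math
import random

def top_p_sample(logits, p=0.9, rng=None):
    """Nucleus (top-p) sampling over a list of raw logits."""
    rng = rng or random.Random()
    # Softmax over the logits (subtract max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort token ids by probability, descending.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Keep tokens until cumulative probability reaches p (the "nucleus").
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:
            break
    # Renormalize within the nucleus and sample.
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

With a sharply peaked distribution the nucleus collapses to the single most likely token, which is what makes top-p cheap relative to sampling over the full vocabulary.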
Execution Providers
- TensorRT EP
- Adds support for TensorRT 8.5 GA versions
- Bug fixes
- OpenVINO EP
- Adds support for OpenVINO 2022.3
- DirectML EP:
- Updated to DML 1.10.1
- Additional operators: NonZero, Shape, Size, Attention, EmbedLayerNorm, SkipLayerNorm, BiasGelu
- Additional data type support for existing operators: Abs, Sign, Where
- Enable SetOptimizedFilePath export/reload
- Bug fixes/extensions: allow squeeze-13 axes, EinSum with MatMul NHCW
- ROCm EP: 5.4 support and GA ready
- [Preview] Azure EP - supports AzureML-hosted models using Triton for hybrid inferencing on-device and in the cloud
Mobile
- Pre/Post processing
- Support updating mobilenet and super resolution models to move the pre- and post-processing into the model, including usage of custom ops for conversion to/from JPEG/PNG
- onnxruntime-extensions python package includes the model update script to add pre/post processing to the model
- See example model update usage
- [Coming soon] onnxruntime-extensions packages for Android and iOS with DecodeImage and EncodeImage custom ops
- Updated the onnxruntime inference examples to demonstrate end-to-end usage with onnxruntime-extensions package
- XNNPACK
- Added support for additional commonly used operators
- Added iOS build support
- XNNPACK EP is now included in the onnxruntime-c iOS package
- Added support for using the ORT allocator in XNNPACK kernels to minimize memory usage
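Moving pre/post processing into the model (as described above) removes host-side steps such as the classic scale-and-normalize. A simplified single-channel Python sketch of what that host-side step looks like before it is folded into the model graph; real pipelines use per-channel mean/std and the DecodeImage custom op to ingest raw image bytes:

```python
def preprocess(pixels, mean=0.485, std=0.229):
    """Mobilenet-style preprocessing: scale uint8 pixels to [0, 1],
    then normalize. When this step moves into the model, the app
    passes raw image data instead of doing this on the host.
    Single-channel simplification for illustration.
    """
    return [((p / 255.0) - mean) / std for p in pixels]

normalized = preprocess([0, 128, 255])
```

Folding this into the model means every platform (Android, iOS, web) gets identical pre-processing for free, rather than reimplementing it per app.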
Web
- onnxruntime-extensions included in default ort-web build (NLP centric)
- XNNPACK Gemm
- Improved exception handling
- New utility functions (experimental) to help with exchanging data between images and tensors.
Training
- Performance optimizations and bug fixes for Hugging Face models (e.g. Xlnet and Bloom)
- Stable diffusion optimizations for training, including support for Resize and InstanceNorm gradients and addition of ORT-enabled examples to the diffusers library
- FP16 optimizer exposed in torch-ort (details)
- Bug fixes for Hugging Face models
Known Issues
- The Microsoft.ML.OnnxRuntime.DirectML package name includes a -dev-* suffix. This build is functionally equivalent to the release branch build; a patch is in progress.
Contributions
Contributors to ONNX Runtime include members across teams at Microsoft, along with our community members:
snnn, skottmckay, edgchen1, hariharans29, tianleiwu, yufenglee, guoyu-wang, yuslepukhin, fs-eire, pranavsharma, iK1D, baijumeswani, tracysh, thiagocrepaldi, askhade, RyanUnderhill, wangyems, fdwr, RandySheriffH, jywu-msft, zhanghuanrong, smk2007, pengwa, liqunfu, shahasad, mszhanyi, SherlockNoMad, xadupre, jignparm, HectorSVC, ytaous, weixingzhang, stevenlix, tiagoshibata, faxu, wschin, souptc, ashbhandare, RandyShuai, chilo-ms, PeixuanZuo, cloudhan, dependabot[bot], jeffbloo, chenfucn, linkerzhang, duli2012, codemzs, oliviajain, natke, YUNQIUGUO, Craigacp, sumitsays, orilevari, BowenBao, yangchen-MS, hanbitmyths, satyajandhyala, MaajidKhan, smkarlap, sfatimar, jchen351, georgen117, wejoncy, PatriceVignola, adrianlizarraga, justinchuby, zhangxiang1993, gineshidalgo99, tlh20, xzhu1900, jeffdaily, suryasidd, yihonglyu, liuziyue, chentaMS, jcwchen, ybrnathan, ajindal1, zhijxu-MS, gramalingam, WilBrady, garymm, kkaranasos, ashari4, martinb35, AdamLouly, zhangyaobit, vvchernov, jingyanwangms, wenbingl, daquexian, sreekanth-yalachigere, NonStatic2014, mayavijx, mindest, jstoecker, manashgoswami, Andrews548, baowenlei, kunal-vaishnavi