Upcoming Release Roadmap
Sophie Schoenmeyer edited this page Apr 24, 2024 · 8 revisions
Target Release: Early May 2024
- Windows ARM32 support will be dropped at the source code level
- Python version >=3.8 will be required for build.bat/build.sh (previously >=3.7). Note: If you have Python version <3.8, you can bypass the tools and use CMake directly.
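Since build.bat/build.sh will now refuse older interpreters, a one-line check (a sketch for convenience, not part of the build tooling itself) can confirm up front whether your Python satisfies the new floor:

```shell
# Exits 0 if the interpreter meets the new build.bat/build.sh requirement (>= 3.8),
# non-zero otherwise; run this before invoking the build scripts.
python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 8) else 1)'
```

If this exits non-zero, use CMake directly as noted above rather than the wrapper scripts.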
- ONNX 1.16 support
- CoreML execution provider dependency on coremltools
- Flatbuffers upgrade from 1.12.0 → 23.5.26
- ONNX upgrade from 1.15 → 1.16
- EMSDK upgrade from 3.1.51 → 3.1.57
- Intel neural_speed library upgrade from v0.1.1 → v0.3 with several important bug fixes
- New onnxruntime_CUDA_MINIMAL CMake option for building ONNX Runtime CUDA execution provider without any operations apart from memcpy ops
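A minimal sketch of how the new CMake option might be set when configuring directly with CMake; the source path and the accompanying onnxruntime_USE_CUDA variable here are illustrative assumptions, not prescribed by this roadmap:

```shell
# Configure a CUDA EP build that compiles only memcpy ops (paths are illustrative).
cmake -S onnxruntime/cmake -B build \
  -Donnxruntime_USE_CUDA=ON \
  -Donnxruntime_CUDA_MINIMAL=ON
```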
- Catalyst for macOS build support
- Initial support for RISC-V and three new build options: --rv64, --riscv_toolchain_root, and --riscv_qemu_path
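A hypothetical cross-compile invocation combining the three new options; the toolchain and QEMU paths are assumptions for illustration only:

```shell
# Cross-compile for 64-bit RISC-V; toolchain/QEMU locations are placeholder paths.
./build.sh --config Release --rv64 \
  --riscv_toolchain_root=/opt/riscv \
  --riscv_qemu_path=/usr/bin/qemu-riscv64
```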
- TensorRT EP support for building with protobuf-lite instead of the full version of protobuf
- Some security-related compile/link flags will be moved from the default settings to a new build option: --use_binskim_compliant_compile_flags. Note: All our release binaries are built with this flag, but when building ONNX Runtime from source, it is OFF by default.
- Windows ARM64 build dependency on PyTorch CPUINFO library
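Because the hardened flags will no longer be on by default for source builds, opting back in might look like the following sketch (the --config value is an illustrative assumption):

```shell
# Opt in to the security-related compile/link flags used for official release binaries.
./build.sh --config Release --use_binskim_compliant_compile_flags
```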
- Windows OneCore build will use “Reverse forwarding” apisets instead of “Direct forwarding”, so onnxruntime.dll in our Nuget packages will depend on kernel32.dll. Note: Windows systems without kernel32.dll need to have reverse forwarders (see https://learn.microsoft.com/en-us/windows/win32/apiindex/api-set-loader-operation for more information).
- Additional optimizations related to Dynamo-exported models
- Improved testing infrastructure for EPs developed as shared libraries
- Reserve() exposure in OrtAllocator to allow custom allocators to work when session.use_device_allocator_for_initializers is specified
- Improvements on lock contention due to memory allocations
- Session creation time (graph and graph transformer optimizations) improvements
- New SessionOptions config entry to disable specific transformers and rules
- [C# API] SessionOptions.DisablePerSessionThreads exposure to allow sharing of threadpool between sessions
- 4bit quant support improvements
- MultiheadAttention performance improvements on CPU
- TensorRT EP
- Finalized support for DDS ops
- Python support for passing user CUDA stream
- CUDA graph capture enhancements
- Per-thread context, optimization profile sharing across contexts, and multiple-partition EP context support
- QNN EP
- Mixed 8/16 bit precision configurability per layer (current support is a16w8)
- 4bit weight support for Conv2d and MatMul
- fp16 support
- Encryption support
- Expanded ONNX model zoo validation
- Improved large model support
- Improved language model support
- DirectML 1.15 support
- Additional ONNX operator support
- Additional contrib op support
- ARM64 4-bit quant support
- QNN support on Android
- MacCatalyst support
- visionOS support
- CoreML ML Program model format support
- WebGPU perf improvements
- Additional WebGPU examples
- Additional generative model support
- Buffer management optimizations to reduce memory footprint
- Large Model Training
- Dynamo-exported model optimizations
- Speech model optimizations
- On-Device Training
- SLM training enablement on edge devices
- Block-based KV management implementation
- Additional ORT EP support
- Improved tokenizer and pre-processing
- Additional model support
- Additional language and packaging support
- Perplexity testing
- Improved sampling method and ORT model performance
- Speculative decoding support
Please use the learning roadmap on the home wiki page to build a general understanding of ORT.