Upcoming Release Roadmap
Sophie Schoenmeyer edited this page Apr 24, 2024 · 8 revisions
Target Release: Early May 2024
- Windows ARM32 support will be dropped at the source code level
- Python version >=3.8 will be required for build.bat/build.sh (previously >=3.7). Note: If you have Python version <3.8, you can bypass the tools and use CMake directly.
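Since build.bat/build.sh will now refuse older interpreters, a one-line check (a sketch for convenience, not part of the build tooling itself) can confirm up front whether your Python satisfies the new floor:

```shell
# Exits 0 if the interpreter meets the new build.bat/build.sh requirement (>= 3.8),
# non-zero otherwise; run this before invoking the build scripts.
python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 8) else 1)'
```

If this exits non-zero, use CMake directly as noted above rather than the wrapper scripts.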
- ONNX 1.16 support
- CoreML execution provider dependency on coremltools
- Flatbuffers upgrade from 1.12.0 → 23.5.26
- ONNX upgrade from 1.15 → 1.16
- EMSDK upgrade from 3.1.51 → 3.1.57
- Intel neural_speed library upgrade from v0.1.1 → v0.3 with several important bug fixes
- New onnxruntime_CUDA_MINIMAL CMake option for building ONNX Runtime CUDA execution provider without any operations apart from memcpy ops
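A minimal sketch of how the new CMake option might be set when configuring directly with CMake; the source path and the accompanying onnxruntime_USE_CUDA variable here are illustrative assumptions, not prescribed by this roadmap:

```shell
# Configure a CUDA EP build that compiles only memcpy ops (paths are illustrative).
cmake -S onnxruntime/cmake -B build \
  -Donnxruntime_USE_CUDA=ON \
  -Donnxruntime_CUDA_MINIMAL=ON
```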
- Catalyst for macOS build support
- Initial support for RISC-V and three new build options: --rv64, --riscv_toolchain_root, and --riscv_qemu_path
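A hypothetical cross-compile invocation combining the three new options; the toolchain and QEMU paths are assumptions for illustration only:

```shell
# Cross-compile for 64-bit RISC-V; toolchain/QEMU locations are placeholder paths.
./build.sh --config Release --rv64 \
  --riscv_toolchain_root=/opt/riscv \
  --riscv_qemu_path=/usr/bin/qemu-riscv64
```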
- TensorRT EP support for building with protobuf-lite instead of the full version of protobuf
- Some security-related compile/link flags will be moved from the default settings to a new build option: --use_binskim_compliant_compile_flags. Note: All our release binaries are built with this flag, but when building ONNX Runtime from source, it is OFF by default.
- Windows ARM64 build dependency on PyTorch CPUINFO library
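Because the hardened flags will no longer be on by default for source builds, opting back in might look like the following sketch (the --config value is an illustrative assumption):

```shell
# Opt in to the security-related compile/link flags used for official release binaries.
./build.sh --config Release --use_binskim_compliant_compile_flags
```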
- Windows OneCore build will use “Reverse forwarding” apisets instead of “Direct forwarding”, so onnxruntime.dll in our Nuget packages will depend on kernel32.dll. Note: Windows systems without kernel32.dll need to have reverse forwarders (see https://learn.microsoft.com/en-us/windows/win32/apiindex/api-set-loader-operation for more information).
- Additional optimizations related to Dynamo-exported models
- Improved testing infrastructure for EPs developed as shared libraries
- Reserve() exposure in OrtAllocator to allow custom allocators to work when session.use_device_allocator_for_initializers is specified
- Improvements on lock contention due to memory allocations
- Session creation time (graph and graph transformer optimizations) improvements
- New SessionOptions config entry to disable specific transformers and rules
- [C# API] SessionOptions.DisablePerSessionThreads exposure to allow sharing of threadpool between sessions
- 4bit quant support improvements
- MultiheadAttention performance improvements on CPU
- TensorRT EP
- Finalized support for DDS ops
- Python support for passing user CUDA stream
- CUDA graph capture enhancements
- Per-thread context, optimization profile sharing across contexts, and multiple-partition EP context support
- QNN EP
- Mixed 8/16 bit precision configurability per layer (current support is a16w8)
- 4bit weight support for Conv2d and MatMul
- fp16 support
- Encryption support
- Expanded ONNX model zoo validation
- Improved large model support
- Improved language model support
- DirectML 1.15 support
- Additional ONNX operator support
- Additional contrib op support
- ARM64 4-bit quant support
- QNN support on Android
- MacCatalyst support
- visionOS support
- CoreML ML Program model format support
- WebGPU perf improvements
- Additional WebGPU examples
- Additional generative model support
- Buffer management optimizations to reduce memory footprint
- Large Model Training
- Dynamo-exported model optimizations
- Speech model optimizations
- On-Device Training
- SLM training enablement on edge devices
- Block-based KV management implementation
- Additional ORT EP support
- Improved tokenizer and pre-processing
- Additional model support
- Additional language and packaging support
- Perplexity testing
- Improved sampling method and ORT model performance
- Speculative decoding support
Please use the learning roadmap on the home wiki page to build a general understanding of ORT.