Upcoming Release Roadmap

Sophie Schoenmeyer edited this page Apr 24, 2024 · 8 revisions

ORT 1.18

Target Release: Early May 2024

Announcements

  • Windows ARM32 support will be dropped at the source code level
  • Python version >=3.8 will be required for build.bat/build.sh (previously >=3.7). Note: If you have Python version <3.8, you can bypass the tools and use CMake directly.

General

  • ONNX 1.16 support

Build System & Packages

  • CoreML execution provider dependency on coremltools
  • Flatbuffers upgrade from 1.12.0 → 23.5.26
  • ONNX upgrade from 1.15 → 1.16
  • EMSDK upgrade from 3.1.51 → 3.1.57
  • Intel neural_speed library upgrade from v0.1.1 → v0.3 with several important bug fixes
  • New onnxruntime_CUDA_MINIMAL CMake option for building the ONNX Runtime CUDA execution provider with no operators other than memcpy ops
  • Catalyst for macOS build support
  • Initial support for RISC-V and three new build options: --rv64, --riscv_toolchain_root, and --riscv_qemu_path
  • TensorRT EP support for building with protobuf-lite instead of the full version of protobuf
  • Some security-related compile/link flags will be moved from the default settings to a new build option: --use_binskim_compliant_compile_flags. Note: All our release binaries are built with this flag, but it is OFF by default when building ONNX Runtime from source.
  • Windows ARM64 build dependency on PyTorch CPUINFO library
  • Windows OneCore build will use “Reverse forwarding” apisets instead of “Direct forwarding”, so onnxruntime.dll in our Nuget packages will depend on kernel32.dll. Note: Windows systems without kernel32.dll need to have reverse forwarders (see https://learn.microsoft.com/en-us/windows/win32/apiindex/api-set-loader-operation for more information).

Core

  • Additional optimizations related to Dynamo-exported models
  • Improved testing infrastructure for EPs developed as shared libraries
  • Reserve() exposure in OrtAllocator to allow custom allocators to work when session.use_device_allocator_for_initializers is specified
  • Reduced lock contention from memory allocations
  • Faster session creation (graph and graph transformer optimizations)
  • New SessionOptions config entry to disable specific transformers and rules
  • [C# API] SessionOptions.DisablePerSessionThreads exposure to allow sharing of threadpool between sessions

Performance

  • Improved 4-bit quantization support
  • MultiheadAttention performance improvements on CPU
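The 4-bit quantization work above refers to storing weights in 4-bit integers with per-block scales. A minimal sketch of the general technique, assuming an illustrative block size and symmetric scaling (not ORT's exact kernel scheme):

```python
BLOCK = 4  # illustrative block size; real kernels typically use e.g. 32 or 64

def quantize_4bit(weights):
    """Quantize floats to signed 4-bit ints in [-8, 7] with one scale per block."""
    quants, scales = [], []
    for i in range(0, len(weights), BLOCK):
        block = weights[i:i + BLOCK]
        scale = max(abs(w) for w in block) / 7 or 1.0  # guard all-zero blocks
        scales.append(scale)
        quants.extend(max(-8, min(7, round(w / scale))) for w in block)
    return quants, scales

def dequantize_4bit(quants, scales):
    """Recover approximate float weights from 4-bit values and block scales."""
    return [q * scales[i // BLOCK] for i, q in enumerate(quants)]

weights = [0.1, -0.5, 0.25, 0.7, 1.2, -1.1, 0.0, 0.4]
q, s = quantize_4bit(weights)
restored = dequantize_4bit(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Per-block scales keep the quantization error bounded by half a scale step per weight, which is why block size is the main accuracy/compression knob.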

Execution Providers

  • TensorRT EP
    • Finalized support for DDS ops
    • Python support for passing user CUDA stream
    • CUDA graph capture enhancements
    • Per-thread context and optimization profile sharing across contexts
    • EP context support for multiple partitions
  • QNN EP
    • Per-layer mixed 8/16-bit precision configurability (currently a16w8: 16-bit activations with 8-bit weights)
    • 4-bit weight support for Conv2d and MatMul
    • FP16 support
    • Encryption support
    • Expanded ONNX model zoo validation
    • Improved large model support
    • Improved language model support
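The a16w8 configuration mentioned for QNN EP keeps activations at 16-bit precision while weights are stored in 8 bits. A minimal sketch of the idea, assuming symmetric scaling as an illustrative simplification of real quantization parameters:

```python
def quantize(values, bits):
    """Symmetric quantization to a signed integer range."""
    qmax = (1 << (bits - 1)) - 1              # 32767 for 16-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def a16w8_dot(activations, weights):
    """Integer dot product with 16-bit activations and 8-bit weights."""
    qa, sa = quantize(activations, 16)
    qw, sw = quantize(weights, 8)
    acc = sum(a * w for a, w in zip(qa, qw))  # products fit a 32-bit accumulator
    return acc * sa * sw                      # rescale back to float

acts = [0.5, -1.0, 2.0, 0.25]
wts = [0.1, 0.2, -0.3, 0.05]
approx = a16w8_dot(acts, wts)
exact = sum(a * w for a, w in zip(acts, wts))
```

Keeping activations at 16 bits preserves accuracy where dynamic range matters most, while 8-bit weights halve the weight memory relative to a16w16.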

DirectML

  • DirectML 1.15 support
  • Additional ONNX operator support
  • Additional contrib op support

Mobile

  • ARM64 4-bit quantization support
  • QNN support on Android
  • MacCatalyst support
  • visionOS support
  • CoreML ML Program model format support

Web

  • WebGPU performance improvements
  • Additional WebGPU examples
  • Additional generative model support
  • Buffer management optimizations to reduce memory footprint

Training

  • Large Model Training
    • Dynamo-exported model optimizations
    • Speech model optimizations
  • On-Device Training
    • SLM training enablement on edge devices

GenAI

  • Block-based KV management implementation
  • Additional ORT EP support
  • Improved tokenizer and pre-processing
  • Additional model support
  • Additional language and packaging support
  • Perplexity testing
  • Improved sampling methods and ORT model performance
  • Speculative decoding support
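Block-based KV management, the first item above, refers to allocating the KV cache in fixed-size blocks on demand rather than as one contiguous buffer per sequence. A minimal sketch of the idea; the block size and free-list policy here are illustrative, not ORT GenAI's actual design:

```python
BLOCK_TOKENS = 4  # tokens per KV block; real systems use e.g. 16 or 32

class BlockKVManager:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # sequence id -> list of block ids
        self.lengths = {}                    # sequence id -> token count

    def append_token(self, seq):
        """Reserve cache space for one more token of a sequence."""
        n = self.lengths.get(seq, 0)
        if n % BLOCK_TOKENS == 0:            # current block is full (or first token)
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq, []).append(self.free.pop())
        self.lengths[seq] = n + 1

    def release(self, seq):
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq, []))
        self.lengths.pop(seq, None)

mgr = BlockKVManager(num_blocks=8)
for _ in range(6):                           # 6 tokens -> ceil(6/4) = 2 blocks
    mgr.append_token("seq-a")
blocks_used = len(mgr.tables["seq-a"])
mgr.release("seq-a")
```

Allocating per block means a sequence only ever wastes less than one block of cache, so many concurrent generations fit in memory that a contiguous max-length layout would exhaust.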