
Conversation

@onuralpszr onuralpszr commented Jan 10, 2026

This pull request introduces several significant improvements and new features to the Ultralytics inference Rust library, focusing on enhanced performance, usability, and expanded functionality. Major highlights include support for rectangular and batch inference, improved hardware acceleration options, expanded CLI arguments, and optimizations for preprocessing and post-processing. The documentation and example outputs have also been updated to reflect these changes.

New Features and CLI Enhancements:

  • Added rectangular inference (--rect) and batch inference (--batch) support, both configurable via the CLI and passed through the inference pipeline (see the CLI sketch after this list).
  • Raised the default NMS IoU threshold to 0.7 and the default maximum detections to 300, and exposed both as CLI arguments.
  • Expanded device selection to cover more hardware acceleration backends (CUDA, TensorRT, CoreML, OpenVINO, XNNPACK), with improved CLI help and examples.
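
To make the new options concrete, here is a minimal sketch of the flag set, assuming a clap-style derive CLI; the struct and field names are illustrative, and only the defaults (rect on, iou = 0.7, max-det = 300) come from this PR:

```rust
use clap::Parser;

/// Hypothetical argument struct, not the crate's actual definition.
#[derive(Parser, Debug)]
struct PredictArgs {
    /// Rectangular inference: pad only to the model stride instead of a full
    /// square. Enabled by default; pass `--rect false` to disable.
    #[arg(long, action = clap::ArgAction::Set, default_value_t = true)]
    rect: bool,

    /// Number of images per inference batch.
    #[arg(long, default_value_t = 1)]
    batch: usize,

    /// IoU threshold for NMS (default raised from 0.45 to 0.7).
    #[arg(long, default_value_t = 0.7)]
    iou: f32,

    /// Maximum detections kept per image (exposed as --max-det).
    #[arg(long, default_value_t = 300)]
    max_det: usize,

    /// Execution provider: cpu, cuda, tensorrt, coreml, openvino, xnnpack.
    #[arg(long, default_value = "cpu")]
    device: String,
}

fn main() {
    println!("{:?}", PredictArgs::parse());
}
```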

Performance and Preprocessing Optimizations:

  • Added SIMD-accelerated preprocessing via the wide crate and an LRU cache for preprocessing LUTs, speeding up repeated image handling (sketched after this list).
  • Switched release builds to "fat" LTO for better optimization.
  • On Linux, configured an RPATH in .cargo/config.toml to simplify shared-library loading.
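
A minimal sketch of the two techniques named above, using the wide and lru crates this PR adds; the function shapes and cache sizing are illustrative, not the crate's actual preprocessing code:

```rust
use std::num::NonZeroUsize;

use lru::LruCache;
use wide::f32x8;

/// Normalize u8 pixels [0, 255] to f32 [0, 1], eight lanes at a time.
fn normalize_simd(src: &[u8], dst: &mut Vec<f32>) {
    let inv = f32x8::splat(1.0 / 255.0);
    let mut chunks = src.chunks_exact(8);
    for chunk in &mut chunks {
        let mut lane = [0.0f32; 8];
        for (l, &b) in lane.iter_mut().zip(chunk) {
            *l = f32::from(b);
        }
        dst.extend_from_slice(&(f32x8::from(lane) * inv).to_array());
    }
    // Scalar tail for lengths not divisible by 8.
    dst.extend(chunks.remainder().iter().map(|&b| f32::from(b) / 255.0));
}

/// Source-x lookup tables keyed by (src_w, dst_w): repeated frames of the
/// same size skip recomputing resize sample coordinates.
struct LutCache {
    cache: LruCache<(u32, u32), Vec<u32>>,
}

impl LutCache {
    fn new() -> Self {
        Self { cache: LruCache::new(NonZeroUsize::new(16).unwrap()) }
    }

    fn x_lut(&mut self, src_w: u32, dst_w: u32) -> &[u32] {
        self.cache.get_or_insert((src_w, dst_w), || {
            (0..dst_w)
                .map(|x| ((x as f32 + 0.5) * src_w as f32 / dst_w as f32) as u32)
                .collect()
        })
    }
}

fn main() {
    let mut out = Vec::new();
    normalize_simd(&[0, 64, 128, 255, 16, 32, 8, 4, 2], &mut out);
    let mut luts = LutCache::new();
    println!("{} pixels, lut[0..4] = {:?}", out.len(), &luts.x_lut(1280, 640)[..4]);
}
```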

Batch Processing and Pipeline Improvements:

  • Implemented a pipelined, multi-threaded batch processing system using bounded channels between frame decoding and inference, improving throughput and responsiveness (see the sketch after this list).
  • Centralized batch management in the prediction pipeline for more efficient processing.
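
The decode-then-infer pipeline pattern looks roughly like this sketch built on std::sync::mpsc::sync_channel; Frame, decode_next, and infer_batch are placeholders, not the crate's API:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

type Frame = Vec<u8>;

fn decode_next() -> Option<Frame> {
    None // placeholder: read and decode the next video frame
}

fn infer_batch(batch: &[Frame]) {
    let _ = batch; // placeholder: run the ONNX session on the whole batch
}

fn main() {
    let batch_size = 4;
    // Bounded channel: the decoder blocks when inference falls behind,
    // capping memory use while keeping both threads busy.
    let (tx, rx) = sync_channel::<Frame>(batch_size * 2);

    let producer = thread::spawn(move || {
        while let Some(frame) = decode_next() {
            if tx.send(frame).is_err() {
                break; // consumer hung up
            }
        }
    });

    let mut batch = Vec::with_capacity(batch_size);
    for frame in rx {
        batch.push(frame);
        if batch.len() == batch_size {
            infer_batch(&batch);
            batch.clear();
        }
    }
    if !batch.is_empty() {
        infer_batch(&batch); // flush the final partial batch
    }
    producer.join().unwrap();
}
```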

Documentation and Example Updates:

  • Updated README.md with new CLI options, example commands, output samples, and a detailed breakdown of the codebase structure and dependencies.
  • Revised example output to reflect new defaults, improved speed, and updated versioning.
  • Added new features to the "Features" checklist and clarified in-progress items.

Codebase and Dependency Updates:

  • Bumped the crate version to 0.0.8 and added new dependencies (wide, lru) for preprocessing and caching (Cargo.toml sketch after this list).
  • Expanded and clarified module structure in documentation, highlighting new modules for batch processing, device management, annotation, I/O, and logging.
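
Taken together with the "fat" LTO change above, the relevant manifest additions look roughly like this; the crate versions are illustrative, not taken from the PR:

```toml
# Cargo.toml (sketch)
[dependencies]
wide = "0.7"   # SIMD vector types used by the preprocessing path
lru = "0.12"   # LRU cache for preprocessing LUTs

[profile.release]
lto = "fat"    # whole-program LTO: slower builds, faster binaries
```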

These changes collectively make the library faster, more flexible, and easier to use for a wider range of inference scenarios.



🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Ultralytics Inference 0.0.8 boosts performance and usability with SIMD-accelerated pre/postprocessing, rectangular + batch inference, improved CLI defaults, and smoother Linux/device/runtime behavior 🚀

📊 Key Changes

  • Major speedups in preprocessing: replaced the old letterbox pipeline with a fused, zero-copy resize+pad+normalize path using SIMD (wide) plus an LRU-cached LUT (lru) to reduce repeated work.
  • 🧠 Faster detection postprocessing (NMS): moved to zero-copy output slicing and introduced a SIMD-accelerated per-class NMS path to cut allocations and speed up hot loops.
  • 📐 Rectangular inference (--rect) added + enabled by default: dynamically adjusts input shapes to reduce padding (when the ONNX model supports dynamic shapes) for better throughput/latency.
  • 📦 Batch inference support (--batch): batch size is now configurable, and the CLI pipeline was updated accordingly.
  • 🧵 Pipelined decoding + inference: decoding runs in a producer thread feeding a bounded channel, improving throughput for video/streams.
  • 🛠️ CLI & defaults updated to match Ultralytics Python behavior:
    • Default --iou changed to 0.7 (was 0.45) 🎛️
    • Default --max-det is 300 (and builder/docs updated) 🧾
    • Warns when using the default model (like Python) ⚠️
    • --save and --verbose now show defaults more explicitly
  • 🐧 Linux packaging improvement: sets RPATH to $ORIGIN so binaries can find libonnxruntime*.so placed beside the executable, with no need to set LD_LIBRARY_PATH (config sketch after this list) 📦
  • 🖥️ Visualization improvements: viewer now uses original image dimensions and avoids unnecessary resizing for display.
  • 🔧 Release build tuning: switched to fat LTO for potentially better runtime performance (at the cost of longer compile times) 🏗️
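
For the Linux RPATH item above, a .cargo/config.toml along these lines produces the described behavior; the exact target key used in the PR may differ:

```toml
# .cargo/config.toml (sketch): embed $ORIGIN as the RPATH so the loader
# searches the executable's own directory for libonnxruntime*.so,
# making LD_LIBRARY_PATH unnecessary.
[target.'cfg(target_os = "linux")']
rustflags = ["-C", "link-arg=-Wl,-rpath,$ORIGIN"]
```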

🎯 Purpose & Impact

  • 🚀 Much faster end-to-end inference, especially on CPU, due to fewer allocations, SIMD acceleration, and pipelined processing (notably improved preprocess/postprocess times).
  • 📐 Better efficiency on non-square images with rectangular inference (less padding → less compute), while safely falling back to square padding for mixed-size batches (see the sketch after this list).
  • 📦 Easier deployment on Linux: colocating ONNX Runtime shared libraries with the binary “just works” thanks to $ORIGIN RPATH.
  • 🎛️ More Ultralytics-consistent results and UX: updated defaults (iou=0.7, max_det=300, rect=true) align behavior with Ultralytics Python, reducing surprises when switching environments.
  • ⚠️ Potential behavior change: higher default IoU threshold and higher max detections may alter output counts compared to prior versions (but improves consistency with Ultralytics defaults).
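
As a worked example of the rectangular-inference saving: assuming the usual YOLO stride of 32, this hypothetical helper computes the rect input shape for a 1280x720 frame:

```rust
/// Scale the long side to `imgsz`, then round each side up to the stride.
fn rect_shape(w: u32, h: u32, imgsz: u32, stride: u32) -> (u32, u32) {
    let scale = imgsz as f32 / w.max(h) as f32;
    let round_up = |v: f32| (v / stride as f32).ceil() as u32 * stride;
    (round_up(w as f32 * scale), round_up(h as f32 * scale))
}

fn main() {
    let (w, h) = rect_shape(1280, 720, 640, 32);
    // 640x384 instead of a square 640x640 letterbox: 40% fewer pixels.
    println!("rect input: {w}x{h}");
}
```
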
📋 Skipped 1 file (lock files, generated, images, etc.)
  • Cargo.lock

Signed-off-by: Onuralp SEZER <onuralp@ultralytics.com>
…ments

- Added `#[allow(clippy::struct_excessive_bools)]` to `InferenceConfig` to suppress excessive bool warnings.
- Removed unnecessary logging initialization code in `init_logging`.
- Suppressed unnecessary wraps in the `main` function.
- Enhanced `YOLOModel` with additional Clippy lints for better code quality.
- Optimized image processing in `YOLOModel` by reducing unnecessary allocations and improving data handling.
- Refactored post-processing to use zero-copy techniques and SIMD for faster detection extraction (see the NMS sketch after this list).
- Introduced a new zero-copy preprocessing function to minimize memory allocations during image processing.
- Improved letterbox resizing and bilinear interpolation with SIMD optimizations and LRU caching for X coordinate lookups.
- Cleaned up deprecated code and comments for better readability and maintainability.
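
For the post-processing item above, a simplified scalar per-class NMS looks like the sketch below; the PR's version additionally uses SIMD and zero-copy output slicing, and the types here are illustrative:

```rust
/// Boxes are (x1, y1, x2, y2); detections are assumed pre-sorted by score.
#[derive(Clone, Copy)]
struct Det {
    bbox: [f32; 4],
    score: f32,
    class: usize,
}

fn iou(a: &[f32; 4], b: &[f32; 4]) -> f32 {
    let ix = (a[2].min(b[2]) - a[0].max(b[0])).max(0.0);
    let iy = (a[3].min(b[3]) - a[1].max(b[1])).max(0.0);
    let inter = ix * iy;
    let area_a = (a[2] - a[0]) * (a[3] - a[1]);
    let area_b = (b[2] - b[0]) * (b[3] - b[1]);
    inter / (area_a + area_b - inter)
}

/// Greedy NMS applied independently per class: a box only suppresses
/// other boxes of the same class.
fn nms_per_class(dets: &[Det], iou_thres: f32, max_det: usize) -> Vec<Det> {
    let mut keep: Vec<Det> = Vec::new();
    for d in dets {
        let suppressed = keep
            .iter()
            .any(|k| k.class == d.class && iou(&k.bbox, &d.bbox) > iou_thres);
        if !suppressed {
            keep.push(*d);
            if keep.len() == max_det {
                break;
            }
        }
    }
    keep
}

fn main() {
    let dets = vec![
        Det { bbox: [0.0, 0.0, 10.0, 10.0], score: 0.9, class: 0 },
        Det { bbox: [1.0, 1.0, 10.0, 10.0], score: 0.8, class: 0 }, // overlaps, same class
        Det { bbox: [1.0, 1.0, 10.0, 10.0], score: 0.8, class: 1 }, // overlaps, other class
    ];
    let kept = nms_per_class(&dets, 0.7, 300);
    println!("kept {} of {}", kept.len(), dets.len()); // kept 2 of 3
}
```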

Signed-off-by: Onuralp SEZER <onuralp@ultralytics.com>
@onuralpszr onuralpszr requested a review from picsalex January 10, 2026 15:05
Signed-off-by: Onuralp SEZER <onuralp@ultralytics.com>
…tests

Signed-off-by: Onuralp SEZER <onuralp@ultralytics.com>
@onuralpszr onuralpszr changed the title from "feat: ✨ Add rectangular inference support rect" to "feat: ✨ Add rectangular inference support rect and speed optimization for pre-processors and post-processors" Jan 10, 2026
codecov bot commented Jan 10, 2026

Codecov Report

❌ Patch coverage is 67.71218% with 175 lines in your changes missing coverage. Please review.

| File with missing lines | Patch % | Missing lines |
|---|---|---|
| src/postprocessing.rs | 61.70% | 90 ⚠️ |
| src/preprocessing.rs | 76.96% | 47 ⚠️ |
| src/model.rs | 65.11% | 15 ⚠️ |
| src/cli/predict.rs | 64.51% | 11 ⚠️ |
| src/source.rs | 0.00% | 6 ⚠️ |
| src/download.rs | 72.72% | 3 ⚠️ |
| src/main.rs | 0.00% | 3 ⚠️ |


onuralpszr and others added 6 commits January 10, 2026 19:35
…ze, adjust IoU threshold, and improve download image path handling

Signed-off-by: Onuralp SEZER <onuralp@ultralytics.com>
…rgo.toml

Signed-off-by: Onuralp SEZER <onuralp@ultralytics.com>
…onfig example

Signed-off-by: Onuralp SEZER <onuralp@ultralytics.com>
@onuralpszr onuralpszr merged commit 466710a into main Jan 19, 2026
9 checks passed
@onuralpszr onuralpszr deleted the feat/rect branch January 19, 2026 08:45
@UltralyticsAssistant

Merged! Huge thanks @onuralpszr for pushing Inference 0.0.8 forward with a seriously thoughtful set of performance + UX upgrades, and to @picsalex for the valuable contributions and collaboration.

“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.” — Antoine de Saint‑Exupéry

This PR embodies that idea: fewer copies, fewer allocations, smarter defaults, and smoother Linux deployment—resulting in faster end-to-end CPU inference, better throughput with rect + batch, and a CLI that feels more consistent with Ultralytics (including the iou=0.7 and max-det=300 alignment). The SIMD-accelerated pre/postprocessing and pipelined decoding are especially impactful—real, practical speed where it matters.

Appreciate the craftsmanship and attention to real-world usability—this is a big win for everyone building with Ultralytics Inference.
