
Conversation

Contributor

@mertalev mertalev commented Dec 22, 2025

Pull Request Template

Checklist

  • Confirmed that the `cargo run-checks` command has been executed.
  • Made sure the book is up to date with changes in this PR.

Changes

Adds an nms op to burn-vision for detection use cases. It matches the options of the ONNX NonMaxSuppression op so it will be easy to support in later PRs.
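For readers unfamiliar with the op: single-class greedy NMS with the ONNX NonMaxSuppression-style options (`iou_threshold`, `score_threshold`, `max_output_boxes_per_class`) can be sketched in plain Rust roughly as follows. This is an illustrative standalone sketch, not the burn-vision API; the `[x1, y1, x2, y2]` box layout and strict `>` threshold comparisons are assumptions.

```rust
/// Axis-aligned box stored as [x1, y1, x2, y2].
type BoxXyxy = [f32; 4];

/// Intersection-over-union of two boxes.
fn iou(a: &BoxXyxy, b: &BoxXyxy) -> f32 {
    let ix = (a[2].min(b[2]) - a[0].max(b[0])).max(0.0);
    let iy = (a[3].min(b[3]) - a[1].max(b[1])).max(0.0);
    let inter = ix * iy;
    let area_a = (a[2] - a[0]) * (a[3] - a[1]);
    let area_b = (b[2] - b[0]) * (b[3] - b[1]);
    let union = area_a + area_b - inter;
    if union <= 0.0 { 0.0 } else { inter / union }
}

/// Greedy NMS for a single class, mirroring the ONNX NonMaxSuppression
/// options. Returns indices of kept boxes, highest score first.
fn nms(
    boxes: &[BoxXyxy],
    scores: &[f32],
    iou_threshold: f32,
    score_threshold: f32,
    max_output_boxes: usize,
) -> Vec<usize> {
    // Candidate indices sorted by descending score, low scores dropped.
    let mut order: Vec<usize> = (0..boxes.len())
        .filter(|&i| scores[i] > score_threshold)
        .collect();
    order.sort_by(|&a, &b| scores[b].partial_cmp(&scores[a]).unwrap());

    let mut keep = Vec::new();
    for &i in &order {
        if keep.len() >= max_output_boxes {
            break;
        }
        // A box survives only if it does not overlap any already-kept
        // box by more than the IoU threshold.
        if keep.iter().all(|&j| iou(&boxes[i], &boxes[j]) <= iou_threshold) {
            keep.push(i);
        }
    }
    keep
}
```

The batched, multi-class ONNX op essentially runs this loop once per (batch, class) pair.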

Testing

I compared outputs from torchvision and this implementation and confirmed that they match with the same settings.

Note: I also wrote a CubeCL kernel for GPU acceleration, but try as I might, it was just slower than the CPU SIMD implementation. The data size is not that large for most applications, and part of the algorithm is inherently sequential. The best I could get was ~70% slower than CPU for 800 boxes, 16% slower for 3,200, and about the same for 12,800. I decided to omit it from the PR because it's just faster to do the work on CPU and transfer back. Maybe there are use cases or scenarios I'm not considering.
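To illustrate the inherently sequential part: the suppression pass carries a loop dependency, because whether box `i` survives depends on the keep/suppress decisions already made for every higher-scoring box. A minimal sketch of just that pass (assuming, for illustration, that boxes are pre-sorted by descending score and the pairwise IoU matrix is precomputed, which is the part that does parallelize well):

```rust
/// Given boxes sorted by descending score and a precomputed IoU matrix,
/// decide which boxes survive. Filling `iou` is embarrassingly parallel,
/// but this pass is not: the decision for box `j` reads the decision
/// already made for every earlier (higher-scoring) box `i`.
fn suppress_pass(iou: &[Vec<f32>], iou_threshold: f32) -> Vec<bool> {
    let n = iou.len();
    let mut kept = vec![true; n];
    for i in 0..n {
        if !kept[i] {
            // A suppressed box cannot suppress anything else.
            continue;
        }
        for j in (i + 1)..n {
            if iou[i][j] > iou_threshold {
                kept[j] = false; // suppressed by the kept box i
            }
        }
    }
    kept
}
```

Note that a box suppressed earlier in the loop must not suppress later boxes, which is exactly the dependency that frustrates a fully parallel formulation.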

cubecl impl

block-parallel

bitmask-parallel

two phase

optimize for cpu

formatting

simplify response

remove kernel

refactor

tensor op for ergonomics

linting, tests

codecov bot commented Dec 22, 2025

Codecov Report

❌ Patch coverage is 83.60000% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.05%. Comparing base (6551110) to head (a9f9b86).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
crates/burn-vision/src/backends/cpu/nms.rs 78.84% 33 Missing ⚠️
crates/burn-vision/src/ops/base.rs 56.25% 7 Missing ⚠️
crates/burn-vision/src/tensor.rs 83.33% 1 Missing ⚠️

❌ Your project check has failed because the head coverage (69.05%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4246      +/-   ##
==========================================
+ Coverage   69.03%   69.05%   +0.02%     
==========================================
  Files        1409     1411       +2     
  Lines      165879   166130     +251     
==========================================
+ Hits       114519   114729     +210     
- Misses      51360    51401      +41     

☔ View full report in Codecov by Sentry.
@AdrianEddy

For context:
I had a use case for GPU-based NMS, where I needed to run my entire pipeline on the GPU and avoid ANY transfers of data between CPU and GPU except the initial JPEG bytes and the final output. That pipeline ran entirely on the GPU, and I wrote the NMS in CUDA to achieve that; it was much faster than doing it on the CPU, mainly because it avoided any data transfers whatsoever.
Did you make sure the source and output data were already on the GPU for your benchmarks? I think the data transfer itself might be the bottleneck here, so when your source data is on the CPU, the CPU version is faster just because of that transfer.

I'm not saying it is, just that it may still be useful to have a GPU-only NMS, as I have used one myself.

I believe having both versions available is ideal here, so users can choose whichever fits their use case best.

@mertalev
Contributor Author

The data was already on the device under test at the start; no transfers were included. I didn't measure the time to transfer the NMS input to the CPU and back, though.

I can include the kernel for completeness. I only tested a specific scenario on Apple Silicon, so maybe it's worth it in other cases. I'm sure the kernel has room for optimization as well.


AdrianEddy commented Dec 25, 2025

One additional factor to consider here is that on Apple Silicon the data transfers are much faster because of unified memory. On Linux/CUDA, the transfer itself might have a much bigger overhead because the GPU is a separate device.

So in that case, even if the NMS algorithm itself is potentially slower on the GPU, it may still be better to run the slower algorithm on device and avoid the data transfer than to run a faster algorithm on the CPU and pay for the transfer.

@antimora antimora requested a review from wingertge December 27, 2025 17:58
