
Conversation

Contributor

@mertalev mertalev commented Dec 22, 2025

Pull Request Template

Checklist

  • Confirmed that the `cargo run-checks` command has been executed.
  • Made sure the book is up to date with changes in this PR.

Changes

Adds an nms op to burn-vision for detection use cases. It matches the options of the ONNX NonMaxSuppression op so it will be easy to support in later PRs.
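For readers unfamiliar with the op: single-class greedy NMS with the ONNX NonMaxSuppression-style options (`iou_threshold`, `score_threshold`, `max_output_boxes_per_class`) can be sketched in plain Rust roughly as follows. This is an illustrative standalone sketch, not the burn-vision API; the `[x1, y1, x2, y2]` box layout and strict `>` threshold comparisons are assumptions.

```rust
/// Axis-aligned box stored as [x1, y1, x2, y2].
type BoxXyxy = [f32; 4];

/// Intersection-over-union of two boxes.
fn iou(a: &BoxXyxy, b: &BoxXyxy) -> f32 {
    let ix = (a[2].min(b[2]) - a[0].max(b[0])).max(0.0);
    let iy = (a[3].min(b[3]) - a[1].max(b[1])).max(0.0);
    let inter = ix * iy;
    let area_a = (a[2] - a[0]) * (a[3] - a[1]);
    let area_b = (b[2] - b[0]) * (b[3] - b[1]);
    let union = area_a + area_b - inter;
    if union <= 0.0 { 0.0 } else { inter / union }
}

/// Greedy NMS for a single class, mirroring the ONNX NonMaxSuppression
/// options. Returns indices of kept boxes, highest score first.
fn nms(
    boxes: &[BoxXyxy],
    scores: &[f32],
    iou_threshold: f32,
    score_threshold: f32,
    max_output_boxes: usize,
) -> Vec<usize> {
    // Candidate indices sorted by descending score, low scores dropped.
    let mut order: Vec<usize> = (0..boxes.len())
        .filter(|&i| scores[i] > score_threshold)
        .collect();
    order.sort_by(|&a, &b| scores[b].partial_cmp(&scores[a]).unwrap());

    let mut keep = Vec::new();
    for &i in &order {
        if keep.len() >= max_output_boxes {
            break;
        }
        // A box survives only if it does not overlap any already-kept
        // box by more than the IoU threshold.
        if keep.iter().all(|&j| iou(&boxes[i], &boxes[j]) <= iou_threshold) {
            keep.push(i);
        }
    }
    keep
}
```

The batched, multi-class ONNX op essentially runs this loop once per (batch, class) pair.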

Testing

I compared outputs from torchvision and this implementation and confirmed that they match with the same settings.

Note: I also wrote a CubeCL kernel for GPU acceleration, but try as I might, it was just slower than the CPU SIMD implementation. The data size is not that large for most applications, and part of the algorithm is inherently sequential. The best I could get was ~70% slower than CPU for 800 boxes, 16% slower for 3,200, and about the same for 12,800. I decided to omit it from the PR because it's just faster to do the work on CPU and transfer back. Maybe there are use cases or scenarios I'm not considering.
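To illustrate the inherently sequential part: the suppression pass carries a loop dependency, because whether box `i` survives depends on the keep/suppress decisions already made for every higher-scoring box. A minimal sketch of just that pass (assuming, for illustration, that boxes are pre-sorted by descending score and the pairwise IoU matrix is precomputed, which is the part that does parallelize well):

```rust
/// Given boxes sorted by descending score and a precomputed IoU matrix,
/// decide which boxes survive. Filling `iou` is embarrassingly parallel,
/// but this pass is not: the decision for box `j` reads the decision
/// already made for every earlier (higher-scoring) box `i`.
fn suppress_pass(iou: &[Vec<f32>], iou_threshold: f32) -> Vec<bool> {
    let n = iou.len();
    let mut kept = vec![true; n];
    for i in 0..n {
        if !kept[i] {
            // A suppressed box cannot suppress anything else.
            continue;
        }
        for j in (i + 1)..n {
            if iou[i][j] > iou_threshold {
                kept[j] = false; // suppressed by the kept box i
            }
        }
    }
    kept
}
```

Note that a box suppressed earlier in the loop must not suppress later boxes, which is exactly the dependency that frustrates a fully parallel formulation.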

cubecl impl

block-parallel

bitmask-parallel

two phase

optimize for cpu

formatting

simplify response

remove kernel

refactor

tensor op for ergonomics

linting, tests

codecov bot commented Dec 22, 2025

Codecov Report

❌ Patch coverage is 83.60000% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.05%. Comparing base (6551110) to head (a9f9b86).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
crates/burn-vision/src/backends/cpu/nms.rs 78.84% 33 Missing ⚠️
crates/burn-vision/src/ops/base.rs 56.25% 7 Missing ⚠️
crates/burn-vision/src/tensor.rs 83.33% 1 Missing ⚠️

❌ Your project check has failed because the head coverage (69.05%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4246      +/-   ##
==========================================
+ Coverage   69.03%   69.05%   +0.02%     
==========================================
  Files        1409     1411       +2     
  Lines      165879   166130     +251     
==========================================
+ Hits       114519   114729     +210     
- Misses      51360    51401      +41     

☔ View full report in Codecov by Sentry.
@AdrianEddy

For context:
I had a use case for GPU-based NMS, where I needed to run my entire pipeline on the GPU and avoid ANY transfers of data between CPU and GPU except the initial JPEG bytes and the final output. That pipeline ran entirely on the GPU, and I wrote the NMS in CUDA to achieve that; it was much faster than doing it on the CPU, mainly because it avoided any data transfers whatsoever.
Did you make sure the source and output data were already on the GPU for your benchmarks? I think the data transfer itself might be the bottleneck here, so when your source data is on the CPU, the CPU version is faster just because of that transfer.

I'm not saying it is, just that it may still be useful to have a GPU-only NMS, as I have used one myself.

I believe having both versions available is ideal here, so users can choose whichever fits their use case best.

@mertalev
Contributor Author

The data was already on the device under test at the start; no transfers were included. I didn't measure the time to transfer the NMS input to the CPU and back, though.

I can include the kernel for completeness. I only tested a specific scenario on Apple Silicon, so maybe it's worth it in other cases. I'm sure the kernel has room for optimization as well.


AdrianEddy commented Dec 25, 2025

One additional factor to consider here is that on Apple Silicon the data transfers are much faster because of unified memory. On Linux/CUDA, the transfer itself might have a much bigger overhead because the GPU is a separate device.

So in that case, even if the NMS algorithm itself is potentially slower on the GPU, it may still be better to run the slower algorithm on device and avoid the data transfer than to run a faster algorithm on the CPU and pay for the transfer.

@antimora antimora requested a review from wingertge December 27, 2025 17:58
