feat: nms op #4246
Conversation
Commits:
- cubecl impl
- block-parallel
- bitmask-parallel two phase
- optimize for cpu
- formatting
- simplify response
- remove kernel
- refactor tensor op for ergonomics
- linting, tests
Codecov Report

❌ Your project check has failed because the head coverage (69.05%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

@@ Coverage Diff @@
##             main    #4246      +/-   ##
==========================================
+ Coverage   69.03%   69.05%   +0.02%
==========================================
  Files        1409     1411       +2
  Lines      165879   166130     +251
==========================================
+ Hits       114519   114729     +210
- Misses      51360    51401      +41
For context: I'm not saying it is, just saying it may still be useful to have the GPU-only NMS, as I have used that myself. I believe having both versions available is ideal here, so users can choose which one fits their use case best.
The data was on the device being tested at the start, no transfers included. I didn't test the time to transfer the NMS input to CPU and back, though. I can include the kernel for completeness. I only tested a specific scenario on Apple Silicon, so maybe it's worth it in other cases. I'm sure the kernel has room for optimization as well.
One additional factor to consider here is that on Apple Silicon the data transfers are much faster because of unified memory. On Linux/CUDA, the transfer itself might have a much bigger overhead because the GPU is a separate device. In that case, even if the NMS algorithm itself is potentially slower on GPU, it may still be beneficial to run the slower algorithm on device and avoid the data transfer, rather than running a faster algorithm on CPU but paying for the transfer.
Pull Request Template
Checklist
The cargo run-checks command has been executed.

Changes

Adds an nms op to burn-vision for detection use-cases. It matches the options of the NonMaxSuppression ONNX op to make it easy to support in later PRs.

Testing
I compared outputs from torchvision and this implementation and confirmed they match with the same settings.
Note: I also wrote a CubeCL kernel for GPU acceleration, but try as I might, it was just slower than the CPU SIMD implementation. The data size is not that large for most applications, and part of the algorithm is inherently sequential. The best I could get was ~70% slower than CPU for 800 boxes, 16% slower for 3200, and about the same for 12800. I decided to omit it from the PR because it's just faster to do it on CPU and transfer back. Maybe there are use-cases or scenarios I'm not considering.
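For readers unfamiliar with why part of the algorithm is inherently sequential: greedy NMS must decide whether to keep each box based on the set of boxes already kept, in descending score order. A minimal standalone sketch of that loop is below; the names (`BoxF`, `nms`, `iou`) and the corner-format box layout are illustrative assumptions for this comment, not the burn-vision API or the SIMD implementation in this PR.

```rust
// Hypothetical corner-format box; burn-vision's actual types differ.
#[derive(Clone, Copy)]
struct BoxF { x1: f32, y1: f32, x2: f32, y2: f32 }

// Intersection-over-union of two boxes.
fn iou(a: BoxF, b: BoxF) -> f32 {
    let iw = (a.x2.min(b.x2) - a.x1.max(b.x1)).max(0.0);
    let ih = (a.y2.min(b.y2) - a.y1.max(b.y1)).max(0.0);
    let inter = iw * ih;
    let area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
    let area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
    inter / (area_a + area_b - inter)
}

/// Greedy NMS: visit boxes in descending score order; keep a box only if
/// its IoU with every already-kept box is at or below `iou_threshold`.
/// The dependence of each decision on earlier decisions is the sequential
/// part that is hard to parallelize on GPU. Returns kept indices.
fn nms(boxes: &[BoxF], scores: &[f32], iou_threshold: f32) -> Vec<usize> {
    let mut order: Vec<usize> = (0..boxes.len()).collect();
    order.sort_by(|&i, &j| scores[j].partial_cmp(&scores[i]).unwrap());
    let mut keep: Vec<usize> = Vec::new();
    for &i in &order {
        if keep.iter().all(|&k| iou(boxes[i], boxes[k]) <= iou_threshold) {
            keep.push(i);
        }
    }
    keep
}
```

The IoU checks against kept boxes can be vectorized (which is what the SIMD path exploits), but the keep/suppress decisions themselves form a chain, which matches the observation above that the GPU kernel struggles to beat the CPU at typical detection sizes.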