Hello Halide team,

I have been exploring the possibility of porting my old CPU SIFT implementation to pure Halide code on a GPU. In my current experiment, as far as keypoint localization is concerned, I was able to get something like a 20x speedup on my MacBook Air (and I am very happy about that).
There is one improvement I would like to make: counting the local extrema on the GPU using Halide.
This amounts to writing the equivalent of numpy.count_nonzero or std::count_if before I carry on with stream compaction on the GPU, which diverges a bit from the use cases that the docs illustrate.
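For concreteness, the CPU-side equivalent I am comparing against is just a linear scan over the packed extrema map (the function name and the flat-buffer layout here are made up for illustration):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Count the nonzero entries of a dense extrema map stored as a flat
// std::int8_t buffer (values in {-1, 0, +1}), i.e. the CPU analogue of
// numpy.count_nonzero.
std::int64_t count_extrema(const std::vector<std::int8_t>& f)
{
  return std::count_if(f.begin(), f.end(),
                       [](std::int8_t v) { return v != 0; });
}
```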
I wrote a simple generator like this:
```cpp
// Inside `void generate()`
//
// `f` is a 4D buffer of std::int8_t that stores a batch of `n` dense maps
// of local scale-space extrema f(x, y, s).
//
// A position (x, y, s) is marked as:
// - `+1` if it is a local scale-space maximum
// - `-1` if it is a local scale-space minimum
// - `0` otherwise.
const auto& w = f.dim(0).extent();
const auto& h = f.dim(1).extent();
const auto& c = f.dim(2).extent();
const auto& n = f.dim(3).extent();

auto r = RDom(0, w, 0, h, 0, c, 0, n);
auto nonzero = select(f(r.x, r.y, r.z, r.w) != 0,  //
                      std::int32_t{1},             //
                      std::int32_t{0});
out() = sum(nonzero);
```
This naive generator was actually the fastest code I could get. I tried using rfactor, following the tutorial, but it made things slower; I may be using it wrong.
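For reference, the associative-reduction pattern that rfactor is meant to expose can be sketched in plain C++: split the reduction domain into strips, count each strip independently (this is the loop that could run in parallel), then sum the partial counts. This is only an illustration of the factoring idea, not Halide code, and all names are made up:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Two-stage count: partial counts per strip, then a final sum over the
// partials. This mirrors the structure rfactor produces for an
// associative reduction; the speedup comes from parallelizing or
// vectorizing the per-strip loop, which is serial in this sketch.
std::int64_t count_extrema_two_stage(const std::vector<std::int8_t>& f,
                                     std::size_t num_strips)
{
  std::vector<std::int64_t> partial(num_strips, 0);
  const std::size_t strip = (f.size() + num_strips - 1) / num_strips;
  for (std::size_t s = 0; s < num_strips; ++s)  // parallelizable loop
  {
    const std::size_t lo = s * strip;
    const std::size_t hi = std::min(f.size(), lo + strip);
    for (std::size_t i = lo; i < hi; ++i)
      partial[s] += (f[i] != 0);
  }
  return std::accumulate(partial.begin(), partial.end(), std::int64_t{0});
}
```

If the intermediate stage is not actually scheduled in parallel (or the strips are too small), the extra pass over the partials only adds overhead, which may be why rfactor made things slower here.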
As a reference, on my MacBook Air, the generated Halide code takes 130 ms to count the extrema, while the STL function std::count_if takes about 2 ms.
I would love your input.