-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize p_box3x3_f32 function #149
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Daniel Sebastian <[email protected]>
What is your take on removing the first and last row processing? |
Not entirely sure I understand your question, but I extracted the first and last row bits mostly just to avoid evaluating a few 'if' statements that would have been evaluated anyway the vast majority of the time. I didn't check the performance impact of taking that route. |
You can remove many parts of you code as I think only "Process (rows - 4) middle rows" part is needed. (If you don't see why, try to compare the output array with standard implementation.) Can you explain your performance statistics ? It seems the inner block is executed (almost) as many times as the previous implementation, so I think it is even slower ! |
I tried naively removing the 4 extracted loops, and it crashes for me, but even if it didn't I'm pretty sure that making that change without introducing some 'if' statements to the main loop would not be correct; for a 10 column wide image, the first calculations in the main loop would try to store a value to pr[-8]. As for performance, I can understand why you might think it would be slower, but it improves speed mostly by reducing overall memory accesses. For a 10x10 input image, the first iteration of the naive implementation reads values from the input array at positions 0, 1, 2, 10, 11, 12, 20, 21, and 22. The second iteration will read values from positions 1, 2, 3, 11, 12, 13, 21, 22, and 23; 6 values have been re-accessed. In contrast, my implementation does not re-access any input values, and it only requires 3 stores to each output position (compared to 1 for the naive implementation), so overall, my implementation makes fewer memory accesses. |
In case memory access time is much slower then all those operations you introduced in that block, your algorithm works faster. :) Note: I think the number of memory access is higher than 3 as reading values from variables (j, sum, cols in the inner block) are also considered memory access. |
Good point about those extra accesses; I guess I was just thinking about the image arrays. And as a sample for the timings of the naive implementation vs. mine; I did 4 calls on the same 5000x10000 test image, and the naive runs take a total time of 8.4s, and mine takes 2.9s, so those array accesses must not come cheap. ;) |
I think I owe you guys an apology. I did a quick re-implementation, with the extracted loops reintegrated in the form of several scattered if statements; the result was ~10% performance increase (2.76s vs. 2.52s using the same test I mentioned in the last comment). It seems a bit less readable to me, but I'll add a few comments and get that committed ASAP. Sorry about that, guys. =) Edit: On further inspection (and after a bit of bug squashing), switching Qt Creator from Debug to Release builds causes the performance difference between the two versions to disappear (<1% difference). At this point I'm open to contributing whichever version you guys would prefer; just let me know if I should leave it as is, or submit the condensed version. (And if I should submit the new version, do I do that by adding another commit to this pull request, creating a new one, or something else? Sorry; I'm a bit new to GitHub.) |
The new version is mostly the same, with the four outer loops (which handled the first and last rows) now integrated into the center loop, using 'if' statements rather than separate loops to get the desired behavioral changes for the outer rows. Signed-off-by: Daniel Sebastian <[email protected]>
I went ahead and added the new version. It is a bit smaller (helps with that code size minimization requirement), and it varies from indistinguishable performance to 10% better performance, so it's at least as good (if not better than) the other version anyway. |
The current box filter function is replaced with one that takes just over 50% of original running time, and the performance is ~150% that of the OpenCV equivalent (~2/3 of the OpenCV running time).