hi all -

I've been reviewing the various Halide tutorials that cover the .tile() scheduling directive. They say it promotes "memory locality", but in what way? As an example, I'll borrow the blur3x3 pipeline from the tutorials.

In it, the result Func has the following schedule:

result.compute_root().tile(x, y, xi, yi, 512, 20).vectorize(xi, 32).parallel(y)

So I follow the mechanics of this schedule: the output is computed in 512x20 tiles, the inner x loop of each tile is vectorized in chunks of 32, and the rows of tiles are processed in parallel. But how does memory locality benefit from this scheme? You're still jumping around memory locations within each tile, and potentially invalidating cache lines, whenever you change rows.

It would seem to me that this would be better (vectorizing x directly, since xi only exists once you've tiled or split):

result.compute_root().vectorize(x, 32).parallel(y)

The tutorials somewhat shed light on this topic, but not enough to quell my confusion about what's going on in the processor, or why tiling is the better scheduling approach.

Thanks again, Charles.
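P.S. For concreteness, here's a minimal, self-contained sketch of the pipeline I'm asking about. The Func definition and the float element type are my own reconstruction, not necessarily the tutorial's exact code:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Hypothetical single-channel float input; the tutorial's actual
    // element type may differ.
    ImageParam input(Float(32), 2);

    Var x("x"), y("y"), xi("xi"), yi("yi");
    Func result("result");

    // A 3x3 box blur: every output reads a 3x3 window of the input,
    // so each interior input value is touched 9 times.
    result(x, y) =
        (input(x - 1, y - 1) + input(x, y - 1) + input(x + 1, y - 1) +
         input(x - 1, y)     + input(x, y)     + input(x + 1, y)     +
         input(x - 1, y + 1) + input(x, y + 1) + input(x + 1, y + 1)) / 9.0f;

    // The schedule in question: 512x20 tiles, the inner x loop
    // vectorized in chunks of 32, rows of tiles run in parallel
    // (after tile(), y names the outer tile-row loop).
    result.compute_root()
          .tile(x, y, xi, yi, 512, 20)
          .vectorize(xi, 32)
          .parallel(y);

    result.compile_jit();
    return 0;
}
```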
-
Replies: 1 comment

Consider what's happening to the input buffer. It's a 3x3 stencil, so a 512x20 tile of output reads a 514x22 region of the input. That region fits in cache, and most of those loaded values get used 9 times while they're still resident. If instead we processed one scanline at a time in parallel (so adjacent scanlines may run on different cores), each loaded value would only be reused 3 times before being evicted from L1.
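To put rough numbers on that (assuming 4-byte samples; the actual element type may differ): a 512x20 tile computes 512 × 20 = 10,240 outputs, each reading 9 inputs, so 92,160 loads land on only 514 × 22 = 11,308 distinct input values, a bit over 8 uses per value on average (interior values get the full 9). That region is about 45 KB, small enough to stay resident in a typical L2 for the lifetime of the tile, and for 8-bit pixels (~11 KB) it fits in L1. One 512-wide scanline, by contrast, issues 512 × 9 = 4,608 loads over 514 × 3 = 1,542 distinct values, roughly 3 uses each, and with neighboring scanlines scheduled on different cores the vertical reuse between rows is lost entirely.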