Scalable vector support #5232
Replies: 2 comments
-
This is very interesting because I was having a hard time imagining how we could support this. What do the schedules and loop extents look like when using this? Do you vectorize by a factor that includes the vscale term? Do the loop extents divide by a vscale term?
-
Exactly right on both fronts. So if we have an additive brighten of a simple form, vectorized by a factor that includes the vscale term, this then generates Halide IR for the inner loop in which both the ramp widths and the loop extent are expressed in terms of vscale.
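A minimal sketch of the shape of such a pipeline (the function names, the constant, and the schedule here are illustrative, and the trailing comment gives the rough shape of the lowered IR rather than an exact printout):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Input image and a saturating additive brighten: widen to 16 bits,
    // add, clamp to 255, narrow back to 8 bits.
    ImageParam input(UInt(8), 2, "input");
    Func brighten("brighten");
    Var x("x"), y("y");
    brighten(x, y) = cast<uint8_t>(min(cast<uint16_t>(input(x, y)) + 10, 255));

    // Vectorize the innermost dimension. In the stock compiler this means
    // exactly 16 lanes; in a scalable prototype the same schedule would be
    // lowered to "vscale x 16" lanes instead.
    brighten.vectorize(x, 16);

    // Emitting LLVM assembly makes it easy to inspect the vscale-style IR.
    // (In practice the target would need the appropriate SVE feature set.)
    brighten.compile_to_llvm_assembly("brighten.ll", {input}, "brighten");
    return 0;
}

// Very roughly, the inner loop then comes out as Halide IR along these lines
// (illustrative pseudo-IR, with the scalable lane count written as vscale*16):
//
//   for (brighten.s0.x.x, 0, extent/(vscale*16)) {
//     brighten[ramp(x*(vscale*16), 1, vscale*16)] =
//       uint8(min(uint16(input[ramp(x*(vscale*16), 1, vscale*16)]) + 10, 255))
//   }
```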
-
Over the past couple of months, I've spent some time looking at and roughly prototyping support for scalable vectors in Halide. This issue serves to report back to the community on what I've found so far, and hopefully prompt a wider discussion around long-term plans for scalable vector support.
At the highest level, what I've essentially been working on is a local fork of the compiler from 60dac030, with a couple of key changes to prove out how some future version of Halide might support vector-length agnostic code generation for Arm's SVE. Particularly, I've been looking to generate `vscale`-style LLVM IR from high-level Halide source code. To get this working, I've broadly made changes as follows:

- Extending `halide_type_t` such that the `lanes` field takes on a new type, with one bit reserved for a new `scalable` flag. With this flag set, the number of lanes in the type is specified to be some runtime-determined positive integer multiple of the provided value rather than that value exactly, e.g. `vscale x 4` rather than exactly `4`. (A rough sketch of this and the next change follows the list.)
- Adding a `VScale` expression node to Halide's IR, capable of representing the runtime vector length multiplied by some provided integer constant. This is necessary, for instance, to accurately represent scalable loop extents. While I initially tried building this node out purely as a constant, i.e. to represent the number of lanes we have in a machine vector of some type without any further manipulation, I found that the version with an integer multiplication folded into the node was significantly more convenient. Particularly, because we often expect constants in the compiler to have been folded down into a single node, having to peephole optimise over a `Mul` node for `vector_length * n` in a bunch of places that expect a single-node constant is a real pain and doesn't play nicely with existing rewrite-based optimisation rules. (After all, we can't fold the expression `vector_length * n` down into anything simpler at compile time.)
- Updating the `Broadcast` and `Ramp` nodes such that their `lanes` members can represent a scalable number of lanes, tweaking optimisation passes and constant folding behaviour, adding code generation logic for these constructs, fixing up a few places in which the compiler assumes the number of lanes will be known exactly at compile time, etc. I also added some peephole optimisations to generate some SVE intrinsics like `SQADD`, just to see how that might work.
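To make the first two changes above a little more concrete, here is a rough sketch of the shape they might take. The names and layout are my own illustration (simplified stand-ins for the real definitions in Halide), not the prototype's actual code:

```cpp
#include <cstdint>

// (1) The lanes field of halide_type_t gains a "scalable" flag: when set, the
// type describes vscale x count lanes rather than exactly count lanes.
struct halide_scalable_lanes_t {
    uint16_t count : 15;    // the lane multiple, i.e. the N in "vscale x N"
    uint16_t scalable : 1;  // 0 = fixed-length vector, 1 = scalable vector
};

struct halide_type_t {
    uint8_t code;                   // halide_type_code_t in the real header
    uint8_t bits;                   // bits per lane
    halide_scalable_lanes_t lanes;  // previously a plain 16-bit lane count
};

// (2) A VScale IR node representing the runtime vector length multiplied by an
// integer constant. Folding the multiplication into the node keeps scalable
// extents as a single constant-like node instead of a Mul that the simplifier
// can't fold away. (In the real IR this would be an ExprNode like Ramp or
// Broadcast; it's shown bare here for brevity.)
struct VScale {
    int multiplier;  // compile-time factor applied to the runtime vscale
};
```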
All in all, I spent probably two or three weeks looking at this in total, and managed to produce a very basic, cobbled-together prototype able to generate scalable code for SVE. To be clear: I don't claim that my approach here is the 'best' way to go about this, nor that it's what the community should do going forward, but just that it's something that worked out for me in trying to prove out what might be possible here. Testing this out in simulation on some fairly simple kernels like an additive brighten, box blur, and Harris corner detector, I found that I was able to generate working SVE code that performed reasonably well. Note that this is without proper support for SVE's predication features, which would probably make up a completely separate unit of work if we wanted to support them (as with mask registers in other architectures). So, for a saturating additive brighten kernel vectorized over `vscale x 16` 8-bit integers, for instance, I'm able to produce working vector-length agnostic assembly.

To emphasise, this is very much hacked together – my prototype is not by any means in a complete state, and was written with relatively little care for code quality. Heck, if you try to compile programs with behaviours I haven't tried out yet, there's a good chance you'll hit an assertion and crash, so it certainly isn't suitable for upstreaming at the moment. There's a lot more work to be done here. That being said, I think what I have at the moment is at least somewhat useful for thinking about what SVE support in Halide might look like going forwards.
With respect to the roadblocks I see in getting this kind of support stable and upstreamed, I think the largest I've encountered so far is vector shuffles. Fundamentally, shuffles are represented as fixed-length vectors of indices in both Halide and LLVM today, and for vectors whose length isn't known at compile time this is a real problem. If you're trying to interleave two scalable vectors, for instance, should your index vector be `[0, 4, 1, 5, 2, 6, 3, 7]` or `[0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]`? Well, it depends on the runtime length of the vectors – you want the former for vectors with four lanes and the latter for vectors with eight. Practically, we probably just want to side-step this issue and generate the right intrinsics for an interleave (see the sketch below), but it's not a trivial problem to solve.

Thankfully, most other issues I encountered weren't nearly as serious, and seemed to be fairly easily resolved through some minor compiler changes. While I expect there are additional complications with certain scheduling primitives or optimisations on scalable vectors that I haven't run into in the kernels I've examined thus far, in the short term one could simply imagine disabling these for scalable vectors while the longer-term solution is worked out. After all, having any support at all for scalable vectors is likely to be useful for some use cases, and if I can already compile kernels like Harris with something I've hacked together in a few weeks, then I think there's real value even in a partial solution here.
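For what it's worth, a length-agnostic interleave does exist at the instruction level – SVE's ZIP1/ZIP2 interleave the low and high halves of two vectors without any index vector. A small sketch via the ACLE intrinsics, purely to illustrate that instruction-level escape hatch (independent of what the prototype actually emits):

```cpp
#include <arm_sve.h>
// (Requires compiling with SVE enabled, e.g. -march=armv8-a+sve.)

// Low half of the interleave of a and b: a0, b0, a1, b1, ...
svuint8_t interleave_lo_u8(svuint8_t a, svuint8_t b) {
    return svzip1_u8(a, b);
}

// High half of the interleave, continuing from the midpoints of a and b.
svuint8_t interleave_hi_u8(svuint8_t a, svuint8_t b) {
    return svzip2_u8(a, b);
}
```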
Of course, while my focus here has been on SVE, I expect that this kind of functionality will also be useful in targeting other scalable vector architectures. As such, I'm hoping that scalable vectors are something that we might actually see supported in upstream Halide some time in the future. Is this something that's been discussed much before within the community? And do current contributors have any thoughts on the direction that Halide should take in this area going forwards? Happy to have a conversation about more of the details of my investigation if that's helpful.