Scalable vector support #5232
Replies: 2 comments
-
This is very interesting because I was having a hard time imagining how we could support this. What do the schedules and loop extents look like when using this? Do you vectorize by a factor that includes the vscale term? Do the loop extents divide by a vscale term?
-
Exactly right on both fronts. So if we have an additive brighten of a simple form, vectorized by a factor that includes the vscale term, this then generates Halide IR for the inner loop in which both the ramp widths and the loop extent are expressed in terms of vscale.
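A minimal sketch of the shape of such a pipeline (the function names, the constant, and the schedule here are illustrative, and the trailing comment gives the rough shape of the lowered IR rather than an exact printout):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Input image and a saturating additive brighten: widen to 16 bits,
    // add, clamp to 255, narrow back to 8 bits.
    ImageParam input(UInt(8), 2, "input");
    Func brighten("brighten");
    Var x("x"), y("y");
    brighten(x, y) = cast<uint8_t>(min(cast<uint16_t>(input(x, y)) + 10, 255));

    // Vectorize the innermost dimension. In the stock compiler this means
    // exactly 16 lanes; in a scalable prototype the same schedule would be
    // lowered to "vscale x 16" lanes instead.
    brighten.vectorize(x, 16);

    // Emitting LLVM assembly makes it easy to inspect the vscale-style IR.
    // (In practice the target would need the appropriate SVE feature set.)
    brighten.compile_to_llvm_assembly("brighten.ll", {input}, "brighten");
    return 0;
}

// Very roughly, the inner loop then comes out as Halide IR along these lines
// (illustrative pseudo-IR, with the scalable lane count written as vscale*16):
//
//   for (brighten.s0.x.x, 0, extent/(vscale*16)) {
//     brighten[ramp(x*(vscale*16), 1, vscale*16)] =
//       uint8(min(uint16(input[ramp(x*(vscale*16), 1, vscale*16)]) + 10, 255))
//   }
```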
-
Over the past couple of months, I've spent some time looking at and roughly prototyping support for scalable vectors in Halide. This issue serves to report back to the community on what I've found so far, and hopefully prompt a wider discussion around long-term plans for scalable vector support.
At the highest level, what I've essentially been working on is a local fork of the compiler from 60dac030, with a couple of key changes to prove out how some future version of Halide might support vector-length agnostic code generation for Arm's SVE. Particularly, I've been looking to generate `vscale`-style LLVM IR from high-level Halide source code. To get this working, I've broadly made changes as follows:

- Extending `halide_type_t` such that the `lanes` field takes on a new type, with one bit reserved for a new `scalable` flag. With this flag set, the number of lanes in the type is specified to be some runtime-determined positive integer multiple of the provided value rather than that value exactly, e.g. `vscale x 4` rather than exactly `4`. (A rough sketch of this and the next change follows the list.)
- Adding a `VScale` expression node to Halide's IR, capable of representing the runtime vector length multiplied by some provided integer constant. This is necessary, for instance, to accurately represent scalable loop extents. While I initially tried building this node out purely as a constant, i.e. to represent the number of lanes we have in a machine vector of some type without any further manipulation, I found that the version with an integer multiplication folded into the node was significantly more convenient. Particularly, because we often expect constants in the compiler to have been folded down into a single node, having to peephole optimise over a `Mul` node for `vector_length * n` in a bunch of places that expect a single-node constant is a real pain and doesn't play nicely with existing rewrite-based optimisation rules. (After all, we can't fold the expression `vector_length * n` down into anything simpler at compile time.)
- Updating the `Broadcast` and `Ramp` nodes such that their `lanes` members can represent a scalable number of lanes, tweaking optimisation passes and constant folding behaviour, adding code generation logic for these constructs, fixing up a few places in which the compiler assumes the number of lanes will be known exactly at compile time, etc. I also added some peephole optimisations to generate some SVE intrinsics like `SQADD`, just to see how that might work.
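To make the first two changes above a little more concrete, here is a rough sketch of the shape they might take. The names and layout are my own illustration (simplified stand-ins for the real definitions in Halide), not the prototype's actual code:

```cpp
#include <cstdint>

// (1) The lanes field of halide_type_t gains a "scalable" flag: when set, the
// type describes vscale x count lanes rather than exactly count lanes.
struct halide_scalable_lanes_t {
    uint16_t count : 15;    // the lane multiple, i.e. the N in "vscale x N"
    uint16_t scalable : 1;  // 0 = fixed-length vector, 1 = scalable vector
};

struct halide_type_t {
    uint8_t code;                   // halide_type_code_t in the real header
    uint8_t bits;                   // bits per lane
    halide_scalable_lanes_t lanes;  // previously a plain 16-bit lane count
};

// (2) A VScale IR node representing the runtime vector length multiplied by an
// integer constant. Folding the multiplication into the node keeps scalable
// extents as a single constant-like node instead of a Mul that the simplifier
// can't fold away. (In the real IR this would be an ExprNode like Ramp or
// Broadcast; it's shown bare here for brevity.)
struct VScale {
    int multiplier;  // compile-time factor applied to the runtime vscale
};
```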
All in all, I spent probably two or three weeks looking at this in total, and managed to produce a very basic, cobbled-together prototype able to generate scalable code for SVE. To be clear: I don't claim that my approach here is the 'best' way to go about this, nor that it's what the community should do going forward, but just that it's something that worked out for me in trying to prove out what might be possible here. Testing this out in simulation on some fairly simple kernels like an additive brighten, box blur, and Harris corner detector, I found that I was able to generate working SVE code that performed reasonably well. Note that this is without proper support for SVE's predication features, which would probably make up a completely separate unit of work if we wanted to support them (as with mask registers in other architectures). So, for a saturating additive brighten kernel vectorized over `vscale x 16` 8-bit integers, for instance, I'm able to produce working vector-length agnostic assembly.

To emphasise, this is very much hacked together – my prototype is not by any means in a complete state, and was written with relatively little care for code quality. Heck, if you try to compile programs with behaviours I haven't tried out yet, there's a good chance you'll hit an assertion and crash, so it certainly isn't suitable for upstreaming at the moment. There's a lot more work to be done here. That being said, I think what I have at the moment is at least somewhat useful for thinking about what SVE support in Halide might look like going forwards.
With respect to the roadblocks I see in getting this kind of support stable and upstreamed, I think the largest I've encountered so far is vector shuffles. Fundamentally, shuffles are represented as fixed-length vectors of indices in both Halide and LLVM today, and for vectors whose length isn't known at compile time this is a real problem. If you're trying to interleave two scalable vectors, for instance, should your index vector be `[0, 4, 1, 5, 2, 6, 3, 7]` or `[0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]`? Well, it depends on the runtime length of the vectors – you want the former for vectors with four lanes and the latter for vectors with eight. Practically, we probably just want to side-step this issue and generate the right intrinsics for an interleave (see the sketch below), but it's not a trivial problem to solve.

Thankfully, most other issues I encountered weren't nearly as serious, and seemed to be fairly easily resolved through some minor compiler changes. While I expect there are additional complications with certain scheduling primitives or optimisations on scalable vectors that I haven't run into in the kernels I've examined thus far, in the short term one could simply imagine disabling these for scalable vectors while the longer-term solution is worked out. After all, having any support at all for scalable vectors is likely to be useful for some use cases, and if I can already compile kernels like Harris with something I've hacked together in a few weeks, then I think there's real value even in a partial solution here.
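For what it's worth, a length-agnostic interleave does exist at the instruction level – SVE's ZIP1/ZIP2 interleave the low and high halves of two vectors without any index vector. A small sketch via the ACLE intrinsics, purely to illustrate that instruction-level escape hatch (independent of what the prototype actually emits):

```cpp
#include <arm_sve.h>
// (Requires compiling with SVE enabled, e.g. -march=armv8-a+sve.)

// Low half of the interleave of a and b: a0, b0, a1, b1, ...
svuint8_t interleave_lo_u8(svuint8_t a, svuint8_t b) {
    return svzip1_u8(a, b);
}

// High half of the interleave, continuing from the midpoints of a and b.
svuint8_t interleave_hi_u8(svuint8_t a, svuint8_t b) {
    return svzip2_u8(a, b);
}
```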
Of course, while my focus here has been on SVE, I expect that this kind of functionality will also be useful in targeting other scalable vector architectures. As such, I'm hoping that scalable vectors are something that we might actually see supported in upstream Halide some time in the future. Is this something that's been discussed much before within the community? And do current contributors have any thoughts on the direction that Halide should take in this area going forwards? Happy to have a conversation about more of the details of my investigation if that's helpful.