Skip to content

Commit

Permalink
Merge pull request #130 from frasercrmck/vecz-entry-points
Browse files Browse the repository at this point in the history
[compiler] Preserve 'entry point' on vectorized kernels
  • Loading branch information
frasercrmck authored Sep 14, 2023
2 parents 622ee7f + 638632c commit b7bb746
Show file tree
Hide file tree
Showing 12 changed files with 297 additions and 86 deletions.
38 changes: 19 additions & 19 deletions doc/modules/compiler/utils.rst
Original file line number Diff line number Diff line change
Expand Up @@ -315,12 +315,12 @@ Attributes>` function attributes for the outermost loops. The logic for the
dimension unmarshalling lies in
``modules/compiler/utils/include/utils/vecz_order.h``.

Preserving debug info is a problem for the barrier pass due to live variables
getting stored in a struct passed as an argument to each of the generated
kernels. As a result the memory locations pointed to by the debug info are out
of date with respect to newly written values. By specifying the ``IsDebug``
flag when creating the pass we can resolve this problem at the expense of
performance.
Preserving debug info is a problem for the ``WorkItemLoopsPass`` due to live
variables getting stored in a struct passed as an argument to each of the
generated kernels. As a result the memory locations pointed to by the debug
info are out of date with respect to newly written values. By specifying the
``IsDebug`` flag when creating the pass we can resolve this problem at the
expense of performance.

When the ``IsDebug`` flag is set the pass adds a new ``alloca`` which contains a
pointer to the live variables struct of the currently executing work-item, since
Expand Down Expand Up @@ -428,8 +428,8 @@ function, as appropriate.
Work-group scheduling (vectorized and scalar loops)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Barrier Pass is responsible for stitching together multiple kernels to make
a single kernel capable of correctly executing all work-items in the
The `WorkItemLoopsPass`_ is responsible for stitching together multiple kernels
to make a single kernel capable of correctly executing all work-items in the
work-group.

In particular, when a kernel has been vectorized with :doc:`/modules/vecz` it
Expand All @@ -438,7 +438,7 @@ vectorized dimension is known to be a multiple of the vectorization factor,
there exists the possibility that some work-items will not be executed by the
vectorized loop.

As such, the Barrier Pass is able to stitch together kernels in several
As such, the `WorkItemLoopsPass`_ is able to stitch together kernels in several
different configurations:

* Vector + scalar loop
Expand All @@ -453,8 +453,8 @@ The vector + scalar kernel combination is considered the default behaviour.
Most often the work-group size is unknown at compile time and thus it must be
assumed that the vector loop may not execute all work-items.

This configuration is used if the Barrier Pass is asked to run on a vectorized
function which has :ref:`\!codeplay_ca_vecz.derived
This configuration is used if the `WorkItemLoopsPass`_ is asked to run on a
vectorized function which has :ref:`\!codeplay_ca_vecz.derived
<specifications/mux-compiler-spec:Metadata>` function metadata linking it back
to its scalar progenitor. In this case, both the vector and scalar kernel
functions are identified and are used. The vector work-items are executed
Expand Down Expand Up @@ -504,7 +504,7 @@ only" mode:
:ref:`TransferKernelMetadataPass or EncodeKernelMetadataPass
<encodekernelmetadatapass>` to encode functions with this information.

* If the Barrier pass has been created with the `ForceNoTail` option.
* If the `WorkItemLoopsPass`_ has been created with the `ForceNoTail` option.
* This is a global toggle for *all* kernels in the program.
* If the kernel has been vectorized with vector predication. In this case the
vector loop is known to handle scalar iterations itself.
Expand Down Expand Up @@ -543,17 +543,17 @@ perform all of the scalar iterations at once.
Vector only
^^^^^^^^^^^

If the Barrier Pass is run on a vectorized kernel for which no `vecz` linking
metadata is found to identify the scalar kernel, or if a scalar kernel is found
but one of the conditions listed above hold, then the kernel is emitted using
the vector kernel only. It is assumed that if no scalar kernel is found it is
because targets know that one is not required.
If the `WorkItemLoopsPass`_ is run on a vectorized kernel for which no `vecz`
linking metadata is found to identify the scalar kernel, or if a scalar kernel
is found but one of the conditions listed above hold, then the kernel is
emitted using the vector kernel only. It is assumed that if no scalar kernel is
found it is because targets know that one is not required.

Scalar only
^^^^^^^^^^^

If the Barrier pass is run on a scalar kernel then only the scalar kernel is
used.
If the `WorkItemLoopsPass`_ is run on a scalar kernel then only the scalar
kernel is used.

OptimalBuiltinReplacementPass
-----------------------------
Expand Down
8 changes: 1 addition & 7 deletions doc/modules/vecz/vecz.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,13 +52,7 @@ requirements.

The `vecz::RunVeczPass` does not delete the original scalar kernel after
vectorization, nor does it transfer the scalar kernel name to the vectorized
function. However, the `mux-kernel` attributes will be transferred. This
attribute is present for a kernel and has the special value `"entry-point"` for
a "kernel entry point", which latter is a kernel that is intended to be
directly executed. The barrier pass works only on kernel entry points, so Vecz
will steal entry point status from the scalar kernel and set entry point status
on its own output instead. This behaviour is at time of writing still subject
to change while the details of multiple vectorization are worked out.
function.

## Target specialization

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -136,12 +136,13 @@ barriers into no-ops when the kernel scheduling information and hardware
vectorization capabilities are known when compiling the kernel.

ComputeMux provides an implementation of the compiler-based approach described
above in the form of a 'barrier pass' that can be used by any ComputeMux
target. With this pass it is possible to easily execute compute kernels that
make use of barriers without requiring any synchronization capabilities from
the hardware other than what is already needed to execute barrier-less kernels.
The Host ComputeMux target that executes kernels on the CPU is an example of a
target that uses this barrier pass to support kernels with barriers.
above in the form of a 'work-item loops pass' that can be used by any
ComputeMux target. With this pass it is possible to easily execute compute
kernels that make use of barriers without requiring any synchronization
capabilities from the hardware other than what is already needed to execute
barrier-less kernels. The Host ComputeMux target that executes kernels on the
CPU is an example of a target that uses this work-item loops pass to support
kernels with barriers.

The Vecz Whole-Function Vectorizer
----------------------------------
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -131,16 +131,16 @@ llvm::ModulePassManager RefSiM1PassMachinery::getLateTargetPasses() {

PM.addPass(vecz::RunVeczPass());

// Verify that any required sub-group size was met.
PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());

addLateBuiltinsPasses(PM, tuner);

compiler::utils::WorkItemLoopsPassOptions WIOpts;
WIOpts.IsDebug = options.opt_disable;
WIOpts.ForceNoTail = env_var_opts.force_no_tail;
PM.addPass(compiler::utils::WorkItemLoopsPass(WIOpts));

// Verify that any required sub-group size was met.
PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());

compiler::addPrepareWorkGroupSchedulingPasses(PM);

compiler::utils::AddKernelWrapperPassOptions KWOpts;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -191,16 +191,16 @@ llvm::ModulePassManager {{cookiecutter.target_name.capitalize()}}PassMachinery::

PM.addPass(vecz::RunVeczPass());

// Verify that any required sub-group size was met.
PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());

addLateBuiltinsPasses(PM, tuner);

compiler::utils::WorkItemLoopsPassOptions WIOpts;
WIOpts.IsDebug = options.opt_disable;
WIOpts.ForceNoTail = env_var_opts.force_no_tail;
PM.addPass(compiler::utils::WorkItemLoopsPass(WIOpts));

// Verify that any required sub-group size was met.
PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());

compiler::addPrepareWorkGroupSchedulingPasses(PM);

compiler::utils::AddKernelWrapperPassOptions KWOpts;
Expand Down
6 changes: 3 additions & 3 deletions modules/compiler/riscv/source/riscv_pass_machinery.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -231,16 +231,16 @@ llvm::ModulePassManager RiscvPassMachinery::getLateTargetPasses() {

PM.addPass(vecz::RunVeczPass());

// Verify that any required sub-group size was met.
PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());

addLateBuiltinsPasses(PM, tuner);

compiler::utils::WorkItemLoopsPassOptions WIOpts;
WIOpts.IsDebug = options.opt_disable;
WIOpts.ForceNoTail = env_var_opts.force_no_tail;
PM.addPass(compiler::utils::WorkItemLoopsPass(WIOpts));

// Verify that any required sub-group size was met.
PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());

compiler::addPrepareWorkGroupSchedulingPasses(PM);

compiler::utils::AddKernelWrapperPassOptions KWOpts;
Expand Down
6 changes: 3 additions & 3 deletions modules/compiler/targets/host/source/HostPassMachinery.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -261,16 +261,16 @@ llvm::ModulePassManager HostPassMachinery::getKernelFinalizationPasses(

PM.addPass(vecz::RunVeczPass());

// Verify that any required sub-group size was met.
PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());

addLateBuiltinsPasses(PM, tuner);

compiler::utils::WorkItemLoopsPassOptions WIOpts;
WIOpts.IsDebug = options.opt_disable;

PM.addPass(compiler::utils::WorkItemLoopsPass(WIOpts));

// Verify that any required sub-group size was met.
PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());

PM.addPass(compiler::utils::AddSchedulingParametersPass());

// With scheduling parameters added, add our work-group loops
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -14,14 +14,16 @@
;
; SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

; Let vecz pick the right vectorization factor for this kernel check that the
; verification pass correctly notes we've satisifed the required sub-group
; size.
; RUN: env muxc --device "%riscv_device" \
; RUN: --passes run-vecz,verify-reqd-sub-group-satisfied < %s \
; Let vecz pick the right vectorization factor for this kernel, wrap the kernel
; up in a loop, and check that the verification pass correctly notes we've
; satisfied the required sub-group size. This implies that we haven't generated
; a separate work-item loop wrapper for the original kernel, as the constraint
; wouldn't have been satisfied on that.
; RUN: muxc --device "%riscv_device" \
; RUN: --passes run-vecz,work-item-loops,verify-reqd-sub-group-satisfied < %s \
; RUN: | FileCheck %s

; CHECK-LABEL: define void @__vecz_v8_bar_sg8(ptr addrspace(1) %in, ptr addrspace(1) %out) #0 !intel_reqd_sub_group_size !0 !codeplay_ca_vecz.derived !{{[0-9]+}} {
; CHECK-LABEL: define{{( internal)?}} void @__vecz_v8_bar_sg8(ptr addrspace(1) %in, ptr addrspace(1) %out) #0 !intel_reqd_sub_group_size !0 !codeplay_ca_vecz.derived !{{[0-9]+}} {

define void @bar_sg8(ptr addrspace(1) %in, ptr addrspace(1) %out) #0 !intel_reqd_sub_group_size !0 {
%id = call i64 @__mux_get_global_id(i32 0)
Expand Down
3 changes: 3 additions & 0 deletions modules/compiler/test/lit/passes/barriers-v-vp.ll
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,9 @@ define internal void @foo() !codeplay_ca_vecz.base !2 !codeplay_ca_vecz.base !3
ret void
}

; Check we've stripped this VP kernel of its 'entry point' status, as it hasn't
; been given work-item loops. Check this by checking there aren't any attributes.
; CHECK: define internal void @__vecz_v2_vp_foo() !codeplay_ca_vecz.derived {{\![0-9]+}} {
define void @__vecz_v2_vp_foo() #0 !codeplay_ca_vecz.derived !5 {
ret void
}
Expand Down
Loading

0 comments on commit b7bb746

Please sign in to comment.