Merge pull request #130 from frasercrmck/vecz-entry-points

[compiler] Preserve 'entry point' on vectorized kernels
uxlfoundation · Sep 14, 2023 · b7bb746 · b7bb746
2 parents 622ee7f + 638632c
commit b7bb746
Show file tree

Hide file tree

Showing 12 changed files with 297 additions and 86 deletions.
diff --git a/doc/modules/compiler/utils.rst b/doc/modules/compiler/utils.rst
@@ -315,12 +315,12 @@ Attributes>` function attributes for the outermost loops. The logic for the
 dimension unmarshalling lies in
 ``modules/compiler/utils/include/utils/vecz_order.h``.
 
-Preserving debug info is a problem for the barrier pass due to live variables
-getting stored in a struct passed as an argument to each of the generated
-kernels. As a result the memory locations pointed to by the debug info are out
-of date with respect to newly written values. By specifying the ``IsDebug``
-flag when creating the pass we can resolve this problem at the expense of
-performance.
+Preserving debug info is a problem for the ``WorkItemLoopsPass`` due to live
+variables getting stored in a struct passed as an argument to each of the
+generated kernels. As a result the memory locations pointed to by the debug
+info are out of date with respect to newly written values. By specifying the
+``IsDebug`` flag when creating the pass we can resolve this problem at the
+expense of performance.
 
 When the ``IsDebug`` flag is set the pass adds a new ``alloca`` which contains a
 pointer to the live variables struct of the currently executing work-item, since
@@ -428,8 +428,8 @@ function, as appropriate.
 Work-group scheduling (vectorized and scalar loops)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-The Barrier Pass is responsible for stitching together multiple kernels to make
-a single kernel capable of correctly executing all work-items in the
+The `WorkItemLoopsPass`_ is responsible for stitching together multiple kernels
+to make a single kernel capable of correctly executing all work-items in the
 work-group.
 
 In particular, when a kernel has been vectorized with :doc:`/modules/vecz` it
@@ -438,7 +438,7 @@ vectorized dimension is known to be a multiple of the vectorization factor,
 there exists the possibility that some work-items will not be executed by the
 vectorized loop.
 
-As such, the Barrier Pass is able to stitch together kernels in several
+As such, the `WorkItemLoopsPass`_ is able to stitch together kernels in several
 different configurations:
 
 * Vector + scalar loop
@@ -453,8 +453,8 @@ The vector + scalar kernel combination is considered the default behaviour.
 Most often the work-group size is unknown at compile time and thus it must be
 assumed that the vector loop may not execute all work-items.
 
-This configuration is used if the Barrier Pass is asked to run on a vectorized
-function which has :ref:`\!codeplay_ca_vecz.derived
+This configuration is used if the `WorkItemLoopsPass`_ is asked to run on a
+vectorized function which has :ref:`\!codeplay_ca_vecz.derived
 <specifications/mux-compiler-spec:Metadata>` function metadata linking it back
 to its scalar progenitor. In this case, both the vector and scalar kernel
 functions are identified and are used. The vector work-items are executed
@@ -504,7 +504,7 @@ only" mode:
       :ref:`TransferKernelMetadataPass or EncodeKernelMetadataPass
       <encodekernelmetadatapass>` to encode functions with this information.
 
-* If the Barrier pass has been created with the `ForceNoTail` option.
+* If the `WorkItemLoopsPass`_ has been created with the `ForceNoTail` option.
   * This is a global toggle for *all* kernels in the program.
 * If the kernel has been vectorized with vector predication. In this case the
   vector loop is known to handle scalar iterations itself.
@@ -543,17 +543,17 @@ perform all of the scalar iterations at once.
 Vector only
 ^^^^^^^^^^^
 
-If the Barrier Pass is run on a vectorized kernel for which no `vecz` linking
-metadata is found to identify the scalar kernel, or if a scalar kernel is found
-but one of the conditions listed above hold, then the kernel is emitted using
-the vector kernel only. It is assumed that if no scalar kernel is found it is
-because targets know that one is not required.
+If the `WorkItemLoopsPass`_ is run on a vectorized kernel for which no `vecz`
+linking metadata is found to identify the scalar kernel, or if a scalar kernel
+is found but one of the conditions listed above hold, then the kernel is
+emitted using the vector kernel only. It is assumed that if no scalar kernel is
+found it is because targets know that one is not required.
 
 Scalar only
 ^^^^^^^^^^^
 
-If the Barrier pass is run on a scalar kernel then only the scalar kernel is
-used.
+If the `WorkItemLoopsPass`_ is run on a scalar kernel then only the scalar
+kernel is used.
 
 OptimalBuiltinReplacementPass
 -----------------------------

diff --git a/doc/modules/vecz/vecz.md b/doc/modules/vecz/vecz.md
@@ -52,13 +52,7 @@ requirements.
 
 The `vecz::RunVeczPass` does not delete the original scalar kernel after
 vectorization, nor does it transfer the scalar kernel name to the vectorized
-function. However, the `mux-kernel` attributes will be transferred. This
-attribute is present for a kernel and has the special value `"entry-point"` for
-a "kernel entry point", which latter is a kernel that is intended to be
-directly executed. The barrier pass works only on kernel entry points, so Vecz
-will steal entry point status from the scalar kernel and set entry point status
-on its own output instead. This behaviour is at time of writing still subject
-to change while the details of multiple vectorization are worked out.
+function.
 
 ## Target specialization
 

diff --git a/doc/overview/example-scenarios/mapping-algorithms-to-vector-hardware.rst b/doc/overview/example-scenarios/mapping-algorithms-to-vector-hardware.rst
@@ -136,12 +136,13 @@ barriers into no-ops when the kernel scheduling information and hardware
 vectorization capabilities are known when compiling the kernel.
 
 ComputeMux provides an implementation of the compiler-based approach described
-above in the form of a 'barrier pass' that can be used by any ComputeMux
-target. With this pass it is possible to easily execute compute kernels that
-make use of barriers without requiring any synchronization capabilities from
-the hardware other than what is already needed to execute barrier-less kernels.
-The Host ComputeMux target that executes kernels on the CPU is an example of a
-target that uses this barrier pass to support kernels with barriers.
+above in the form of a 'work-item loops pass' that can be used by any
+ComputeMux target. With this pass it is possible to easily execute compute
+kernels that make use of barriers without requiring any synchronization
+capabilities from the hardware other than what is already needed to execute
+barrier-less kernels. The Host ComputeMux target that executes kernels on the
+CPU is an example of a target that uses this work-item loops pass to support
+kernels with barriers.
 
 The Vecz Whole-Function Vectorizer
 ----------------------------------

diff --git a/examples/refsi/refsi_m1/compiler/refsi_m1/source/refsi_pass_machinery.cpp b/examples/refsi/refsi_m1/compiler/refsi_m1/source/refsi_pass_machinery.cpp
@@ -131,16 +131,16 @@ llvm::ModulePassManager RefSiM1PassMachinery::getLateTargetPasses() {
 
   PM.addPass(vecz::RunVeczPass());
 
-  // Verify that any required sub-group size was met.
-  PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());
-
   addLateBuiltinsPasses(PM, tuner);
 
   compiler::utils::WorkItemLoopsPassOptions WIOpts;
   WIOpts.IsDebug = options.opt_disable;
   WIOpts.ForceNoTail = env_var_opts.force_no_tail;
   PM.addPass(compiler::utils::WorkItemLoopsPass(WIOpts));
 
+  // Verify that any required sub-group size was met.
+  PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());
+
   compiler::addPrepareWorkGroupSchedulingPasses(PM);
 
   compiler::utils::AddKernelWrapperPassOptions KWOpts;

diff --git a/...ookie/{{cookiecutter.target_name}}/source/{{cookiecutter.target_name}}_pass_machinery.cpp b/...ookie/{{cookiecutter.target_name}}/source/{{cookiecutter.target_name}}_pass_machinery.cpp
@@ -191,16 +191,16 @@ llvm::ModulePassManager {{cookiecutter.target_name.capitalize()}}PassMachinery::
 
   PM.addPass(vecz::RunVeczPass());
 
-  // Verify that any required sub-group size was met.
-  PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());
-
   addLateBuiltinsPasses(PM, tuner);
 
   compiler::utils::WorkItemLoopsPassOptions WIOpts;
   WIOpts.IsDebug = options.opt_disable;
   WIOpts.ForceNoTail = env_var_opts.force_no_tail;
   PM.addPass(compiler::utils::WorkItemLoopsPass(WIOpts));
 
+  // Verify that any required sub-group size was met.
+  PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());
+
   compiler::addPrepareWorkGroupSchedulingPasses(PM);
 
   compiler::utils::AddKernelWrapperPassOptions KWOpts;

diff --git a/modules/compiler/riscv/source/riscv_pass_machinery.cpp b/modules/compiler/riscv/source/riscv_pass_machinery.cpp
@@ -231,16 +231,16 @@ llvm::ModulePassManager RiscvPassMachinery::getLateTargetPasses() {
 
   PM.addPass(vecz::RunVeczPass());
 
-  // Verify that any required sub-group size was met.
-  PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());
-
   addLateBuiltinsPasses(PM, tuner);
 
   compiler::utils::WorkItemLoopsPassOptions WIOpts;
   WIOpts.IsDebug = options.opt_disable;
   WIOpts.ForceNoTail = env_var_opts.force_no_tail;
   PM.addPass(compiler::utils::WorkItemLoopsPass(WIOpts));
 
+  // Verify that any required sub-group size was met.
+  PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());
+
   compiler::addPrepareWorkGroupSchedulingPasses(PM);
 
   compiler::utils::AddKernelWrapperPassOptions KWOpts;

diff --git a/modules/compiler/targets/host/source/HostPassMachinery.cpp b/modules/compiler/targets/host/source/HostPassMachinery.cpp
@@ -261,16 +261,16 @@ llvm::ModulePassManager HostPassMachinery::getKernelFinalizationPasses(
 
   PM.addPass(vecz::RunVeczPass());
 
-  // Verify that any required sub-group size was met.
-  PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());
-
   addLateBuiltinsPasses(PM, tuner);
 
   compiler::utils::WorkItemLoopsPassOptions WIOpts;
   WIOpts.IsDebug = options.opt_disable;
 
   PM.addPass(compiler::utils::WorkItemLoopsPass(WIOpts));
 
+  // Verify that any required sub-group size was met.
+  PM.addPass(compiler::utils::VerifyReqdSubGroupSizeSatisfiedPass());
+
   PM.addPass(compiler::utils::AddSchedulingParametersPass());
 
   // With scheduling parameters added, add our work-group loops

diff --git a/modules/compiler/targets/riscv/test/lit/passes/verify-reqd-sg-size-2.ll b/modules/compiler/targets/riscv/test/lit/passes/verify-reqd-sg-size-2.ll
@@ -14,14 +14,16 @@
 ;
 ; SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 
-; Let vecz pick the right vectorization factor for this kernel check that the
-; verification pass correctly notes we've satisifed the required sub-group
-; size.
-; RUN: env muxc --device "%riscv_device" \
-; RUN:   --passes run-vecz,verify-reqd-sub-group-satisfied < %s \
+; Let vecz pick the right vectorization factor for this kernel, wrap the kernel
+; up in a loop, and check that the verification pass correctly notes we've
+; satisfied the required sub-group size. This implies that we haven't generated
+; a separate work-item loop wrapper for the original kernel, as the constraint
+; wouldn't have been satisfied on that.
+; RUN: muxc --device "%riscv_device" \
+; RUN:   --passes run-vecz,work-item-loops,verify-reqd-sub-group-satisfied < %s \
 ; RUN: | FileCheck %s
 
-; CHECK-LABEL: define void @__vecz_v8_bar_sg8(ptr addrspace(1) %in, ptr addrspace(1) %out) #0 !intel_reqd_sub_group_size !0 !codeplay_ca_vecz.derived !{{[0-9]+}} {
+; CHECK-LABEL: define{{( internal)?}} void @__vecz_v8_bar_sg8(ptr addrspace(1) %in, ptr addrspace(1) %out) #0 !intel_reqd_sub_group_size !0 !codeplay_ca_vecz.derived !{{[0-9]+}} {
 
 define void @bar_sg8(ptr addrspace(1) %in, ptr addrspace(1) %out) #0 !intel_reqd_sub_group_size !0 {
   %id = call i64 @__mux_get_global_id(i32 0)

diff --git a/modules/compiler/test/lit/passes/barriers-v-vp.ll b/modules/compiler/test/lit/passes/barriers-v-vp.ll
@@ -23,6 +23,9 @@ define internal void @foo() !codeplay_ca_vecz.base !2 !codeplay_ca_vecz.base !3
   ret void
 }
 
+; Check we've stripped this VP kernel of its 'entry point' status, as it hasn't
+; been given work-item loops. Check this by checking there aren't any attributes.
+; CHECK: define internal void @__vecz_v2_vp_foo() !codeplay_ca_vecz.derived {{\![0-9]+}} {
 define void @__vecz_v2_vp_foo() #0 !codeplay_ca_vecz.derived !5 {
   ret void
 }