conclusion.tex

In this paper, we have presented a new Halide scheduling primitive called \code{rfactor}, which makes it possible to factor reductions into multiple stages using the schedule alone, with all the concomitant correctness guarantees. This permits parallelization and vectorization of Halide algorithms which previously required manipulating both the algorithm and schedule.

Although our framework is able to handle a broad range of associative reductions, there are several limitations:

\begin{itemize}
\item Our precomputed table can only recognize reductions decomposable into elementary reductions of a fixed maximum size. The number of expression trees grows very quickly as a function of the number of leaves and the number of tuple components, so this approach is never going to handle operations like multiplying a long sequence of 4x4 matrices, each expressed as a 16-tuple. We would need more aggressive decomposition tools, or a more directed runtime search over the space of associative expressions (as opposed to our exhaustive offline enumeration of it).
\item We can only recognize associative operations that Z3 can prove are associative. This is fortunately a large set. One failing example in this category is summing a sequence of 128-bit integers, where each 128-bit integer is represented as a pair of 64-bit integers, and addition is implemented as elementwise addition plus some logic to handle the carry bit.  We plan to explore additional encodings into Z3 to increase the operations it is able to prove associative.
\item We only recognize reductions where the first stage in the factorization -- the \emph{intermediate} -- has the same form as the original update definition. The simplest failing example in this category is repeated subtraction of a list of values from some initial value. It can be manually factored into an \emph{intermediate} that sums slices of the list, and then a \emph{merge} stage that does repeated subtraction, but \code{rfactor} cannot do this transformation.
\end{itemize}

Despite these caveats, our technique can handle all of the associative reductions we have seen in the wild, which are mostly summations of one or more tuple components, minima, maxima, and \code{argmin}-like operations which find some value associated with an extremum.

\code{rfactor} lifts the burden of factoring a reduction from the programmer. This improves readability, as the \emph{algorithm}, which is entirely responsible for \emph{what} values are computed, is shorter. It also improves portability, as a reduction can be factored in different ways on different platforms, without risking producing different results on each platform. However, we suspect the most important benefit of \code{rfactor} is that by moving this factoring into the schedule alone, \code{rfactor} will make it possible for automatic schedule generation tools to parallelize and vectorize reductions.