Planning for Fall 2023 #1667

rachitnigam · 2023-08-16T15:20:07Z

rachitnigam
Aug 16, 2023
Maintainer

(Previous vision write up for continuity: #1334)

Goals for Fall '23

With the summer wrapped up, we've completed the implementation of static abstractions and have started seeing Calyx being used out in the wild:

The AMC HLS compiler is using Calyx to build HLS 2.0, and we have a real chance of beating commercial tools not by putting in decades worth of person hours but by building clever language abstractions.
Someone built a Halide to Calyx compiler mostly on their own.
Someone else is using Calyx-generated accelerators in their ARM SoC.
Anshuman's demonstrated that we can quickly reimplement some classic hardware papers using calyx-py.

With all of this, I think we're reaching a critical point and should figure out what to prioritize to make Calyx an awesome toolchain for broader use. I've written up one possible project plan with concrete goals and have tried to align it with the interests of the various people working on the project, but I'd be interested in hearing other ideas too!

Concrete Goals

Push-button flow from PyTorch kernels to FPGA accelerators
calyx-py: A library for Calyx construction
static abstractions: Have your composition and speed it up too

Theme 1: Strengthening the CIRCT Flow

A lot of the serendipitous usage of Calyx has come from people experimenting with the Calyx flow in the CIRCT compiler.
With AMC being a critical project for Calyx, I think it is well past time for us to start working seriously with the CIRCT infrastructure.
CIRCT is giant and amorphous, so we need to find a concrete goal: let's try to ditch the TVM-to-Calyx flow and start getting PyTorch designs working with Calyx using just the Calyx flow in CIRCT.
At a high level, we need to flesh out the following compilation path: pytorch ->* affine -> loopschedule -> calyx.
Building up this capability gives us a serious advantage over basically every existing toolchain: we can compile giant designs and evaluate the impact of our optimizations.
It also provides us with a very concrete set of benchmarks to use in future papers and when selling Calyx to users.
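
To make the `pytorch ->* affine -> loopschedule -> calyx` path a bit more tangible, here is a toy Python sketch that treats each lowering as an edge in a graph and checks whether a route exists end to end. This is only an illustration: `LOWERINGS`, `find_path`, and `missing_stages` are hypothetical names, and `unspecified_lowerings` is a placeholder for the `->*` steps, which are left open above. It does not call into CIRCT or MLIR.

```python
from collections import deque

# Each entry means "a lowering from the key to each listed target is assumed to exist".
# "unspecified_lowerings" is a placeholder for the `->*` steps; it is not a real dialect.
LOWERINGS = {
    "pytorch": ["unspecified_lowerings"],
    "unspecified_lowerings": ["affine"],
    "affine": ["loopschedule"],
    "loopschedule": ["calyx"],
}


def find_path(lowerings, src, dst):
    """Breadth-first search for a chain of lowerings from src to dst."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in lowerings.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of lowerings connects src to dst


def missing_stages(lowerings, planned):
    """Return the (a, b) hops of a planned path that have no lowering yet."""
    return [
        (a, b)
        for a, b in zip(planned, planned[1:])
        if b not in lowerings.get(a, [])
    ]


if __name__ == "__main__":
    print(" -> ".join(find_path(LOWERINGS, "pytorch", "calyx")))
```

A table like this could double as a checklist: `missing_stages` lists which hops of a planned path still need work before the flow is push-button.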

Theme 2: Productive Calyx Generation using `calyx-py`

Not everyone wants to or needs to build whole compiler frontends. In fact, sometimes, it's extremely helpful to build generators for narrow classes of architectures. As Anshuman's recent work demonstrated, we already have the capability to generate complex designs using calyx-py.
I think fleshing out calyx-py in a meaningful way can give us a dramatically new and productive way of building hardware. It is the perfect substrate to experiment with techniques like Exocompilation.
Again, we need to be extremely concrete about what guides us in this process, so let's pick some things we already need to do:

Caleb has been thinking about extending the Systolic Array generator to generate arrays that can perform pre- and post-operations, so we can implement whole kernels like the Attention kernel.
Pai has hand-implemented a HiSparse kernel. Can we find a way to implement it using the builder instead and parameterize it in some interesting dimensions?
Anshuman has been pushing the PIFO implementation. What are some architectures we can build for experimenting with it?
Ethan built a Calyx + Filament implementation of a RISC-V core that uses handwritten Calyx code to coordinate various parts of the processor. We should probably reimplement this in calyx-py.
Pollen is full steam ahead. If the short-term focus is still accelerator generation, can we help them build a metaDSL to generate Calyx programs?

Overall, I'm super excited about this theme. If we're clever, we can even take existing Python-to-C frontends, connect them to the CIRCT-based HLS flow, and really supercharge it. I'm also quite certain that, if done well, there is a paper in here.

Theme 3: Push-button FPGA Use

A push-button FPGA flow for Calyx can be another killer app for us: researchers spend months building the expertise to even use synthesis tools (something we already do in a push-button manner), and most papers don't even mention FPGA runs.
I think a push-button FPGA flow is more than possible, but it requires concerted effort on our part.
Our AXI generator is powerful but has a lot of hard-coded assumptions in it that need to be fixed before we make it work with arbitrary designs.
Concretely, my suggestion here is that we try to use the kernels generated from themes (1) and (2) and run them on FPGAs in a push-button manner.
To do this, we need to take a couple of steps:

Reimplement the AXI generator to produce Calyx instead of Verilog.
Address the hardcoding limitation mentioned in Issues with the AXI implementation #1071.

The Vision

If successful, at the end of the semester, we'll have a shiny push-button flow from CIRCT to generate PyTorch kernels, calyx-py to generate state-of-the-art architectures, and all of it running on an FPGA.
Calyx occupies an interesting place in the ecosystem currently: it is the most developed open-source flow that actually has the potential to be used to build the next generation of accelerator generators.
We can take some very concrete steps to make that happen.

sampsyo · 2023-08-17T18:33:42Z

sampsyo
Aug 17, 2023
Maintainer

Thanks for the awesome overview!! This would be super fun to push forward.

On the builder/generation theme (2): One adjacent way to focus on this (and perhaps turn it into research) would be to aim for compositionality—that is, to address the problem of not simply writing one generator at a time but exposing interfaces so generators can use each other. This could end up somewhere along the lines of #419.

On the theme of FPGA stuff: for more on "can we port our AXI interface into Calyx instead of Verilog," please see #1105. There is a meaty project here in defining abstract accelerator interfaces and separating out the tools to implement them, outlined in #1084. For the full end-to-end "easily run stuff on FPGAs" vision, there are also a few important ancillary things for making this work:

Finish the minor polishing stuff w/r/t the Xilinx toolchain enumerated in Xilinx toolchain #876.
Actually try running all our stuff that currently works on our actual Alveo card.
Build some infrastructure to make the above less of a pain, especially for CI (namely: it's extremely slow and fud can't transparently reuse precious build products such as bitstreams; licensing; and needing physical access to the card for actual execution). This could look like just better fud stuff, or maybe we resurrect some kind of job server for accepting builds and running them on havarti.
Interfaces for off-chip memory, including Project proposal: Board the near-data computing bandwagon via HBM #1106. Making some use of off-chip memory is critical for almost any realistic FPGA accelerator, especially if we're talking about PyTorch models: as a rule, DNNs do not fit in BRAMs.
Actual FPGA-specific optimizations. This could get as broad as we want.
Consider adding Intel/Quartus to prove that we are not overly specialized.
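
One way to picture the "reuse precious build products" item above: a minimal Python sketch of a content-addressed cache, assuming a bitstream is fully determined by the design source plus a toolchain configuration string. `build_key`, `BuildCache`, and the `build` callback are hypothetical names for illustration. None of this is part of fud, and a real version would also have to account for tool versions and licensing.

```python
import hashlib
from pathlib import Path


def build_key(design_src: str, config: str) -> str:
    """Hash the inputs that are assumed to determine the build output."""
    h = hashlib.sha256()
    h.update(config.encode())
    h.update(b"\0")  # separator so ("ab", "c") and ("a", "bc") hash differently
    h.update(design_src.encode())
    return h.hexdigest()


class BuildCache:
    """Stores build artifacts on disk, keyed by a hash of their inputs."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def get_or_build(self, design_src, config, build):
        """Return (artifact_bytes, was_cache_hit); run `build` only on a miss."""
        path = self.root / build_key(design_src, config)
        if path.exists():
            return path.read_bytes(), True
        artifact = build(design_src)  # the slow step, e.g. synthesis
        path.write_bytes(artifact)
        return artifact, False
```

With something like this in front of the slow step, a CI run that sees an unchanged design and configuration would skip synthesis entirely, and a job server could share the same cache directory across users.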

chsasank · 2023-08-31T15:25:30Z

chsasank
Aug 31, 2023

Hi folks, one cool thing we could do is write a compiler from Triton's MLIR to Calyx. Triton is fast becoming the standard for GPU/NPU computing, especially with its adoption in PyTorch 2.0. It'll be fun to write a compiler. I'd be interested in picking this up :).

Hoping to get support and advice!

11 replies

rachitnigam Sep 2, 2023
Maintainer Author

Oh cool! Do you know if the tt dialect can be converted into affine? The affine dialect can already be converted into the MLIR calyx dialect.

chsasank Sep 2, 2023

Not sure, gotta look into this. Very likely will have to write the converter if it's possible in the first place.

chsasank Sep 2, 2023

affine is for polyhedral compilers, right? According to the paper, triton is different from this kind of compilation. See the relevant excerpt.

Also, the documentation pretty much says Triton's design is different: https://triton-lang.org/main/programming-guide/chapter-2/related-work.html

rachitnigam Sep 2, 2023
Maintainer Author

Right, affine is motivated by the idea of supporting polyhedral compilation, but in general it represents a restricted subset of iterative code that can be efficiently turned into pipelined hardware. We also support compilation to the scf dialect, which allows arbitrary programs, but those hardware designs cannot be pipelined as well.

I suspect a lot of the computations Triton actually performs are amenable to an affine representation as well, which is why I suggested looking into it. We have another ongoing MLIR project that is attempting to use the loopschedule dialect from CIRCT to represent both pipelined and unpipelined code together, but it's not done yet.

chsasank Sep 2, 2023

Ok, baby steps right now 😅. I feel it's better to first look at how OpenCL/DPC++/SYCL work and are translated to FPGA code.
