Compiling PyTorch Models to Calyx #2056
Overview
Hi folks! This post is about my M.Eng project from this past semester. I've been working with @jiahanxie353 to compile PyTorch programs into Calyx (and then Verilog, so we can run them on real hardware). We've successfully lowered a basic feed-forward neural network with one hidden layer and a ReLU activation, written in PyTorch, into Calyx and run it on an FPGA. In the rest of this post, I'll describe what we've done so far to make this pipeline work and what remains for future work. This post extends the discussion here.
Architecture and Implementation
We implemented a pipeline that compiles a simple feed-forward neural network from PyTorch down into Calyx. Doing this required quite a few tools. First, we used Allo to lower from PyTorch into MLIR. Allo is an accelerator design language (ADL) developed by Professor Zhang's research group; it takes machine learning models written in Python and lowers them directly into MLIR (and various MLIR dialects, such as Tensor, Linalg, etc.). From there, we run several MLIR passes and use CIRCT to lower into Calyx, from which we can emit SystemVerilog directly and run the model on an FPGA. Here is a diagram describing the architecture:
[Architecture diagram: PyTorch → Allo → MLIR → CIRCT → Calyx → SystemVerilog → FPGA]
To run the generated Verilog designs on an FPGA, we used Vivado, AMD's toolchain for simulating and programming FPGAs. The PyTorch model we compiled was quite simple. Here is the code:
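For concreteness, a model of the shape described above looks something like the following minimal sketch (the layer sizes here are illustrative, not necessarily the exact ones we used):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """One hidden layer with a ReLU nonlinearity; sizes are illustrative."""
    def __init__(self, in_features=4, hidden=8, out_features=2):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden, out_features)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

model = FeedForward()
out = model(torch.randn(1, 4))  # one forward pass on a random input
```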
Running this through the pipeline took a fair bit of modification to the tools we were using. First, neither CIRCT nor Calyx supported floating-point operations. We implemented them using Berkeley's HardFloat library, which provides floating-point arithmetic modules written in Chisel. Chisel can emit Verilog directly, so we simply wrapped the emitted Verilog modules in Calyx constructs and imported them into our Calyx designs.
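Calyx lets you declare an external Verilog module as a primitive inside an `extern` block. A sketch of such a wrapper, with hypothetical names and simplified ports (the real HardFloat units also expose signals like rounding mode and exception flags):

```
extern "HardFloat/fp_add.sv" {
  // Hypothetical floating-point adder primitive; WIDTH would be 32
  // for single precision. Names and ports are illustrative.
  primitive fp_add[WIDTH](
    @clk clk: 1, @reset reset: 1, @go go: 1,
    left: WIDTH, right: WIDTH
  ) -> (out: WIDTH, @done done: 1);
}
```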
Importing external Verilog this way caused quite a few issues with how Calyx manages dependencies and external modules. We resolved this using Morty, a Rust tool that stitches Verilog files together into one large file containing all of the dependencies. Invoking Morty required some changes to the Calyx backend. Morty takes as input a JSON file that describes all of the dependencies, e.g. something like this (a sketch; the file names are illustrative):
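```json
[
  {
    "include_dirs": ["hardfloat/include"],
    "defines": {},
    "files": [
      "hardfloat/addRecFN.v",
      "hardfloat/mulRecFN.v",
      "main.sv"
    ]
  }
]
```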
Now, instead of searching through the dependencies and trying to stitch them together itself, the Calyx backend iterates through the external dependencies and creates such a JSON file. Then it invokes Morty, which produces a single Verilog file with all of the necessary hardware constructs. The code for this can be found here.
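As a sketch of that flow (written in Python for brevity; the real implementation lives in the Rust backend linked above, and the exact manifest keys and Morty flags here are assumptions):

```python
import json
import subprocess

def pickle_externs(extern_files, output="out.sv"):
    """Write a Morty-style manifest listing the extern Verilog files,
    then invoke Morty to produce one self-contained Verilog file.
    Manifest keys and flag names are assumptions, not the exact backend code."""
    manifest = [{"include_dirs": [], "defines": {}, "files": list(extern_files)}]
    with open("sources.json", "w") as f:
        json.dump(manifest, f)
    # -f: read the manifest; -o: write the stitched-together output.
    subprocess.run(["morty", "-f", "sources.json", "-o", output], check=True)

# Example invocation with illustrative file names:
pickle_externs(["hardfloat/addRecFN.v", "hardfloat/mulRecFN.v", "main.sv"])
```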
After adding support for Morty and floating-point modules to Calyx, we turned to the modifications needed in CIRCT. There are a few fundamental ways in which MLIR and Calyx differ, and they affect how the emitted Calyx code needs to be generated. First, MLIR by default supports "global memories" (things like the weights of the neural network). Calyx has no such thing; it uses external memories instead, and the data for these external memories needs to be supplied in JSON files. We discovered that when the CIRCT compiler saw MLIR code referencing these global memories, it simply chose not to emit anything. We fixed this so that CIRCT now reads the MLIR global data, generates a JSON file containing it, and emits the corresponding lines of Calyx that use the `@external` tag. Doing this required adding JSON support to CIRCT, which may or may not be something the maintainers are okay with us pushing upstream. We made a PR that is still being reviewed and refined.
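For a sense of what this pairing looks like, here is a sketch: an `@external` memory declared in the Calyx program and its matching entry in the JSON data file (the cell name, sizes, and format keys are all illustrative):

```
cells {
  // Hypothetical cell: a 32-bit memory with 4 elements (2-bit address)
  // holding one layer's weights.
  @external weights = comb_mem_d1(32, 4, 2);
}
```

```json
{
  "weights": {
    "data": [0.125, -0.5, 0.25, 1.0],
    "format": { "numeric_type": "ieee754_float", "is_signed": true, "width": 32 }
  }
}
```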
Another issue we ran into is that the higher-level dialects of MLIR support multi-dimensional memory accesses, but the lower-level dialects (e.g., SCF) do not. In other words, code that looks something like `mem[x][y]` will not pass through the pipeline. This required us to manually iterate over the MLIR AST nodes and flatten the data. Once the data was flattened, it was actually easier to write out the JSON file needed by Calyx. In addition to flattening the data itself, we also had to flatten the loops that iterate over it. MLIR supports constructs such as nested loops for traversing multi-dimensional data, but we needed to make every loop nest only one loop deep to support the later translations. An interesting point is that Calyx does support multi-dimensional memories (you can just nest arrays in the `data` attribute of a memory), so it isn't immediately clear why we have to flatten in the intermediate representation. Perhaps this is something somebody could look into in future phases of this project.
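A small sketch of the flattening, with hypothetical shapes (row-major, which is also the layout written into the JSON data file):

```python
# A 2-D access mem[x][y] over an NX-by-NY memory becomes a 1-D access
# flat[x * NY + y], and the doubly nested loop becomes a single loop.
NX, NY = 4, 8
mem = [[float(x * NY + y) for y in range(NY)] for x in range(NX)]

# Row-major flattening of the data.
flat = [mem[x][y] for x in range(NX) for y in range(NY)]

total = 0.0
for i in range(NX * NY):        # one loop, one dimension deep
    x, y = divmod(i, NY)        # recover the 2-D indices if needed
    assert flat[i] == mem[x][y]
    total += flat[i]
```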
The other major task in our CIRCT modifications was adding support for floating-point add, floating-point multiply, and ReLU (the nonlinearity). Adding support for floating-point operations required creating a new operation in CIRCT called `ConstantOp`. The details can be found more concretely in this PR.
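ReLU itself is cheap in hardware: since the values are IEEE-754 floats, it can be realized as a check on the sign bit rather than a full floating-point comparison. A minimal software sketch of that idea (not necessarily the exact lowering in the PR):

```python
import struct

def relu_bits(x_bits: int) -> int:
    """ReLU on the raw 32-bit pattern of an IEEE-754 float: if the sign
    bit is set, output 0.0; otherwise pass the bits through unchanged.
    (This maps -0.0 to +0.0, which is fine for ReLU.)"""
    return 0 if (x_bits >> 31) & 1 else x_bits

def float_to_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b))[0]

assert bits_to_float(relu_bits(float_to_bits(-2.5))) == 0.0
assert bits_to_float(relu_bits(float_to_bits(3.25))) == 3.25
```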
After all of these changes were implemented, we basically had our pipeline! Running this behemoth of a command on the MLIR file produced by Allo will generate Calyx:
Here is some of the generated Calyx:
I've omitted some of the generated code because the file is quite long. Running this code through fud and targeting the FPGA using the Xilinx tools produces our result!
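For reference, the invocation looks roughly like this, assuming fud's standard Xilinx flow (the `fpga` target and `fpga.data` parameter; the file names are placeholders):

```
fud e model.futil --to fpga -s fpga.data data.json
```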
Discussion and Next Steps
This project was successful in proving the concept that we can indeed translate PyTorch models into Calyx and then Verilog for execution on real hardware. The next steps are to refine the stages of the pipeline. Namely:
The above three items are more for the sake of completeness and cleanliness than new research areas or discussions. However, there are many more avenues this project could explore if there is interest within Capra:
Overall, I really enjoyed working on this project this semester! Huge shoutout to @jiahanxie353, who served as my mentor and partner for the semester. Thanks to @rachitnigam and @sampsyo for the advice early in the semester about the design and implementation, and for helping with the debugging. I believe Jiahan will be giving a demo of this at the Calyx meeting on Monday, May 27, so if you want to learn more I strongly encourage you to attend! Feel free to leave your thoughts about the current work and future directions in this discussion thread.