This proposal was previously maintained as a private note. I am moving it here for archival/reference purposes. The work on this project has already begun and is tracked in #1733.
Like a lot of other nominally-FPGA-focused projects, Calyx has historically been evaluated mostly in simulation. We don't often make things run "for real" on actual FPGAs, despite having some hardware that would be well suited for this in the lab.
This proposal is to finally normalize real execution on actual FPGAs for Calyx, by making it practical and even easy. The vision has two big pieces:
- Revitalize the AXI interfaces that are required for host control. Our big idea here is to implement the AXI interfaces in Calyx, instead of with our current ad hoc compiler backend that generates raw Verilog (#1105, "Port (some of) the AXI wrapper to Calyx"). Part of this step will involve standardizing a format for describing accelerator interfaces that can be implemented in multiple ways.
- Build the tools necessary to interact with off-chip memories: namely, we're most excited about HBM, of the kind that's on the Xilinx U50 we have in the lab (#1106, "Project proposal: Board the near-data computing bandwagon via HBM"). Using off-chip memory is an absolute requirement for realistic FPGA-based accelerators, and it also creates the basis for an interesting research project in its own right around generating "near-memory" architectures.
There is also some related infrastructure work to do around fixing bugs, improving testing, and adding features to our AXI interface stuff.
This proposal envisions a sequence of phases for this work.
Phase 0: Clean Up & Prepare
It would be great to start by getting organized w/r/t where we currently stand with Calyx's infrastructure for running on Xilinx FPGAs via a generated AXI interface. Issues with the current tooling are described in #1701 and #876 (and perhaps elsewhere); the goal for this phase would be to take stock of exactly what is working and what is not. There is some inherent uncertainty here, but hopefully we can end up with a coherent list of bugs. We should then classify these bugs into two categories: "fix now" (as in, using the current infrastructure) and "hopefully obviate" (as in, bugs that should just disappear when we rebuild the whole AXI thing as described below, so there's no need to fix them now).
One awesome outcome from this phase would be showing a single accelerator running on the Alveo U50 card we have, hopefully using a single command (even if it takes a while).
This phase will hopefully be short; to the extent that it's not, it might overlap with the next phase.
Phase 1: Accelerator Interface Descriptions
As envisioned in #1084, our goal in this project is to separate AXI generation from the core Calyx compiler. In fact, it goes beyond just AXI: we would also like our current "test harness" interface code to be separated in the same spirit, as outlined in #1603. This kind of separation needs an interface definition language (IDL) specification that describes the "shape" of the interface required.
The idea is to come up with a simple format, probably some kind of JSON blob, for describing the input/output interface of an accelerator. The Calyx compiler would generate this description by looking at the program's code, and various interface tools would consume the description and generate more Calyx code to implement the interface.
Concretely, imagine a Calyx program that looks like this:
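(The sketch below is just illustrative: the memory shapes and the trivial copy logic don't matter, only that the component exposes host-visible memories.)

```
import "primitives/core.futil";

component thing() -> () {
  cells {
    // Host-visible memories: these are what the interface generator
    // needs to know about. (They could be `ref` cells or `@external`
    // memories; see the note below.)
    ref in_mem = std_mem_d1(32, 8, 4);
    ref out_mem = std_mem_d1(32, 8, 4);
    r = std_reg(32);
  }
  wires {
    group read_in {
      in_mem.addr0 = 4'd0;
      r.in = in_mem.read_data;
      r.write_en = 1'd1;
      read_in[done] = r.done;
    }
    group write_out {
      out_mem.addr0 = 4'd0;
      out_mem.write_data = r.out;
      out_mem.write_en = 1'd1;
      write_out[done] = out_mem.done;
    }
  }
  control {
    seq { read_in; write_out; }
  }
}
```

(TK: Unclear if we should be using `ref` cells or `@external`; perhaps it's time for `@external` to die.) We'd want to run a compiler "backend" on this program to produce a description like this (the schema and the field names here are only a sketch; designing the real format is the point of this phase):

```json
{
  "toplevel": "thing",
  "memories": [
    { "name": "in_mem",  "width": 32, "size": 8, "idx_size": 4 },
    { "name": "out_mem", "width": 32, "size": 8, "idx_size": 4 }
  ]
}
```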
And then we want our AXI-generation tool to take this JSON file and produce an AXI interface for the `thing` component, without needing to look back at the Calyx code at all. That means our AXI generator will be a totally standalone, Calyx-generating tool. It could easily be written in Python using calyx-py rather than in Rust, if we want that. (The point is that it will be like any other Calyx generator; it does not need special privilege w/r/t the Calyx compiler.) It will be fud's job to combine the original Calyx code and the new Calyx code for the AXI interface into one giant Verilog design.
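For a sense of the division of labor, here's a rough sketch of what such a standalone "JSON in, Calyx out" tool could look like. The schema fields are the hypothetical ones from the sketch above, the emitted Calyx is just string-pasted for brevity, and a real version would use calyx-py's builder and actually implement the AXI state machine:

```python
# Sketch of a standalone interface generator: read the (hypothetical) JSON
# description, emit Calyx text for a wrapper. A real tool would build the
# program with calyx-py and fill in the AXI logic instead of leaving stubs.
import json
import sys


def generate_wrapper(iface: dict) -> str:
    lines = ["component wrapper() -> () {", "  cells {"]
    for mem in iface["memories"]:
        # One wrapper-side memory per host-visible accelerator memory; the
        # AXI logic that fills and drains it is omitted in this sketch.
        lines.append(
            f"    {mem['name']} = std_mem_d1("
            f"{mem['width']}, {mem['size']}, {mem['idx_size']});"
        )
    lines += [
        "  }",
        "  // ... wires and control implementing the AXI protocol ...",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        print(generate_wrapper(json.load(f)))
```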
This phase will be about designing the IDL format, convincing ourselves that it suffices for both an AXI interface generator as in #1105 and a `readmemh` wrapper generator as in #1603, and implementing a Calyx compiler backend that generates that format.
Phase 2: AXI Generator
This phase is about using the IDL from the previous phase to make a new AXI interface generator. This is a standalone tool that reads the JSON descriptions from above and produces Calyx code that is meant to be "linked" with the original design via its `ref` cells.
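For intuition, the "linking" might look roughly like this sketch: the generated wrapper instantiates the user's `thing` component and binds its `ref` memories to memories that the wrapper's AXI logic owns (names and shapes are again just for illustration):

```
component wrapper() -> () {
  cells {
    // Memories owned by the wrapper; the AXI logic fills and drains these.
    axi_in = std_mem_d1(32, 8, 4);
    axi_out = std_mem_d1(32, 8, 4);
    inner = thing();
  }
  wires {}
  control {
    seq {
      // ... AXI transfers that fill axi_in from the host ...
      invoke inner[in_mem = axi_in, out_mem = axi_out]()();
      // ... AXI transfers that drain axi_out back to the host ...
    }
  }
}
```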
The spec here is straightforward ("work as a drop-in replacement for our current Verilog AST-based AXI interface generator"), but it will surely be a lot of work; it's not worth describing the whole process here, I think.
Testing it will also be something of an ordeal; we probably want to further invest in our current Cocotb-based AXI interface tester first and foremost and, once that's all working well, transition over to trying things out with the Xilinx toolchain and XRT.
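For a sense of scale, a cocotb test is just a Python coroutine driving the simulated wrapper. The sketch below only handles clock and reset, and the port names (ap_clk, ap_rst_n) are assumptions about a Xilinx-style shell rather than settled interface names; the real tests would layer AXI read/write transactions on top of this (e.g., via the cocotbext-axi helpers):

```python
# Minimal cocotb smoke test for a generated AXI wrapper. Port names are
# assumptions, not the finalized interface; real tests would also drive the
# AXI channels.
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge


@cocotb.test()
async def reset_smoke_test(dut):
    """Bring the design out of reset and run a few cycles."""
    cocotb.start_soon(Clock(dut.ap_clk, 2, units="ns").start())
    dut.ap_rst_n.value = 0
    for _ in range(4):
        await RisingEdge(dut.ap_clk)
    dut.ap_rst_n.value = 1
    for _ in range(16):
        await RisingEdge(dut.ap_clk)
```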
Phase 3: Testing & CI?
I'm not entirely sure, but we may want an entire phase here about shoring up our testing infrastructure so that:
- an open-source AXI test harness (like our current Cocotb-based tooling) runs on every commit, for several representative programs
- it's easy to run the Xilinx toolchain too on several programs, probably on our own infrastructure (maybe triggered nightly by GitHub Actions?)
- TK: any other desiderata?
Phase 4: Bursty AXI
Now that we have a robust AXI interface generator and the testing infrastructure to make sure it's functionally correct, let's make it fast! Specifically, we'll extend it to support burst transfers, which are AXI's mechanism for streaming through a bunch of data without doing a big ol' handshake between every single element-sized transfer. For our use case (copy a bunch of memory in, then run the accelerator, then copy a bunch of memory out), the pattern is conceptually simple: we can use as large a burst as possible to transfer as big a chunk of each input/output memory as we can at a time. So this phase is some pretty nitty-gritty hacking, hopefully made feasible by our awesome and robust testing infrastructure and the existence of a known-good baseline.
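To give a flavor of the arithmetic involved: AXI4 INCR bursts carry at most 256 beats, so a bulk copy just gets chopped into maximal bursts. Here's a back-of-the-envelope sketch (which ignores real-world wrinkles like the rule that a burst may not cross a 4 KB address boundary):

```python
# Burst planning for a bulk copy, assuming AXI4 INCR bursts (at most 256
# beats each) and one memory element per beat. Real hardware also has to
# respect the 4 KB boundary rule, unaligned starts, etc.
MAX_BEATS_PER_BURST = 256  # AxLEN is 8 bits; burst length = AxLEN + 1


def plan_bursts(num_elements: int) -> list[int]:
    """Split a copy of `num_elements` elements into a list of burst lengths."""
    bursts = []
    remaining = num_elements
    while remaining > 0:
        beats = min(remaining, MAX_BEATS_PER_BURST)
        bursts.append(beats)
        remaining -= beats
    return bursts


# Copying a 1000-element memory takes 4 bursts instead of 1000 handshakes:
print(plan_bursts(1000))  # [256, 256, 256, 232]
```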
Phase 5: Off-Chip Memory
I'm leaving this phase fairly vague for now because it's somewhat distant, but the idea here is essentially to pursue #1106. Now that we have good off-chip FPGA interfaces, let's put them to use in communicating with DRAM/HBM. I think that actual AXI interfaces will probably be involved in doing this, although I'm not 100% sure; even if it's not literally AXI, it will be something similar. This phase will involve a lot of up-close and personal time with the Xilinx tools and documentation.
Phase 6: Intel
This is also a far-off and vague proposal, but I think we should try to get our actually-run-it-on-FPGA vision working for Intel FPGAs too. The AXI wrappers could work "off the shelf," but we probably need to do a bit more work on fud/etc. to figure out what the host side of the equation looks like.