[fud] Get designs to properly execute on FPGA boards: lab notebook #1022

nathanielnrn · 2022-06-07T16:53:36Z

nathanielnrn
Jun 7, 2022
Collaborator

Goal is to create a generalized way for Verilog designs to execute on FPGA boards using Calyx.
Broadly split into the following parts:

Broad Steps

Manually get Verilog to execute on FPGA boards.
Get fud to correctly execute same design on FPGA.
Use fud to correctly execute systolic array on FPGA.
Generalize execution to work on arbitrary designs.

This issue will also be a place for weekly high level notes and updates consisting of the following:

What have you been up to this week?
What questions, if any, do you have based on that work?
What are your plans for next week?

nathanielnrn · 2022-06-08T16:49:20Z

nathanielnrn
Jun 8, 2022
Collaborator Author

Wednesday June 8:
Currently working through tasks in #1020.

This morning was spent setting things up. Finally was able to execute dot product on havarti's fpga. Output still 0.
In the process I messed up and installed things in the wrong place on Havarti, hope to fix that with the help of @sampsyo

Next steps should be writing an AXI interface manually. I have a feeling the next few days will be spent familiarizing myself with AXI.
I'm left wondering if I should complete the manual implementation of things (again #1020) and then compare it to fud's process, or finish one step, then compare it to fud's step. Not sure it really matters.

The rest of today and the coming few days will likely leave me with a lot of questions (and hopefully progress) regarding AXI.

1 reply

sampsyo Jun 8, 2022
Maintainer

Great question! I don't think it would be cheating (morally or practically) to compare against what fud generates, step by step, instead of waiting until the very end to do a comparison. I'd probably recommend doing that to some degree to make sure you don't veer too far off track with the end-to-end implementation.

I think another good place to follow along with the whole "interfacing via AXI" thing would be Xilinx's tutorials for writing what they call "RTL kernels," which basically means Verilog designs that communicate with the host:
https://xilinx.github.io/Vitis-Tutorials/2021-2/build/html/docs/Hardware_Acceleration/Feature_Tutorials/01-rtl_kernel_workflow/README.html

It references an AXI interface implementation in the examples here:
https://github.com/Xilinx/Vitis-Tutorials/tree/2022.1/Hardware_Acceleration/Feature_Tutorials/01-rtl_kernel_workflow/reference-files/src/IP

nathanielnrn · 2022-06-10T22:51:45Z

nathanielnrn
Jun 10, 2022
Collaborator Author

Friday June 10:

Working on #1020.

Seems like the tasks currently listed there may not be the right way forward.
Specifically

Manually write an AXI interface. See here

It might be more useful to examine the produced waveform from hardware emulation (would be happy to hear others' opinions on this).

The above is the main way forward regarding getting designs to run on fpgas.

Beyond that, recently have been working on getting fud to output waveform (.wdb) files during hardware emulation. Currently the process is hardcoded into various python scripts in [fud stages](https://github.com/cucapra/calyx/tree/master/fud/fud/stages/xilinx). Was wondering how much time should be devoted to making this an easier process? Probably by passing in a -s ... phrase to fud.

Next week I'll be examining the produced waveform, hopefully helping us understand what's going wrong during hardware emulation. Fingers crossed. 🤞

1 reply

sampsyo Jun 11, 2022
Maintainer

Cool! I do agree that examining the waveform trace may be an effective route to understanding what's going wrong with our current setup. Thankfully, things seem to behave the same (broken) way on real FPGA executions and in Xilinx RTL emulation, so this should hopefully reveal the problem. Hacking up a hand-rolled AXI interface could be good for learning more about how things work here in general, but it seems like possibly a longer road to identifying the problem (which is not to say that we won't need to turn to that at some point anyway).

Beyond that, recently have been working on getting fud to output waveform (.wdb) files during hardware emulation. Currently the process is hardcoded into various python scripts in fud stages. Was wondering how much time should be devoted to making this an easier process? Probably by passing in a -s ... phrase to fud.

To map the possible design space:

1. one-off hack (just change fud to hard-code it to produce waveforms)
2. config option (fud e foo.xclbin --to fpga -s fpga.waveform true)
3. some way to produce wdb waveforms as actual fud outputs (fud e foo.xclbin -o foo.wdb)

Basically, I think 3 is long-term better than 2: with option 2, you have to also use the config option to save temporaries to even look at these files, which is a bit of a hack. 3 would be nice because it would be a standard fud way to get the appropriate file out, and we could get rid of the vestigial "emulate" stage altogether (the stage currently named wdb).

But I think 2 is way easier to do than 3 so I think it would be a good idea to put that together real quick. That's just a matter of learning fud's pretty-convenient config system. Adding a new output would be a bit trickier (but possible, one way or another).

nathanielnrn · 2022-06-13T22:39:37Z

nathanielnrn
Jun 13, 2022
Collaborator Author

Monday June 13

Working on #1020

Spent today wrangling with fud stages and trying to get

config option (fud e foo.xclbin --to fpga -s fpga.waveform true)

to work.

My stubbornness probably led me to spend a bit too much time on this, but hopefully in the future I'll learn to let go.

As of right now a properly passed flag creates a directory fud-out-n that holds a bunch of debugging files, along with the desired .wdb and .wcfg

Because my attempts at making the above work accidentally deleted my initial .wdb file, I haven't spent any meaningful time examining the waveforms. That will happen tomorrow and in the following days. Hope to find something useful.

0 replies

rachitnigam · 2022-06-14T03:20:39Z

rachitnigam
Jun 14, 2022
Maintainer

Woo! I’d recommend opening a PR for the new fud options once you have a minimal version working

1 reply

sampsyo Jun 15, 2022
Maintainer

I was going to say the same thing! (But I was too slow; this is already open in #1036.)

nathanielnrn · 2022-06-15T23:18:34Z

nathanielnrn
Jun 15, 2022
Collaborator Author

Wednesday June 15

#1020

Recently created #1036, which does a very minimal job of getting the .wdb files we can use to debug. Spent the rest of my time staring at said waveform, Vitis documentation, old issues (#853, #958, #367, #876, etc.), and whatever else I could get my hands on.
The waveform of dot product revealed some funky behavior, namely that the data of the 1st AXI controller gets left-shifted 32 bits every time new data is sent through it. The 2nd AXI controller only passes 0's on its read channel, and the 3rd AXI controller is only reading (also all 0's), even though it should also write.

I'm a bit at a loss for the best direction to pursue from here in order to make concrete progress.

I've contemplated writing an AXI controller from scratch and then generalizing that, but that seems like doing work that's already been done by others (that in general works).
I might stare some more at generated verilog and system verilog files to see if I notice anything, but I feel like I'm not proficient enough for things to pop out.
Ran into the possibility of verifying AXI controllers using a Xilinx provided IP, might be worth looking into?

Basically, there are bits and pieces of things I understand, a whole web of things connecting them that I don't. This in turn makes it hard to know the best way forward from here.

If @sampsyo, @rachitnigam, or anyone else has any recommendations I would be very grateful!

9 replies

sampsyo Jun 16, 2022
Maintainer

Being able to run the whole thing (including the AXI logic) within Verilator would be super awesome... but where would we get the testbench from? We'd need something to play the role of the host side of the AXI conversation. And writing that by hand (communicating with the Verilator simulation API) seems very hard to get right, so we'd want to somehow get our hands on an extant testbench.

sgpthomas Jun 16, 2022
Maintainer

If you haven't done it before, writing an AXI controller from scratch is definitely enlightening and makes it much easier to debug what's going on. So it's a diversion and duplication of work, but I think a good learning tool. Other than that, I've lost context on the current state of this, but I'm happy to chat and help in any way that I can (email me).

W.r.t. Adrian's comment: somewhere, I have some Verilator host code that pretends to be an AXI host. I can try and dig it up, though I think it's on a computer I don't have with me at the moment. I think it was pretty stupid, but worked well enough to test things. I found it very difficult to find open source host code that does stuff like this, so you'll probably need something custom.

rachitnigam Jun 16, 2022
Maintainer

@nathanielnrn i think this is the host.cpp file I linked to you

sampsyo Jun 16, 2022
Maintainer

Thanks, @sgpthomas!! Nice new avatar, btw. 😃

@rachitnigam, do you mean something like this host.cpp file from git history? If so, I don't think that's what we're after here. All these host files we have written for communication with the FPGA tend to be written against the high-level OpenCL API, which requires Xilinx's XRT driver to translate into AXI signals. What we'd need for this Verilator idea is something Verilator-specific that directly manipulates the AXI signals (no OpenCL involved).

sgpthomas Jun 16, 2022
Maintainer

No I'm thinking of a verilator driver that acts as an AXI master and calls the kernel as an AXI child device (really don't like the slave terminology)

nathanielnrn · 2022-06-18T03:44:37Z

nathanielnrn
Jun 18, 2022
Collaborator Author

Friday June 17

Last two days have been spent reading Xilinx docs regarding axi controllers, vitis, and vivado, and trying to get one of their examples to produce a waveform, probably by using commands similar to the ones at the bottom of this page.

Hope to produce and compare this working waveform to the one generated by fud's simulation of dot product and see if any pattern/problems can be recognized from this comparison.

No real blockers/questions fortunately. Unfortunately, the going is a bit slow, but I hope things will pick up pace on Monday.

4 replies

sampsyo Jun 18, 2022
Maintainer

Neat! All sounds great!

Maybe this was already obvious to you, but if you want that Xilinx example to produce a WDB file, you will need to modify the xrt.ini file that it comes with, here:
https://github.com/Xilinx/Vitis_Accel_Examples/blob/master/rtl_kernels/rtl_vadd/xrt.ini

To use the same config option we use here:
https://github.com/cucapra/calyx/blob/09fa363b24232a1867caa82d50ad96179b741140/fud/fud/stages/xilinx/execution.py#L67

Actually running the example should just entail this make command from the example's Makefile:
https://github.com/Xilinx/Vitis_Accel_Examples/blob/01118b2c0845628fd4a8fb45a8fb2124c3e7accf/rtl_kernels/rtl_vadd/Makefile#L74
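For the xrt.ini change described above, here is a minimal sketch of generating the file from Python. It assumes that `debug_mode=batch` under `[Emulation]` is the setting that makes hardware emulation save a waveform database without opening the GUI; please check that against the linked fud line and the XRT docs before relying on it. The function name `make_xrt_ini` is made up for illustration.

```python
import configparser
import io


def make_xrt_ini(waveform: bool) -> str:
    """Return the text of an xrt.ini, optionally asking emulation to save waveforms."""
    ini = configparser.ConfigParser()
    ini.optionxform = str  # keep option names exactly as written
    ini["Emulation"] = {}
    if waveform:
        # Assumption: "batch" saves the waveform without launching the GUI.
        ini["Emulation"]["debug_mode"] = "batch"
    buf = io.StringIO()
    # Write "key=value" with no spaces around the "=".
    ini.write(buf, space_around_delimiters=False)
    return buf.getvalue()


print(make_xrt_ini(waveform=True))
```

The example's existing xrt.ini may already contain other sections, so in practice this setting would be merged into that file rather than replacing it.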

sampsyo Jun 18, 2022
Maintainer

I also had two separate thoughts about possible next steps on this journey here:

Beyond staring at waveforms, it might be helpful to enable other kinds of debug tracing from the XRT host library. This reference page lists several options that could be relevant, including opencl_trace and xrt_trace. These might reveal a problem if we're doing something wrong from the host/OpenCL side.
If we want to assume that something is wrong on the host side, one fun mini-project here could be to reimplement the host side from scratch, as proposed in the first bullet in [fud] Revamp Xilinx fpga stage to work for both emulation and execution #872. I don't think this would be that hard, and we could learn a lot about what we're doing here. Based on @andrewb1999's suggestion in [fud] Get designs to properly execute on FPGA boards: lab notebook #1022 (reply in thread), we could use PYNQ for this: it has a nice-looking XRT wrapper API which might be less fallible than our PyOpenCL route. Even if this "let's just throw it all away and try again" tactic fails to identify the problem, at the very least it will make our fpga stage much more fun to use.

andrewb1999 Jun 18, 2022
Collaborator

As a quick comment here, I did try to run the design with PYNQ and PYNQ never registered the kernel as done running (aka it stalled). This leads me to believe that the ap_done signal is not being asserted correctly or the axilite control conversion is not working correctly for what Vitis expects. A good first goal would be trying to produce an xclbin from Calyx that actually completes execution properly, even if it doesn't actually produce any data output.

It might also be worth it to try using AXI stream instead of full AXI as a way to get something working. AXI stream is natively supported by Vitis/PYNQ and the interfacing is much, much easier (just a handshake for each 32-bit element).
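To make the "never registered the kernel as done" symptom concrete, here is a rough sketch of a host that starts a kernel and polls for completion with a timeout, so a hang is reported instead of waiting forever. It assumes the usual Vitis control register layout (offset 0x00, bit 0 `ap_start`, bit 1 `ap_done`, bit 2 `ap_idle`). `FakeKernel` and `run_with_timeout` are hypothetical stand-ins, not PYNQ or XRT APIs.

```python
import time

AP_CTRL = 0x00      # control register offset in the AXI-Lite register map
AP_START = 1 << 0
AP_DONE = 1 << 1
AP_IDLE = 1 << 2


class FakeKernel:
    """Hypothetical register-mapped kernel: done after N polls, or never if N is None."""

    def __init__(self, polls_until_done):
        self.polls_until_done = polls_until_done
        self.ctrl = AP_IDLE

    def write(self, offset, value):
        if offset == AP_CTRL and value & AP_START:
            self.ctrl = AP_START

    def read(self, offset):
        if offset != AP_CTRL:
            return 0
        if self.ctrl & AP_START and self.polls_until_done is not None:
            if self.polls_until_done == 0:
                self.ctrl = AP_DONE | AP_IDLE
            else:
                self.polls_until_done -= 1
        return self.ctrl


def run_with_timeout(dev, max_polls=1000, delay_s=0.0):
    """Start the kernel; return True if ap_done is seen, False if we give up."""
    dev.write(AP_CTRL, AP_START)
    for _ in range(max_polls):
        if dev.read(AP_CTRL) & AP_DONE:
            return True
        time.sleep(delay_s)
    return False


print("well-behaved kernel finished:", run_with_timeout(FakeKernel(3)))
print("stalled kernel finished:", run_with_timeout(FakeKernel(None), max_polls=50))
```

A kernel that never asserts `ap_done` shows up here as a `False` return, which is the kind of signal a host wrapper could surface.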

sampsyo Jun 20, 2022
Maintainer

Hot tips both; thanks!! Yeah, if the kernel is just running forever, that's worrisome for two reasons: (1) the fact itself, and (2) we didn't detect it with our PyOpenCL invocation.

And thanks for the info about AXI Stream… I didn't know that Vitis/XRT/PYNQ supported this! It could be a much simpler way to go, which would be cool. Maybe we should look into that. It's a bit of a risk-reward thing, since that would be much simpler hardware to realize, but on the other hand, we have kinda-sorta-working implementations for "plain AXI" and the Vitis examples tend to focus on that as well. With the exception of this "mixed kernels" tutorial, for which one component uses AXI Stream to send output back to the host:
https://github.com/Xilinx/Vitis-Tutorials/blob/99efda691610e5a330c1d3ce92ece2cb42b327ab/Hardware_Acceleration/Design_Tutorials/03-rtl_stream_kernel_integration/hw/rtc_gen_kernel.xml#L17

Anyway, I think AXI Stream is worth keeping in mind. But IMO the plan for @nathanielnrn should still be roughly:

Create an isolated environment where everything works, based on the tutorial's RTL and a standalone PYNQ-based host program.
Swap in our current Calyx-generated RTL.
Investigate what is different between the first (working) and the second (non-working).

sampsyo · 2022-06-22T19:54:16Z

sampsyo
Jun 22, 2022
Maintainer

Pair-programming today, we had a grand ol' time trying to get PYNQ to run Xilinx's own example so we have a reference of something that actually works! Namely, our target was to take the Vitis rtl_vadd example and to write a PYNQ driver script that correctly invokes it.

We ran into a problem where PYNQ got confused and crashed when it saw that one of the kernel parameters had type ap_uint<32>. PYNQ has an exhaustive list of types it supports here:
https://cs.github.com/Xilinx/PYNQ/blob/59515a9b5de6fad0ff0538bfc50010b16f53c9a8/pynq/overlay.py#L561

…and indeed, ap_uint<32> is not in there. But this is terribly confusing because we don't know where this type is coming from in the example; that string doesn't appear in the example source code. But indeed, when you use the Makefile to build an .xclbin for this example, it does seem to somehow decide that ap_uint<32> is the type of one of the parameters.

So our next steps are:

Just blatantly hack PYNQ to add this type to its list so we can get this example working.
The goal with that is to get two examples working through a standalone PYNQ driver script: one using Xilinx's own example code (hopefully working!), and one using a Calyx-produced xclbin (probably not working).
Then we will have two things we can compare side-by-side. We can inspect the waveforms that come out of the two to see what differs between them. Or we can use other log files, mentioned in [fud] Get designs to properly execute on FPGA boards: lab notebook #1022 (reply in thread) above. In any case, the idea is to compare something working and something non-working side-by-side.

If PYNQ continues to fail us by trying too hard to be easy to use, my next recommendation is to fall back to using PyXRT directly. It's not fancy but it seems to work. There is even a good-looking example of the host code available, albeit for an HLS (not RTL) kernel. The only trouble is that the current version of XRT that's on Havarti is too old to include PyXRT, so we would need to get a newer version. But thankfully, XRT (being open source) is way easier to install than the "main" Xilinx tools like Vitis and stuff.

9 replies

andrewb1999 Jun 22, 2022
Collaborator

My best guess is v++ is inferring the type to be ap_uint<32> because there is no kernel.xml provided in that example. You could try providing one that specified the types of the inputs as uint32_t. Kernel.xml files are generated by HLS so you could write HLS with the same interface as intended by that RTL and then copy over the kernel.xml and provide it to v++.

sampsyo Jun 22, 2022
Maintainer

Yeah, that is a pretty good guess. I am actually surprised it's possible to package up an .xo without a kernel.xml, but your guess that ap_uint<32> is the fallback when it's missing is a good one.

nathanielnrn Jun 22, 2022
Collaborator Author

I'm currently going ahead with modifying the PYNQ installation to support arbitrary precision integers. Hope that leads us to a good place. The fallback-ing is interesting and I might explore the option of copying over a generated kernel.xml file.

Just to add on as a status update: At the end of the day (2022-06-22) it seems like the provided RTL kernel also tends to hang and not finish. Tomorrow will be spent trying to figure out why.

andrewb1999 Jun 23, 2022
Collaborator

For reference, here is a working RTL kernel with PYNQ: https://github.com/andrewb1999/pynq-ex

sampsyo Jun 23, 2022
Maintainer

Wow; awesome!!

nathanielnrn · 2022-06-24T20:12:08Z

nathanielnrn
Jun 24, 2022
Collaborator Author

Friday June 24

Got PYNQ to work with both calyx generated RTLs and xilinx provided RTLs. In turn spent some time looking at waveforms produced from emulation. Unfortunately the waveforms have not yet proved to be terribly insightful, but I have a feeling I will be revisiting them as I make my way through the examples seen here and comparing them to the generated code calyx produces
main.txt (note that this is a .sv file, github won't let me upload with the correct file extension unfortunately).

Unfortunately the calyx generated code is a bit tough to parse. Any insight anyone has there would be immensely helpful!

Beyond these comparisons, I was surprised to see that in the given example, which uses 2 separate adders to compute the sum of 3 vectors, both IPs have their very own AXI controllers. This seems wasteful to me; I would have assumed that the AXI controller could be shared, especially because the adders are identical as far as I can tell. (Unless something happens during compilation to optimize this redundancy away?) Could also be related to #853.

This also ties into some confusion from this documentation, which is also vector addition (although not necessarily identical to the examples Vitis provides seen above). Specifically, scrolling down to the RTL viewer images, I don't understand why AXI manager 02 has an adder within it. I would have assumed that the AXI controller would be on the same level as the actual kernel (i.e., the adder).

Finally, I found a very friendly blog post that I have a feeling I will be coming back to.

Next week feels like it might be a collaboration heavy one trying to find what exactly might be wrong. I'm also considering simply writing things from scratch, probably with the help of the blog post above, and trying to create something that works from that.

As always I love hearing ideas/thoughts/suggestions on anything and everything and am super appreciative of them!

1 reply

sampsyo Jun 25, 2022
Maintainer

Very cool to have a side-by-side test harness for the two examples, both producing waveforms!

Yeah, that "2 kernels" Vitis example is somewhat odd! It's possible that it's just a contrived/unrealistic example included just to show off a specific pattern: what happens if you have entirely separate kernels, designed completely independently with their own AXI interfaces, and want to stamp them out side-by-side onto your FPGA?

In the direction of comparing waveforms, one big thing that I would try would just be understanding the sequencing of events in the working (i.e., Xilinx example) setup. When do writes happen; how many of them are there; etc.? Can you see the point where the accelerator signals to the host that it is done, and do all the host's "reads" through the AXI interface appear to come after that? Grokking this general workflow as it appears in the working setup would be a good route to understanding how our broken design is falling down. I'm happy to help with this next week!!
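As a rough illustration of the "when do writes happen, and how many" questions above, the sketch below counts AXI transfers in a sampled trace. A transfer happens on any cycle where a channel's VALID and READY are both high. The sample values are made up, and the helper names are hypothetical; real data would come from the exported waveform.

```python
def handshake_cycles(valid, ready):
    """Cycles on which a transfer happens: VALID and READY both high."""
    return [i for i, (v, r) in enumerate(zip(valid, ready)) if v and r]


def first_high(signal):
    """First cycle on which a signal is 1, or None if it never is."""
    return next((i for i, s in enumerate(signal) if s), None)


# Made-up samples, one entry per clock cycle.
trace = {
    "WVALID":  [0, 1, 1, 0, 1, 0, 0, 0],
    "WREADY":  [0, 0, 1, 0, 1, 0, 0, 0],
    "ap_done": [0, 0, 0, 0, 0, 0, 1, 0],
}

writes = handshake_cycles(trace["WVALID"], trace["WREADY"])
done_at = first_high(trace["ap_done"])
writes_before_done = done_at is not None and all(c < done_at for c in writes)

print("write transfers:", len(writes), "at cycles", writes)
print("done asserted at cycle:", done_at)
print("all writes before done:", writes_before_done)
```

Running the same checks over the working and the broken trace gives a small, comparable summary instead of eyeballing the two waveforms.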

You make a good point about an alternative way to spend your time, which could also be productive in the end: implementing stuff from scratch and seeing what you can make work! This could not only be a fun/educational diversion; if it works, you could imagine replacing the current AXI-generation stuff with your own code. If you do this, we could take @andrewb1999's suggestion and try the simpler AXI-Stream variant instead of "OG" AXI or AXI-Lite.

From my perspective, both of these directions could be quite productive and IMO the next step can be up to you… or you can try a little of both.

nathanielnrn · 2022-07-01T23:04:57Z

nathanielnrn
Jul 1, 2022
Collaborator Author

Friday July 1,

This week we pivoted to trying to get a single memory calyx program to work. The generated waveforms indicated a number of errors in the generated verilog code. I've been addressing the issues by manually modifying verilog code and getting them to emulate through Pynq. Eventually this will need to be transferred over to generation code.

Working through the issues incrementally, some progress has been made. Specifically, the computational kernel (the generated main.sv file) can now correctly access the internal BRAM within the AXI controller module to compute 8*4.

Side note: @sampsyo It seems your theory was right: mismatched signal widths between SystemVerilog and Verilog lead to 'z's being produced. Fixing the width resolved this, and the signal is now being driven correctly. Additionally, this means that we need to match the width of Calyx memory to the width of the generated/expected AXI memory controller.

Additionally, a DONE signal on the BRAM wasn't being driven and was causing the toplevel not to write anything to memory. This is now fixed, and a series of writes occurs at the end of the iterator's computational sequence. The final memory value is still 0, which is incorrect, but I feel like we're pretty close.

For those interested, a vcd file of the most recent manually-altered main.sv and toplevel.v files live here.

Here is a screenshot of the relevant part of the simulation.
The first section between the yellow dotted markers is the 32 reads from memory, second section is 8(?) iterative computations and third section is 32 writes to memory that don't work.

At the beginning of next week I will explore @andrewb1999's #1071 and try to make sense of it.

From here we need to see why the writes are still writing 0 to memory, and why both reads and writes occur 32 times. Fixing both of those should bring us pretty close to a working model of the iterator program.

As always, very appreciative of any comments, ideas, and help.

2 replies

sampsyo Jul 2, 2022
Maintainer

Awesome!! I will engage in more detail after the holiday weekend, but for now I just wanted to say that it is INCREDIBLY COOL that you were able to find these two problems (the missing "done" signal within the BRAM, and the width mismatch that led to Verilog Zs when connecting the kernel's address port to the equivalent in the top-level module). WAHOO!!!!

I agree that figuring out why WDATA remains zero is the next step. I'm not too worried about the multiple reads and writes… we can figure it out, but it feels like a benign "over-estimation" on the part of the generated AXI interface.

This also got me thinking about ways to catch problems like this in the future… some ideas (perhaps) in increasing order of difficulty include:

Figure out how to view the synthesis logs in the v++ step of builds. I would hope that, if we can find the appropriate logs, they would tell us about things like width mismatches in that mem_addr0 signal connection.
Rewrite (parts of) the AXI-wrapper setup in Calyx instead of Verilog. Calyx catches problems like this statically. This could also alleviate the manual construction of the state machine in Toplevel.
Write a testbench that lets us test the AXI-wrapped design with Icarus or Verilator instead of XSIM. This could help reveal errors that Vivado silently ignores. Using cocotb's existing AXI host library could be a good way to go.

rachitnigam Jul 4, 2022
Maintainer

Yeah, this is looking super promising!! I've started using cocotb-based testing for my other research project so happy to help figure out how to test the AXI stuff. I think that is a really good long-term way to ensure that AXI stuff works without needing to hook up to an FPGA toolchain

sgpthomas · 2022-07-06T15:54:24Z

sgpthomas
Jul 6, 2022
Maintainer

Finally back in Austin so I had access to my AXI experiments. I threw what I had in a repo: https://github.com/sgpthomas/axi-playground. You may already be past the need for these things but here they are anyways.

I had a simple verilator AXI driver here. Very basic but maybe something to build off of for testing. It implements a simple state machine using the transaction_id to hold the state.

There's also a barebones axi implementation in that repo: file.

It looks like you already found the ZipCPU resource. That was essential in my understanding of things. The other thing I did was dig through the VivadoHLS generated verilog for some Dahlia programs and look at the interfaces it was generating.

1 reply

sampsyo Jul 7, 2022
Maintainer

Super cool; thanks for sharing, @sgpthomas!!! This obviously represents a pretty heroic degree of hacking. The host-side Verilator test harness doesn't look too too complicated:
https://github.com/sgpthomas/axi-playground/blob/34189ec34b077c5f53c38df1a21b66f06edc3287/testbench.cpp

nathanielnrn · 2022-07-11T14:13:39Z

nathanielnrn
Jul 11, 2022
Collaborator Author

Monday July 11

The past week was spent using @andrewb1999 #1071 to make changes to the verilog generation code, located at #1072.
I also spent a day reading up on rust and trying to familiarize myself with the language a bit more.
I had a lot of small, easy changes to make for some of the problems, and now it's taking time for some of the more complex things.

Specifically, I am currently working on #1078, getting ap_idle signal to correctly work. It has taken me a bit of time to comprehend the use of flags and the difference between write and read ports in the axi address space generator.

I'm currently thinking about ways to implement the ap_idle section; the block we need doesn't seem to work well with any of the flags we currently have. It feels wrong to me to have an entire flag that would only be used once for this particular signal, but on the other hand, using a new flag would be in keeping with the current design of the generator. Would appreciate any thoughts on this.

My time this week will be spent working through the issues in #1072, hopefully finishing by the end of this week.

0 replies

nathanielnrn · 2022-07-15T23:31:26Z

nathanielnrn
Jul 15, 2022
Collaborator Author

Friday July 15:

Finished #1072 today, thanks to @rachitnigam @sampsyo and especially @andrewb1999.
Happily, as of today calyx generated code correctly runs a dot product using hardware emulation and PYNQ as a host.
Unfortunately, this is because a dot product returns a scalar.

When trying to get generated code to work on calyx's vectorized add only the 0 index of our output array is correctly written to (from PYNQ's perspective).

One thing that is strange is that PYNQ seemed to think it was writing 8 times, implying that it was either writing 0s to certain indexes or writing at the 0th index multiple times (which would be strange considering that the WADDR changes in the waveform below).

For an input of [0,1,2,3,4,5,6,7] and [300,300,300,300,300,300,300,300],
the output we get is [300, 0, 0, 0, 0, 0, 0, 0]

The waveform of the axi-memory-controller for our output vector looks like this (almost everything is decimal):

I briefly talked to @sampsyo about the waveform, and we struggled to see what might be wrong. WDATA is properly assigned, and handshakes occur on all channels.

In case anyone is interested,
Here is the generated hardware kernel (needs to be converted to .sv)
Here is the generated toplevel (needs to be converted to .v)
Here is the generated kernel.xml (should be a .xml)

Possible ways forward:

Use xsim to try to find what is being seen as a transaction or not (tried working on this today, didn't get far).
Enable axi-transaction-tracing as described in the Xilinx xrt.ini documentation.
Stare at more waveforms.

Happy to be so close, but still have a bit more work to do.

P.S. @andrewb1999 If you're free and have a chance to look into this that could be super helpful. You did in a day what would have taken me a month, so maybe something will pop out to you. Of course, no worries if not.

As always, happy to hear suggestions/comments/feedback/etc!

8 replies

sampsyo Jul 18, 2022
Maintainer

Here's another random idea. The arguments in the kernel.xml look like this:

<arg name="timeout" addressQualifier="0" id="0" port="s_axi_control" size="0x4" offset="0x010" type="uint" hostOffset="0x0" hostSize="0x4"/>
<arg name="A0" addressQualifier="1" id="1" port="m0_axi" size="0x8" offset="0x18" type="int*" hostOffset="0x0" hostSize="0x8"/>
<arg name="B0" addressQualifier="1" id="2" port="m1_axi" size="0x8" offset="0x20" type="int*" hostOffset="0x0" hostSize="0x8"/>
<arg name="Sum0" addressQualifier="1" id="3" port="m2_axi" size="0x8" offset="0x28" type="int*" hostOffset="0x0" hostSize="0x8"/>

I got suspicious about that size argument. Why is it 8, I asked myself? Is that the number of elements, or the size in bytes of a single element, or something else? I checked the official Xilinx documentation for kernel.xml, and it says that size is defined as:

Size of the argument in bytes. The default is 4 bytes.

That's confusing, because I think our elements are 32 bits (4 bytes). The Xilinx docs do not make it clear whether this should be 4 (the size in bytes of a single element) or 4*8=32 (the size in bytes of the entire array), but it doesn't seem like it should be 8. Maybe it's worth a try just cranking up these sizes to see if anything behaves differently?
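A quick way to line the `size` attributes up against the two candidate interpretations is to parse the `arg` entries quoted above and print the numbers next to each other. This is only a sketch: the wrapping `<args>` element is added here so the fragment parses on its own, and the 4-byte element width and 8-element length are assumptions taken from this thread's vector-add example.

```python
import xml.etree.ElementTree as ET

# The four <arg> lines from the kernel.xml above; the <args> wrapper is added
# here only so the fragment is well-formed on its own.
ARGS_XML = """<args>
<arg name="timeout" addressQualifier="0" id="0" port="s_axi_control" size="0x4" offset="0x010" type="uint" hostOffset="0x0" hostSize="0x4"/>
<arg name="A0" addressQualifier="1" id="1" port="m0_axi" size="0x8" offset="0x18" type="int*" hostOffset="0x0" hostSize="0x8"/>
<arg name="B0" addressQualifier="1" id="2" port="m1_axi" size="0x8" offset="0x20" type="int*" hostOffset="0x0" hostSize="0x8"/>
<arg name="Sum0" addressQualifier="1" id="3" port="m2_axi" size="0x8" offset="0x28" type="int*" hostOffset="0x0" hostSize="0x8"/>
</args>"""

ELEMENT_BYTES = 4   # assumption: 32-bit elements
NUM_ELEMENTS = 8    # assumption: 8-element vectors
ARRAY_BYTES = ELEMENT_BYTES * NUM_ELEMENTS

sizes = {}
for arg in ET.fromstring(ARGS_XML).iter("arg"):
    size = int(arg.get("size"), 16)  # hex strings such as "0x8"
    sizes[arg.get("name")] = size
    print(
        f"{arg.get('name'):8s} type={arg.get('type'):5s} size={size:2d} "
        f"one_element={size == ELEMENT_BYTES} whole_array={size == ARRAY_BYTES}"
    )
```

For the three pointer arguments, the size of 8 matches neither the single-element size nor the whole-array size, which is the mismatch being questioned.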

andrewb1999 Jul 18, 2022
Collaborator

I just took a look at this and it seems like the issue is that unaligned writes work differently than I expected. https://community.arm.com/support-forums/f/architectures-and-processors-forum/3518/does-an-axi4-master-have-to-assert-the-correct-wstrb-for-unaligned-transfers

The important note here is that axi slaves are allowed to ignore unaligned access information, and therefore wstrb and wdata have to be consistent with the case when the address is aligned to the nearest 512 bit location (same as dropping the bottom bits of the address). So in the current implementation, the valid wdata data is always located at WDATA[31:0], causing the data to get overwritten every time.

This seems like a very weird way for unaligned transfers to work (in particular, it is different from how unaligned reads work), but it is stated clearly in the AXI4 documentation. I think it works this way to simplify bursting, but I'm not quite sure.

Fortunately, there is an easy fix here. I just changed the way WDATA and WSTRB are set as follows:

    assign WDATA = {{15{32'b0}}, bram_read_data} << (send_addr_offset * data_element_width);
    assign WSTRB = {{15{4'h0}}, 4'hF} << (send_addr_offset * (data_element_width / 8));

so for example in this case where the data_element_width is 32 bits I set WDATA and WSTRB as follows:

    assign WDATA = {{15{32'b0}}, bram_read_data} << (send_addr_offset * 32);
    assign WSTRB = {{15{4'h0}}, 4'hF} << (send_addr_offset * 4);

Hopefully that all makes sense. I only tested this in xsim, so Adrian's point about the kernel.xml might also be correct, I haven't tried it to see if it actually makes a difference.
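For anyone following along, here is a small Python model of the lane arithmetic in the two `assign` statements above. It is only a sketch: it assumes `send_addr_offset` counts elements within the 512-bit bus word (which is what the shift amounts imply) and that it is derived from the low address bits, which the original snippet does not show. The function names are made up for illustration.

```python
BUS_BITS = 512                     # AXI data bus width discussed above
ELEM_BITS = 32                     # data_element_width in this example
BUS_BYTES = BUS_BITS // 8          # 64 bytes per bus word
ELEM_BYTES = ELEM_BITS // 8        # 4 bytes per element


def lane_for_address(addr):
    """Element lane inside the 512-bit word (assumed meaning of send_addr_offset)."""
    return (addr % BUS_BYTES) // ELEM_BYTES


def wdata_wstrb(addr, value):
    """Mirror the two assigns: shift the element and its strobe bits into its lane."""
    lane = lane_for_address(addr)
    wdata = (value & ((1 << ELEM_BITS) - 1)) << (lane * ELEM_BITS)
    wstrb = 0xF << (lane * ELEM_BYTES)
    return wdata, wstrb


if __name__ == "__main__":
    for addr in (8192, 8196, 8200):
        wdata, wstrb = wdata_wstrb(addr, 0xDEADBEEF)
        print(addr, lane_for_address(addr), hex(wstrb), hex(wdata))
```

With a 64-byte bus word, address 8192 lands in lane 0, 8196 in lane 1 (data on bits 63:32, strobe `0xf0`), and 8200 in lane 2.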

nathanielnrn Jul 18, 2022
Collaborator Author

That's confusing, because I think our elements are 32 bits (4 bytes). The Xilinx docs do not make it clear whether this should be 4 (the size in bytes of a single element) or 4*8=32 (the size in bytes of the entire array), but it doesn't seem like it should be 8. Maybe it's worth a try just cranking up these sizes to see if anything behaves differently?

Noticed this too, could it be a 64-bit pointer to the array we pass in?
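One quick way to look at this is to parse the `<arg>` entries and compare the numbers. The sketch below wraps the three lines quoted above in a hypothetical `<args>` root element so they parse on their own; the 4-byte element size and 8-element array length are assumptions taken from the "4*8=32" estimate, not from the docs.

```python
import xml.etree.ElementTree as ET

# The three <arg> entries from kernel.xml, wrapped in a hypothetical <args>
# root so the fragment is well-formed by itself.
KERNEL_ARGS = """
<args>
  <arg name="A0" addressQualifier="1" id="1" port="m0_axi" size="0x8" offset="0x18" type="int*" hostOffset="0x0" hostSize="0x8"/>
  <arg name="B0" addressQualifier="1" id="2" port="m1_axi" size="0x8" offset="0x20" type="int*" hostOffset="0x0" hostSize="0x8"/>
  <arg name="Sum0" addressQualifier="1" id="3" port="m2_axi" size="0x8" offset="0x28" type="int*" hostOffset="0x0" hostSize="0x8"/>
</args>
"""

ELEMENT_BYTES = 4   # assumption: 32-bit elements
NUM_ELEMENTS = 8    # assumption: 8-element arrays


def arg_sizes(xml_text):
    """Return {name: (size in bytes, register offset)} from the hex attributes."""
    root = ET.fromstring(xml_text)
    return {
        arg.get("name"): (int(arg.get("size"), 16), int(arg.get("offset"), 16))
        for arg in root.iter("arg")
    }


if __name__ == "__main__":
    for name, (size, offset) in arg_sizes(KERNEL_ARGS).items():
        print(f"{name}: size={size} bytes, offset={offset:#x}")
    print("one element:", ELEMENT_BYTES, "bytes")
    print("whole array:", ELEMENT_BYTES * NUM_ELEMENTS, "bytes")
```

The register offsets (0x18, 0x20, 0x28) step by 8, the same as `size`. That would be consistent with each argument being an 8-byte pointer slot, though it doesn't prove that is what `size` means.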

nathanielnrn Jul 18, 2022
Collaborator Author

The important note here is that axi slaves are allowed to ignore unaligned access information, and therefore wstrb and wdata have to be consistent with the case when the address is aligned to the nearest 512 bit location (same as dropping the bottom bits of the address). So in the current implementation, the valid wdata data is always located at WDATA[31:0], causing the data to get overwritten every time.

This seems like a very weird way for unaligned transfers to work (in particular, it is different from how unaligned reads work), but it is stated clearly in the AXI4 documentation. I think it works this way to simplify bursting, but I'm not quite sure.

Fortunately, there is an easy fix here. I just changed the way WDATA and WSTRB are set as follows:
    assign WDATA = {{15{32'b0}}, bram_read_data} << (send_addr_offset * data_element_width);
    assign WSTRB = {{15{4'h0}}, 4'hF} << (send_addr_offset * (data_element_width / 8));
so for example in this case where the data_element_width is 32 bits I set WDATA and WSTRB as follows:
    assign WDATA = {{15{32'b0}}, bram_read_data} << (send_addr_offset * 32);
    assign WSTRB = {{15{4'h0}}, 4'hF} << (send_addr_offset * 4);

I'm a bit slow on the uptake, is this the right idea? @andrewb1999
Our AWADDR starts at 8192 (512-bit aligned) and is then incremented by 4 to 8196. So when the host is looking for what to write into address 8196, it necessarily looks at WDATA[63:32]; writing to 8200 would look at WDATA[95:64], etc.? (The strobe mask makes sense to me.)

If this is the case, one thing that confuses me is the example provided by Xilinx. Compiling this and simulating gives the following waveform for the toplevel:

Looking at the highlighted signals, the AWADDR doesn't change, even though all of the WDATA gets written to the output buffer during PYNQ simulation. Does this have to do with the burst type in this example being INCR? (As mentioned on page 48)

andrewb1999 Jul 18, 2022
Collaborator

Yeah I think your understanding is correct. That example is different because it uses bursts. The address is only written once and then the internal address is incremented for every WDATA transaction, until WLAST is asserted. The Calyx axi implementation is single-beat, meaning that a single transaction is made at a time and the master waits for that transaction to complete before starting the next one. This simplifies the controller a lot, but bursting can be used in the future to improve throughput.
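To make the difference concrete, here is a rough Python sketch of what shows up on the address channel in each style. It is a toy model, not the Calyx controller or the Xilinx example: the function names are invented, and it assumes an aligned start address and a fixed number of bytes per beat.

```python
def single_beat_writes(base, n_elems, elem_bytes=4):
    """Single-beat style: every element gets its own address handshake.

    Returns a list of (awaddr, awlen) pairs. AWLEN is beats minus one,
    so it is 0 for each single-beat transaction.
    """
    return [(base + i * elem_bytes, 0) for i in range(n_elems)]


def incr_burst_write(base, n_beats, beat_bytes=4):
    """INCR burst: one address handshake, then n_beats data beats.

    Returns (awaddr, awlen, beats), where beats is a list of
    (effective address, wlast). The effective address is tracked inside
    the subordinate and never appears on AWADDR.
    """
    beats = [
        (base + i * beat_bytes, i == n_beats - 1)
        for i in range(n_beats)
    ]
    return base, n_beats - 1, beats


if __name__ == "__main__":
    print(single_beat_writes(8192, 4))
    print(incr_burst_write(8192, 4))
```

In the burst case AWADDR stays at the base address for the whole transfer, which matches what the waveform shows.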

nathanielnrn · 2022-07-18T21:24:54Z

nathanielnrn
Jul 18, 2022
Collaborator Author

Monday July 18,

As of today, when #1109 is merged, #1072 should be done and we should be able to perform correct hardware emulation for calyx programs (at least the ones we have as examples: vector addition, dot product, and iteration).

Talked with @rachitnigam about steps forward, which are as follows:

Create runt tests for AXI generation (Add Runt tests for AXI generation #1107)
EDIT: Make hardware emulation/actual FPGA runs go through fud (Integrate PYNQ into fud stage #1114)
Run calyx output on actual hardware
Learn cocotb for the sake of creating a test harness (Non-Xilinx AXI test harness #1104)

The eventual goal is to have CI testing with cocotb to ensure nothing breaks in our generated AXI code.

Additionally, a parallel existing issue to tackle is #1084, which shouldn't be complex but might be a little tedious. I plan to work on this in my spare time or when I need a break from whatever I'm doing.

1 reply

sampsyo Jul 18, 2022
Maintainer

Ridiculously cool!! It's awesome that this is working!!

If I may suggest inserting an extra step here between 1 and 2 above, I think there's one thing that would be good for quality of life: making this all work in fud. I think so far you've been manually invoking your PYNQ host program to run a given .xclbin; is that right? It shouldn't be too much effort to integrate this back into fud, just by making fud execute your PYNQ driver script as a subprocess. This would make it super easy to rerun stuff when doing item 3 above. As a bonus, this would address #1037 by automatically setting up the environment variables for that host-code execution.
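A minimal sketch of the "run the driver as a subprocess" idea, in plain Python rather than fud's actual stage API (which is not shown here). `XCL_EMULATION_MODE` is the variable Xilinx's runtime reads to select emulation; whether that is the setting #1037 is about is an assumption, and `run_host_script` is a made-up helper name.

```python
import os
import subprocess
import sys


def run_host_script(cmd, emu_mode=None):
    """Run a host command with the emulation environment set up for it.

    cmd      -- argument list, e.g. [python, driver_script, path_to_xclbin]
    emu_mode -- value for XCL_EMULATION_MODE (e.g. "hw_emu"), or None to
                leave the variable as inherited from the caller.
    """
    env = dict(os.environ)          # copy, so the caller's environment is untouched
    if emu_mode is not None:
        env["XCL_EMULATION_MODE"] = emu_mode
    result = subprocess.run(
        cmd, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for the real driver: a child Python that just reports the variable.
    probe = [sys.executable, "-c",
             "import os; print(os.environ.get('XCL_EMULATION_MODE'))"]
    print(run_host_script(probe, emu_mode="hw_emu"))
```

Because the environment is set on the child process only, nobody has to remember to export anything in their shell before rerunning.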

nathanielnrn · 2022-07-22T23:05:37Z

nathanielnrn
Jul 22, 2022
Collaborator Author

Friday July 22,
Finished #1107 and the more substantial #1114.
Currently working on #1104.

Tried to use a combination of cocotb documentation, cocotbext-axi code, Andrew's advice and Rachit's code to create the outline of a cocotb harness.

It is very messy and still very much an outline. I'm still figuring out cocotb and trying to work out a "standard", best-practices form of writing these from the examples I linked above. Luckily I'm not really stuck on anything; these things just take me a bit of time.

Next week I am unfortunately not going to be very available, but in general the current plan is to continue working on this until it is finished.

1 reply

sampsyo Jul 23, 2022
Maintainer

Awesome!! Indeed; looks like this will take some time to explore.

nathanielnrn · 2022-08-05T14:14:46Z

nathanielnrn
Aug 5, 2022
Collaborator Author

Friday August 5th,

Since last time, a bug in creating vcd files was found and fixed (#1127).
Additionally, I finally got a version of a cocotb testbench to minimally integrate with our kernel (aka no runtime errors).

Determining the control signals that need to be sent to the kernel's subordinate-control-module is the next step. This will probably end up tying into #1138. Determining the reads/writes to perform on our manager memory-controllers will probably be easier, and similar to setting up the rams that currently appear in the prototype test bench.

I feel like I'm making slow progress, yet can't put my finger on any one thing that is blocking me. Might just be a matter of taking time to understand things better. As always, any suggestions/tips regarding anything here are appreciated.

2 replies

rachitnigam Aug 5, 2022
Maintainer

Sweet! What does the cocotb testbench currently do? Does it not complete a full transaction with the kernel (i.e. send data, run compute, read data)?

nathanielnrn Aug 8, 2022
Collaborator Author

The test bench doesn't test anything meaningful currently. It instantiates some rams that are connected to our toplevel and writes to them. It also contains some helper methods to properly read and write data based on the widths defined in calyx. Beyond that, reimplementing the host manually has not been easy, so it will probably take some time to get there.
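As a rough idea of what width-aware helpers like that can look like, here is a standalone sketch. It is not the code from the test bench: the names are hypothetical, and it assumes little-endian byte order and widths that are a multiple of 8 bits.

```python
def pack_words(values, width_bits):
    """Pack integers into bytes, width_bits per value, little-endian."""
    if width_bits % 8 != 0:
        raise ValueError("width must be a whole number of bytes")
    nbytes = width_bits // 8
    mask = (1 << width_bits) - 1    # truncate values to the memory width
    return b"".join((v & mask).to_bytes(nbytes, "little") for v in values)


def unpack_words(data, width_bits):
    """Inverse of pack_words: split bytes back into unsigned integers."""
    if width_bits % 8 != 0:
        raise ValueError("width must be a whole number of bytes")
    nbytes = width_bits // 8
    if len(data) % nbytes != 0:
        raise ValueError("data length is not a multiple of the word size")
    return [
        int.from_bytes(data[i:i + nbytes], "little")
        for i in range(0, len(data), nbytes)
    ]


if __name__ == "__main__":
    raw = pack_words([1, 2, 3], 32)
    print(raw.hex(), unpack_words(raw, 32))
```

The packed bytes are the form a byte-addressed ram model would take, and `unpack_words` turns a read-back buffer into values that can be compared against expected outputs.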

nathanielnrn · 2022-08-11T15:28:44Z

nathanielnrn
Aug 11, 2022
Collaborator Author

Thursday August 11,

Happy to report that some progress has been made. A very messy axi_test.py succeeds in writing the output of our computational kernel to a cocotbext-axi ram.

Lots of stuff in this minimal working version is hard-coded, so from here we need to generalize. However, I'm hoping that this will end up similar to what I had to do for PYNQ. After having a generalized version working on multiple calyx programs, I will need to integrate it into our CI flow. Hoping that's not too difficult.

Happily nothing is blocking me at the moment.

2 replies

rachitnigam Aug 11, 2022
Maintainer

Wooo! Congratulations!! This looks like a great starting point to generalize to something for bigger kernels in the future.

sampsyo Aug 11, 2022
Maintainer

Wow! That is really really cool!!! I don't even think your axi_test.py is all that messy; looks like you've already gotten some abstraction set up, and it's not just one big straight-line mess or anything. Really cool that this works for vadd; let us know how we can help with the generalization!
