-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rethinking Static Compilation #2207
Comments
Awesome!! Thanks for the really helpful walkthrough, @calebmkim. I admit that, while reading, my instinct was also to think about the "reference thread" idea, and you have made a pretty clear case about why that is probably not a good idea (i.e., the relevant guards can get pretty dang complicated). I also very much get your point about the tree representation for FSMs. You are absolutely right that there is need to distinguish parallel children from sequential children. A plain ol' forest doesn't quite seem to cut it, because we would need to keep track of which |
One detail I elided in this issue is that when I went for option (2) creating a new parent FSM for each thread, I made all of the threads the same length (i.e., I would increase the length of all threads to match the longest running thread). This allows for a simple
This generally works fine for binary: even if one thread takes 50 cycles and you increase its latency to 200 cycles, you are only adding 2 bits. However, this can get really bad for one-hot-encoding: it adds 150 unnecessary bits. The alternative is doing the following (suppose threads 1 and 2 take 50 cycles, and thread 3 takes 200 cycles):
The
Longer-Term ThoughtsThe choice of whether to make all threads the same length happens during static inlining, but the choice of choosing OHE/Binary happens during |
Implements the ideas from #2207. Apologies for the gigantic PR, there are a couple reasons why it is so big: 1) It represents a big change in the compiler 2) I didn't know if it was going to even be worth it to implement these changes to the compiler, so I implemented some improvements to the compilation process that complicated the code (but [improved](https://github.com/orgs/calyxir/discussions/2202#discussioncomment-10014608) results). There are some minor changes to `static_inline.rs` (in particular, inlining `static par` blocks is more complicated now because we can't merge always just merge all threads of a `static par` into the same group). There are some changes to `compile_static.rs`, but the main contribution of this PR is the file `static_tree.rs`. I'm still going to write some tests to make sure I'm getting all edge cases for this new tree-looking FSM compilation.
(Some of this post is duplicated from my lab notebook entry).
Recap of Problem
Suppose we have the following control structure (assume all groups take 1 cycle):
FSMs are currently implemented like this:
As you can see, because the parent does not stop counting when it offloads to the child, it can drastically increase the size of the parent register.
I implemented FSMs so that the parent pauses when it offloads to the child:
Note that, unlike with the existing compilation technique, we need an
iteration
FSM to count the number of iterations of the repeat loop (the existing compilation technique doesn't need to do this: it can look at the parent FSM to know when to stop).Results
Doing this new technique has led to some pretty good resource usage and worst slack results (compare the red line with the orange bars on these graphs: https://github.com/orgs/calyxir/discussions/2202#discussioncomment-10038812). But the tl;dr is that if we intelligently handle
static-par
blocks (see below) we can get results that are pretty clearly better than the existing compilation technique.Challenge: compiling
static par
(warning: lots of low-level details)One of the really nice things about compiling the existing compilation technique is that
static par
s are really easy to compile.Suppose we have the following control structure:
The FSM will look like the following:
The nice thing is that each child FSM easily knows exactly when to start since it can refer to the
parent FSM
. For example, the first child needs to start on cycle 5, so it starts when the parent FSM equals 5. The second child needs to start on cycle 22, so it starts when the parent FSM equals 22.Using the new technique:
The challenge is knowing when the second child should start (and this problem obviously generalizes to the third, fourth, etc. children if they exist too). You can't just tell the second child to start when the FSM==22, since that isn't actually the 22nd cycle (becasue of the FSM stopping at 5 to offload to its children).
The way I see it, there are two options. At first, I thought of trying to go for option 1, but then quickly switched to option 2.
Option 1: having a "reference" thread.
(This is NOT my chosen implementation, so you should skip this if you only care about what I actually implemented and not my decisions for why I wanted to implement things the way they are.)
Perhaps the first instinct is try a compilation technique that is most similar to the existing technique (i.e., having a single FSM that all children can read from).
The group of {parent FSM, child FSM, iteration FSM} can serve the same function as the previous parent thread. So for our example, the second child FSM will be activated from cycles 22 through 70. Unlike the existing technique, in which the guard is a simple
22 <= parent fsm < 70
the guard is going to be a bit more complicated.We need to translate the query
%[22:70]
into an fsm guard. We'll start withparent_fsm >= 6
covers cycles 55->70.But we still need to fill out
<some other condition>
cover cycles 22->55. This corresponds to when the parent fsm is offloading to the child FSM.Now we've covered when the
parent_fsm
is offloading, but we still need<some condition involving the child/iterator>
to capture the start condition exaclty at cycle 22 (corresponding to cycle 17 of the child). Let's add that:Basically, we need the condition to start being true on the 2nd cycle 4th iteration (i.e., when
iteration_fsm==3
andchild_fsm==2
), and continue to be true for the remaining iterations (i.e., wheniteration_fsm >=4
).Obviously this type of complicated guard will hurt resource usage and probably worst slack as well. And if the loop is nested deeper than the single loop in this example, this guard can get even more complicated.
One way to mitigate this would be to having a one bit register
guard_reg
that sets itself high whenparent_fsm == 5 & ((iteration_fsm == 3 && child_fsm ==1)
(child_fsm ==1
because it takes one cycle to write to the register, and we want it to become high whenchild_fsm==2
) and then guarding withguard_reg.out
instead... but I'm not sure if this would help or not.I ended up not implementing things using this technique a) because I thought it would be too difficult to implement and b) it might not even be the best in terms of resources/worst slack. But I think it's possible that this actually may be the best choice.
Option 2: creating a new parent FSM for each thread
Basically, we create a new parent FSM for each thread in the
static par
block. Similar to how we compile dynamicpar
blocks, where we basically fire off each thread independently, and then wait for them all to finish.Kind of like this:
This will obviously increases register usage, but I'm hopeful that it will decrease logic (LUT) usage since the guards should be simpler.
Also, this is easier to implement since "fire off each thread and wait for them to finish" can be implemented using the same offloading abstraction used to offload the
static repeat
bodies (see next section).What IR abstractions are useful for implementing this?
This FSM generation can get tricky, so you clearly need some sort of abstraction to help you (for example, none of my examples so far even discussed nested loops, which adds another layer of complication). I settled on a tree structure (although I don't know if that's the best abstraction). This makes handling nested loops much easier (one layer of nesting in the loop <-> one layer in the tree).
Each node in the tree has a body latency and number of repeats associated with it, and each child represents an FSM that the parent FSM has offloaded execution to.
For example, if we had the following control:
We'll get the following tree:
What abstraction should we use for
static par
?The one weird thing about the abstraction is that each child can be a 1) a single tree or 2) a group of trees (this is for
static par
blocks). The specific designation of "a group of trees" is so that I can distinguish between children that need to execute at the same time vs. children that do not execute at the same time.So for example if you had:
The tree would look like:
Code
The code is on the
static-repeat-compilation
branch: https://github.com/calyxir/calyx/tree/static-repeat-compilation.It is very messy and not at all readable. Honestly I would not recommend trying to read the code. I will work on improving readability this week.
The text was updated successfully, but these errors were encountered: