# Calyx 2.0: Static Everywhere (#1334)

*rachitnigam started this conversation in Ideas*
Calyx's latency-insensitive abstraction is powerful: you can write programs without worrying about cycle-level timing behavior, and the compiler attempts to generate designs with good performance for you.
In the three years of its existence, we've mostly focused on the kinds of resource optimizations and language constructs that Calyx's abstractions enable. We've added the `comb`, `invoke`, `ref`, and `@sync` constructs to the language and explored optimizations like generalized sharing and unsharing, along with traditional software-like optimizations such as constant propagation, inlining, and unrolling. However, the work on Filament has opened my eyes to two fundamental limitations of the Calyx ecosystem as it exists today.
First, time and again we've seen that timing-based optimizations beat anything Calyx's dynamic optimizations can do. The clock is a powerful abstraction in synchronous hardware design because it makes things like synchronization "free". More importantly, the clock is always there: Calyx does not target any hardware that lacks a clock, so not using the clock will always leave performance on the table.
Second, the clock is a necessary abstraction for interacting with the outside world. Interfaces for hardware modules are often defined in terms of cycle-level behavior, and not being able to talk explicitly about the clock in a Calyx program means we have no way to interface with such modules; we have to wrap the Verilog in some latency-insensitive interface and use that. Furthermore, things like #1274 are much harder to do without a clock.
Given this, here is a proposed guiding principle for Calyx 2.0: **take compositional, latency-insensitive descriptions of computations and turn them into performant, latency-sensitive designs.**
## Driving Frontend
Another thing that's been lacking in Calyx is a driving frontend: one that pushes the compiler and language design forward in order to support real, state-of-the-art accelerator design. Our existing frontends, while numerous, are not very competitive with existing HLS tools or research-grade accelerator design languages. Much like clang drove the design of LLVM, we need to go all in on a frontend that needs Calyx to generate good designs.
The two main candidates for this are
## Pipelining
A surefire way to make sure that we can at least express high-performance designs in Calyx is adding pipelining to the language. Being competitive with HLS requires that Calyx designs be able to express latency-sensitive pipelining in some capacity. However, it is not clear how to integrate arbitrary pipelines into Calyx; several specific problems remain open.
## Virtual Operators
An orthogonal design axis that has shown up is the distinction between virtual and physical operators. For example, #1175 shows that HLS tools delay deciding exactly what timing properties a multiplier should have. Similarly, #1151 proposes separating the physical choices for memories from the logical operations they perform (AMC takes this idea much further).
In general, it seems that we would want frontends to use "virtual" operators with latency-insensitive interfaces to schedule computations, and then have the compiler decide how to implement these virtual operators. Of course, the true power of this idea shows up when the compiler also has visibility into the pipelined behavior of these operators so it can, for example, decide what II a loop needs to be pipelined at.
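To make this concrete, here is a minimal sketch of the idea: a pass that lowers a virtual operator to one of several physical implementations based on the II a surrounding loop needs. All names (`VirtualOp`, `PhysicalOp`, `lower`) and numbers are hypothetical, not the actual Calyx compiler API.

```rust
// Hypothetical sketch: choosing a physical implementation for a
// virtual operator. Illustrative only, not real Calyx internals.

/// A virtual operator only promises a latency-insensitive interface.
struct VirtualOp {
    name: &'static str,
}

/// A physical implementation exposes cycle-level timing that the
/// compiler can schedule against.
#[derive(Clone, Copy, Debug, PartialEq)]
struct PhysicalOp {
    latency: u32, // cycles until the result is valid
    ii: u32,      // initiation interval: cycles between issues
}

/// Pick a candidate whose II supports the loop's target II,
/// preferring the lowest latency among the feasible ones.
fn lower(_op: &VirtualOp, candidates: &[PhysicalOp], target_ii: u32) -> Option<PhysicalOp> {
    candidates
        .iter()
        .copied()
        .filter(|p| p.ii <= target_ii)
        .min_by_key(|p| p.latency)
}
```

The point of the sketch is only the division of labor: the frontend schedules against `VirtualOp`, and the compiler, seeing the pipelined behavior of each `PhysicalOp`, makes the final choice.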
## The Path Forward
Calyx is a big enough project that I don't envision a full rewrite of any form to support the above features. Instead, we need to take a gradual approach. In the short term, we have a set of proposals that we can work on:

- A `static` control operator. This will give us really precise control over the exact scheduling of computations.

Along with the implementation of these proposals, we need to evaluate and orient the compiler's design around them. For example:
- `@sync` into `static` (Support `sync` without `std_sync_reg` #1333)

## Other Proposals
I think the above lays out a useful, wholesale vision for Calyx 2.0. However, some other proposals are worth mentioning:
### `@sync` subsumes `par`

The `par` operator in Calyx implements fork-join parallelism. However, the `@sync` operator is much more general and can subsume `par`. We might still want to keep `par` around because it is easier to reason about, but at some point in the compiler middle-end, we should canonicalize `par` into `@sync`.
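As a rough illustration of what such a canonicalization pass could look like, here is a sketch over a toy control AST. The `Control` enum, the `Threads`/`Sync` encoding of barriers, and the `canonicalize` function are all hypothetical stand-ins, not the real Calyx IR or the actual `@sync` semantics.

```rust
// Hypothetical sketch of canonicalizing fork-join `par` into
// barrier-style synchronization on a simplified stand-in AST.
#[derive(Debug, PartialEq)]
enum Control {
    Enable(String),
    Seq(Vec<Control>),
    Par(Vec<Control>),
    /// Unordered threads coordinated only through barriers.
    Threads(Vec<Control>),
    /// `Sync(b, body)`: run `body`, then signal barrier `b`
    /// (a stand-in for `@sync`).
    Sync(u32, Box<Control>),
}

/// Rewrite every `par` into a set of threads that all meet at a
/// fresh barrier, mirroring the proposed middle-end pass.
fn canonicalize(c: Control, next_barrier: &mut u32) -> Control {
    match c {
        Control::Par(children) => {
            let b = *next_barrier;
            *next_barrier += 1;
            Control::Threads(
                children
                    .into_iter()
                    .map(|ch| Control::Sync(b, Box::new(canonicalize(ch, next_barrier))))
                    .collect(),
            )
        }
        Control::Seq(cs) => Control::Seq(
            cs.into_iter().map(|ch| canonicalize(ch, next_barrier)).collect(),
        ),
        other => other,
    }
}
```

The fork-join structure of `par` becomes explicit barrier traffic, which is exactly why the more general barrier form can express schedules that `par` cannot.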
### Indexed IR
Switching from a pointer-based IR to an index-based IR can be useful for many performance reasons, especially for tools like the interpreter (#1183).
A specific approach is to implement components so that they track all cells and ports defined within them using contiguous arrays. A cell is represented by an index into the `cells` array, and a port by an index into the `ports` array. The cell data structure, instead of tracking its ports directly, simply stores a range of indices into the `ports` array.

The benefit of this approach is that iterating over all ports and cells is very cheap. Furthermore, equality checks on cells and ports are also cheap. The interpreter can easily use this representation to keep a flattened state of the instance tree.
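A minimal sketch of this flattened layout might look as follows; the names (`CellIdx`, `Component`, `ports_of`) are illustrative, not the actual Calyx compiler's types.

```rust
// Hypothetical index-based IR storage: cells and ports live in
// flat arrays, and handles are plain integer indices.
use std::ops::Range;

/// A cell handle is just an index into `Component::cells`, so
/// equality checks are integer comparisons.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct CellIdx(u32);

struct Port {
    name: &'static str,
}

/// Instead of owning its ports, a cell records the contiguous
/// range of `Component::ports` that belongs to it.
struct Cell {
    name: &'static str,
    ports: Range<u32>,
}

struct Component {
    cells: Vec<Cell>,
    ports: Vec<Port>,
}

impl Component {
    /// Looking up a cell's ports is a cheap slice of the flat array.
    fn ports_of(&self, cell: CellIdx) -> &[Port] {
        let r = &self.cells[cell.0 as usize].ports;
        &self.ports[r.start as usize..r.end as usize]
    }
}
```

Because everything lives in two flat `Vec`s, iterating over all ports or cells is a linear scan with no pointer chasing, and an interpreter can mirror the same layout for its flattened instance-tree state.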