Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fud2] Hypergraph #1958

Closed
sampsyo opened this issue Mar 5, 2024 · 6 comments
Closed

[fud2] Hypergraph #1958

sampsyo opened this issue Mar 5, 2024 · 6 comments
Labels
C: fud2 experimental driver S: Available Can be worked upon

Comments

@sampsyo
Copy link
Contributor

sampsyo commented Mar 5, 2024

Kinda seems like fud2 should become a hypergraph. It's currently a graph: vertices are states, and edges are operations (transformations between states). But the need for ops with two inputs has come up twice now, and the need for ops with two outputs (which may be harder, TBH) has come up once. That would mean that the edges in the graph are no longer edges but directed hyperedges.

The tracker (#1878) captures one use case, for multiple inputs:

The point is that the --set sim.data=foo.json route, where the data input is a "second-class citizen" w/r/t the actual code for simulation, is a hack. We should instead treat foo.json as a proper input, just like the Verilog program, that goes through its own op-based transformation, discovered with a BFS traversal of the op graph. This would let you, for example, do fud2 foo.json -o foo.hex or whatever to run the data conversion alone, and it would allow other formats other than the "blessed" JSON format to work transparently.

In other words: fud2 currently has s simulate op, whose source is simulator (a pre-compiled Verilog simulator executable) and whose destination is dat. We would like this op to have a second source: a new state called simdata or something, which represents a directory full of hex data files. Then we'd have an op transforms from dat to simdata. So a simulation run would look like this:

calyx --calyx--> verilog --icarus--> simulator --+
dat --dat2sim--> simdata ------------------------+--simulate--> dat

…or something roughly like that. You can see the hyperedge in there: simulate has two sources and one destination.

(The other use case that came up recently, in conversation with @nathanielnrn and @eys29, is for YXI stuff. In that case, the end-user behavior is a little different: we only want to provide one input, rather than two as above. But we want that input to be transformed in two different ways and then recombined. Namely, we need to convert Calyx to YXI into an AXI wrapper, and also transform the original Calyx, and then combine those two Calyx files into one. That combination is a hyperedge.)

Anyhow, the big problem here is not so much with just representing the hypergraph (that seems simple enough) but with searching for the shortest path. It's important that users don't have to specify the complete path between two states; they can just give the input and output states (and possibly some --through ops to include) and fud2 finds a path for you. The trouble, of course, is that this wouldn't just be a path anymore—I think it's called a hyperpath, and it's not clear how to find these with a simple hypergraph search.

Here are a few alternatives.

Option 1: Shortest Hyperpath Heuristic

We could read the papers out there about searching for shortest hyperpaths. I'm not entirely sure about the problem statement, and the heuristics available don't look simple. See this page on hyperpaths, for example.

Option 2: Greedy Hyperpath Search

How about this greedy algorithm?

First, we construct a normal graph by flattening the hypergraph. In this new directed graph, there are two kinds of vertices: states (which were vertices in the original hypergraph) and ops (which were hyperedges). Pre-compute the all-pairs shortest paths on this graph.

Given a single destination vertex $t$ and a set of source vertices $s_1, s_2, \dots$ (which are all state vertices), do the following:

  1. Initialize the current hyperpath, $P$, to be the empty set. Initialize the current set of destinations to $\{t\}$ and the current set of sources to $\{s_1, s_2, \dots\}$.
  2. Select an arbitrary destination from the destination set, $t'$.
  3. Consider all the shortest paths $t' \leadsto s_1, t' \leadsto s_2, \dots$. Pick the shortest one among these and add it to $P$. Remove $t$ from the set of destinations and the selected $s_i$ from the set of sources; now both of these are "satisfied."
  4. Look through all the ops currently included in $P$. Do any of these ops have unsatisfied inputs, i.e., sources that are not currently in $P$? Then add their unsatisfied source vertices to the set of destinations, indicating that we still need to satisfy them.
  5. If there are remaining destinations, go to step 2.

It's probably far from optimal, but it might work?

Option 3: Distinguish Secondary Inputs

This one's due to @bcarlet. Maybe we don't want an actual hypergraph; instead, we want ops to keep having a single primary input and primary output state. But ops can also have secondary inputs that they also need. The search algorithm would look a little like the greedy algorithm above, but with less "select an arbitrary destination from the available destinations".

Given a single destination vertex $t$ and a set of source vertices $s_1, s_2, \dots$, do the following:

  1. As above, look at all the shortest paths from $t$ to each of the sources and pick the shortest one. Commit to this path.
  2. Look along the path at all the ops. Do any of them have secondary inputs? If so, recursively call this algorithm on that secondary input state (as destination) and the same set of sources, $s_1, s_2, \dots$.

It is a little weird for ops to need to distinguish between "primary" and "secondary" inputs in this way, but it nonetheless might be a good idea. Maybe it would keep the algorithm more predictable too, if the way people want to think about this is already divided into primary and secondary inputs? In our example above, it seems natural enough that the program itself is a primary input and the data is a secondary input.

In all of this, it's not clear how --through would work for secondary paths. Maybe we don't need it.

@sampsyo sampsyo added S: Discussion needed Issues blocked on discussion C: fud2 experimental driver labels Mar 5, 2024
@sampsyo sampsyo mentioned this issue Mar 5, 2024
22 tasks
@rachitnigam
Copy link
Contributor

I like the idea of primary and secondary paths. fud also has a similar notion with the Path.also_do_path method which specifies how a particular step requires a sequence of actions to generate a particular file. This is used when the symbolic stage provides a spec that needs to be compiled (in addition to the program needing to be compiled).

@rachitnigam
Copy link
Contributor

Is there still top-level discussion needed for this issue? If we have a resolution, can we mark this as "available" instead?

@rachitnigam rachitnigam added S: Needs Triage Issue needs some thinking and removed S: Discussion needed Issues blocked on discussion labels Apr 25, 2024
@jku20
Copy link
Collaborator

jku20 commented Jun 4, 2024

Going to take a go at implementing that with basically option 3:

The plan ignoring the --through:
Represent the hypergraph as a weighted (with 0 and 1) bipartite digraph where nodes represent transitions or states (like as suggested in option 2).
Let the weights initially all be 1.
Order output states by some metric for which should go first (or simply try all permutations or randomize if number of states is small).
I probably will choose some arbitrary metric but it could even be user specified (but I probably won't do that unless it feels needed).
(In option 3, this metric is shortest path to a source node)
Chose the smallest state, then search backwards to find the shortest tree with all source leaves.
Now do repeat the process with the rest of the outputs.

--through puts a bit of a wrench into the above so do this in addition:
Associate ops passed by --through argument with a pair of sets, a subset of sources and a subset of targets.
We then basically perform the above algorithm twice, once to find the stuff passed by --through and once to find the initial input.

Slightly more precisely, when searching backwards from one output, look for both ops nodes passed in by --through instead of inputs.
Make sure to only look at the ops passed in by --through if the output node is also contained in the associated output set.

Then once an op is found, search for sources from that. This is basically what is already done, but now there are multiple targets and sources.
This means we get trees instead of normal paths.

This is certainly not optimal a lot of the time, but I think should work decently well when the multiple inputs have pretty separate paths.
(source, I made it up)

Either way, it should be a good enough implementation to test out what the changes will look like.
It shouldn't be too hard to swap out the path finding function later if this algorithm is too poorly behaved.

@sampsyo
Copy link
Contributor Author

sampsyo commented Jun 5, 2024

Just wanted to record here that we had a very brief discussion synchronously and this sounds like a great plan!!

@sampsyo sampsyo added S: Available Can be worked upon and removed S: Needs Triage Issue needs some thinking labels Jun 5, 2024
jku20 added a commit that referenced this issue Jul 8, 2024
…2134)

This PR begins progress towards addressing #1958. It updates the
algorithm used by `fud2` for finding paths from input files to output
files to take in operations which have many inputs and outputs.

The behavior concurs with the current `fud2` in some but not all cases
and does not maintain `fud2`'s always finding the shortest path of
operations.

Specifying operations themselves and emitting the Ninja file still
assume single state inputs and outputs. This will be changed in further
PRs.
@jku20
Copy link
Collaborator

jku20 commented Aug 5, 2024

This problem was reframed from a hypergraph search to a program synthesis problem. There currently are implemented, limited solutions based on simply enumerating possibilities and egg (limited in enumerating possibilities doesn't scale much further than the current size and the solution using egg misses some plans).

I think it makes sense to close this in favor of another issue more current.

@jku20
Copy link
Collaborator

jku20 commented Aug 5, 2024

See #2247 now for more current information.

@jku20 jku20 closed this as completed Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C: fud2 experimental driver S: Available Can be worked upon
Projects
None yet
Development

No branches or pull requests

3 participants