Project proposal: Frontend for pangenomic graph queries #884

sampsyo · 2022-01-26T21:28:49Z

sampsyo
Jan 26, 2022
Maintainer

Alright, here's yet another one of these little "project proposal" write-ups. I'm particularly excited about this one; I think it's concrete, has several good potential outcomes, and has a high chance for success.

The proposal is to build an experimental Calyx frontend for pangenomics queries. (What is pangenomics? Answer below.) To recap why we want to build frontends for exciting application areas:

It demonstrates how awesome Calyx is (i.e., external marketing).
It gives us grist for the mill for understanding how Calyx needs to improve (i.e., it evaluates and motivates our own work).
It can help us write better Calyx-related papers because we have more code to experiment with.

The application I have in mind is based on a grant we recently got in collaboration with some folks in ECE (Chris Batten is the lead PI) and some domain experts in computational pangenomics. You've heard of genomics; in most cases, people are concerned with assembling and analyzing one individual organism's DNA sequence at a time. But that kind of work can miss information that's present in the variation between many individuals of a single species, i.e., the genes in a whole population. Pangenomics is the study of entire populations' genomes, where individuals are modeled as pairwise variations from each other, rather than just variations on a single reference genome (which can bias what you measure about diverse individuals). Read more motivation, with glamor shot!

Anyway, I have learned that one of the main things that happens in pangenomics workloads is graph traversal. The idea (in my approximate, layperson's understanding) is that you have a big graph where vertices are genes and individuals are represented as paths through the graph. So you could ask, for example, what the popular genes are among all the individual-paths you have in your dataset. That would require walking all of your paths through the graph. This kind of path traversal is a pangenome query.
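To make that model concrete, here is a minimal Python sketch. It assumes a toy representation in which each individual is just a list of integer vertex IDs; the names `paths` and `vertex_popularity` and the data are made up for illustration and are not taken from odgi. It reads "popular" as "visited by the most distinct paths", which is only one possible interpretation.

```python
from collections import Counter


def vertex_popularity(paths):
    """Count how many distinct paths visit each vertex at least once.

    `paths` maps a (hypothetical) individual name to a list of vertex IDs.
    """
    counts = Counter()
    for path in paths.values():
        # Use a set so a path that revisits a vertex is counted only once.
        counts.update(set(path))
    return counts


# Toy data: three individuals as paths over vertices 0..4 (made up).
paths = {
    "indiv_a": [0, 1, 2, 4],
    "indiv_b": [0, 1, 3, 4],
    "indiv_c": [0, 2, 2, 4],
}
popularity = vertex_popularity(paths)
```

Answering the query means walking every step of every path, which is the traversal cost described above.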

There is some built-in parallelism in queries at the granularity of paths. Our collaborators made a package called odgi, for example, that has scaffolding both for iterating over sets of paths and for following individual paths through a pangenome graph. See this loop nest for an example query.

The proposal is to build a generator to produce fixed-function pangenome query accelerators. The rough idea would be to hand-roll some scaffolding for doing path traversal in a pangenome graph. The frontend "DSL" would just specify what to do at every vertex. For the aforelinked "node depth" computation, for instance, the per-vertex computation boils down mostly to an accumulation. Hopefully this is enough flexibility to do interesting queries.
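A rough Python sketch of that split between fixed scaffolding and a user-supplied per-vertex function might look like the following. Everything here is hypothetical: `traverse`, `node_depth_step`, and the toy data are illustrative names, not odgi or Calyx APIs, and "node depth" is assumed to mean the number of times any path steps on a vertex.

```python
def traverse(paths, num_vertices, init, step):
    """Hand-rolled scaffolding: walk every path and apply `step` per visit.

    `step(acc, path_id, vertex)` returns the new per-vertex state; it is
    the only part a query author would write in this sketch.
    """
    state = [init] * num_vertices
    for path_id, path in enumerate(paths):  # the outer loop could run in parallel
        for vertex in path:
            state[vertex] = step(state[vertex], path_id, vertex)
    return state


def node_depth_step(acc, path_id, vertex):
    # Assumed "node depth": one more visit to this vertex.
    return acc + 1


# Toy data: three paths over vertices 0..4 (made up).
paths = [[0, 1, 2, 4], [0, 1, 3, 4], [0, 2, 2, 4]]
depth = traverse(paths, 5, 0, node_depth_step)
```

A different query would swap out only `node_depth_step`; whether real queries fit this shape is exactly the open question for the project.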

As initial simplifying assumptions, we would do these things, which we would want to relax in future iterations for maximum success:

Assume all data fits on chip. Don't tangle with off-chip data movement for now, although in practice real pangenomes are very big, so we would need to do this to be truly practical.
This is not a programmable accelerator. Just generate a one-off implementation for a specific query. Adding programmability would be for a future phase.

A natural question is how this differs from @rich140's recent work on a graph accelerator. I think we can build on that, but a potential advantage is that this could be a lot simpler because of the constrained style of iteration. We would not allow arbitrary graph computations, as a mainstream accelerator like Graphicionado does. By focusing on this very specific parallelism pattern, maybe we can get some early wins without making things too complex.

Some reasons I think this would be an especially useful near-term Calyx application:

It's a snazzy application that other accelerator people are not working on.
We have access to world experts in the application domain, who actually want to work with us! This is a hugely valuable resource that other potential frontend projects lack.
We can leverage @rich140's initial work on graph processing.
We would get to collaborate with the rest of the Cornell team working on pangenome stuff.
The initial version is potentially simple enough that we could get something actually useful working within a semester or so.

EclecticGriffin · 2022-01-28T15:58:42Z

EclecticGriffin
Jan 28, 2022
Maintainer

Super cool! Do you have a sense of what the constraints are on the per-vertex computation, if any?

Also, out of curiosity, are the edges weighted by the number of individuals with that particular gene path or are all the edges considered identical?

1 reply

sampsyo Jan 28, 2022
Maintainer Author

That's a great question, and I don't know—I think the breadth of computations that are useful is a main thing to figure out to determine how easy/successful things can be! I'm hoping that the balance goes this way: a relatively constrained set of computations can cover a relatively broad set of useful queries. But if that's wrong, then this project is a lot harder.

I believe that the edges in the path all represent a single individual. However, there is something complicated about these paths and the correspondence to DNA sequences. I think each vertex can correspond to a short snippet of base pairs, and that length can differ for each vertex? I'm not entirely sure. A first phase in this project would be actually understanding a single initial query algorithm so we know what goes into the data structure.

rachitnigam · 2022-01-28T17:18:29Z

rachitnigam
Jan 28, 2022
Maintainer

For an initial prototype, what do you think about writing the main loops as some Dahlia code first? We did this for a baseline with NTT and it has the benefit of giving us a specification to work with.

1 reply

sampsyo Jan 28, 2022
Maintainer Author

Starting with a one-off Dahlia implementation is a good idea! Even if it is very slow (i.e., no parallelism is expressible), this would be a good way to prove it's possible.
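As a sketch of what such a sequential specification could look like, here is the loop nest in Python rather than Dahlia. It uses only fixed-size flat arrays and plain counted loops, in the spirit of on-chip memories. The layout (`path_start` offsets into a flattened `path_steps` array) and all names and sizes are assumptions for illustration, not odgi's actual data structure.

```python
NUM_PATHS = 3
NUM_VERTICES = 5

# Flattened steps of all paths, back to back (toy data, made up).
path_steps = [0, 1, 2, 4, 0, 1, 3, 4, 0, 2, 2, 4]
# path_start[p] .. path_start[p + 1] delimits path p inside path_steps.
path_start = [0, 4, 8, 12]


def node_depth(path_steps, path_start, num_paths, num_vertices):
    """Sequential reference: count visits per vertex using flat arrays only."""
    depth = [0] * num_vertices
    for p in range(num_paths):
        for i in range(path_start[p], path_start[p + 1]):
            depth[path_steps[i]] += 1
    return depth


depth = node_depth(path_steps, path_start, NUM_PATHS, NUM_VERTICES)
```

Even with no parallelism, a version like this gives a reference output to check a generated accelerator against.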

rachitnigam · 2022-01-28T17:19:12Z

rachitnigam
Jan 28, 2022
Maintainer

One of the first steps here would be defining a concrete implementation task (like a loop, or a common computation). Let's enumerate some candidates for this!

1 reply

sampsyo Jan 28, 2022
Maintainer Author

The candidate I had in mind was the one I linked above: the "node depth" computation from odgi.

rachitnigam · 2022-02-10T16:44:13Z

rachitnigam
Feb 10, 2022
Maintainer

@sampsyo if we have a good starting step, can we remove the "discussion needed" label?

1 reply

sampsyo Feb 10, 2022
Maintainer Author

Yep; removed!
