Replies: 4 comments 4 replies
-
Super cool! Do you have a sense of what the constraints are on the per-vertex computation, if any? Also, out of curiosity, are the edges weighted by the number of individuals with that particular gene path or are all the edges considered identical? |
Beta Was this translation helpful? Give feedback.
-
For an initial prototype, what do you think about writing the main loops as some Dahlia code first? We did this for a baseline with NTT and it has the benefit of giving us a specification to work with. |
Beta Was this translation helpful? Give feedback.
-
One the first steps here would be defining a concrete implementation task (like a loop, or a common computation). Let's enumerate some candidates for this! |
Beta Was this translation helpful? Give feedback.
-
@sampsyo if we have a good starting step, can we remove the "discussion needed" label? |
Beta Was this translation helpful? Give feedback.
-
Alright, here's yet another one of these little "project proposal" write-ups. I'm particularly excited about this one; I think it's concrete, has several good potential outcomes, and has a high chance for success.
The proposal is to build an experimental Calyx frontend for pangenomics queries. (What is pangenomics? Answer below.) To recap why we want to build frontends for exciting application areas:
The application I have in mind is based on a grant we recently got in collaboration with some folks in ECE (Chris Batten is the lead PI) and some domain experts in computational pangenomics. You've heard of genomics; in most cases, people are concerns with assembling and analyzing one individual organism's DNA sequence at a time. But that kind of work can miss information that's present in the varation between many individuals of a single species, i.e., the genes in a whole population. Pangenomics is the study of entire populations' genomes, where individuals are modeled as pairwise variations from each other, rather than just variations on a single reference genome (which can bias what you measure about diverse individuals). Read more motivation, with glamor shot!
Anyway, I have learned that one of the main things that happens in pangenomics workloads is graph traversal. The idea (in my approximate, layperson's understanding) is that you have a big graph where vertices are genes and individuals are represented as paths through the graph. So you could ask, for example, what the popular genes are among all the individual-paths you have in your dataset. That would require walking all of your paths through the graph. This kind of path traversal is a pangenome query.
There is some built-in parallelism in queries at the granularity of paths. Our collaborators made a package called odgi, for example, that has both scaffolding for iterating over sets of paths and for following individual paths through a pangenome graph. See this loop nest for an example query.
The proposal is to build a generator to produce fixed-function pangenome query accelerators. The rough idea would be to hand-roll some scaffolding for doing path traversal in a pangenome graph. The frontend "DSL" would just specify what to do at every vertex. For the aforelinked "node depth" computation, for instance, the per-vertex computation boils down mostly to an accumulation. Hopefully this is enough flexibility to do interesting queries.
As some initial simplifying assumptions, we would do these things, which would want to be relaxed in future iterations for maximum success:
An natural question is how this differs from @rich140's recent work on a graph accelerator. I think we can build on that, but a potential advantage is that this could be a lot simpler because of the constrained style of iteration. We would not allow arbitrary graph computations, as a mainstream accelerator like Graphicionado does. By focusing on this very specific parallelism pattern, maybe we can get some early wins without making things too complex.
Some reasons I think this would be a an especially useful near-term Calyx application:
Beta Was this translation helpful? Give feedback.
All reactions