Replies: 4 comments 1 reply
-
This one also performs matrix-vector multiplication, like the previous code, but with the matrix transposed.
Since we can perform both
-
TL;DR: PCA on tree branches successfully recovers population structure and is very fast. I implemented randomized SVD on the individual-edge design matrix. Randomized SVD only requires the map v |-> Av, so the full matrix A never has to be formed. As long as that linear algebra is fast, the randomized SVD is efficient as well. The functions are:
The simulation code is:
This takes about 5.08 s on my machine (a laptop).
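For readers who have not seen the technique, here is a minimal sketch of randomized SVD that touches the matrix only through matrix-vector products. It is not the code from this comment. The names `randomized_svd`, `matvec` and `rmatvec`, and the oversampling and power-iteration defaults, are assumptions made for illustration. The sketch uses both v |-> Av and the transposed product u |-> A^T u; in the tree setting those two callables would be replaced by tree-based products, while here a small dense matrix stands in for A.

```python
import numpy as np

def randomized_svd(matvec, rmatvec, n_cols, rank, n_oversamples=10, n_iter=4, seed=0):
    """Approximate the top `rank` singular triplets of a matrix A.

    A is accessed only through `matvec(X) = A @ X` and `rmatvec(X) = A.T @ X`
    (hypothetical callables); A itself is never needed.
    """
    rng = np.random.default_rng(seed)
    k = rank + n_oversamples
    # Sketch the column space of A with a random Gaussian test matrix.
    Q, _ = np.linalg.qr(matvec(rng.standard_normal((n_cols, k))))
    # Power iterations sharpen the sketch; QR keeps the basis well conditioned.
    for _ in range(n_iter):
        Z, _ = np.linalg.qr(rmatvec(Q))
        Q, _ = np.linalg.qr(matvec(Z))
    # Project A onto the sketched basis: B = Q^T A is small (k x n_cols).
    B = rmatvec(Q).T
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank]

# Stand-in for the tree-based operator: a dense rank-5 matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 300))
U, s, Vt = randomized_svd(lambda X: A @ X, lambda X: A.T @ X, n_cols=300, rank=5)
```

For PCA the operator would normally be centered first; that step is left out of this sketch.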
-
Wow, this is amazing @hanbin973! I would love to hear more about this.
-
Just to note that, as I understand it, this is being worked up into the PR at #3008.
-
Hi everyone. Thank you for answering my question on #2882. Here, I will elaborate on the recent work that led to those questions. I'm not sure how to write TeX on GitHub, so I apologize for simply pasting a screenshot.
I recently discovered an interesting representation of quantitative traits using ARGs. Assuming an additive model, the sum over sites can be written as a sum over edges. Here, edges are actually bricks, in the sense that the sample descendants of an edge are constant along its span.

The inner summation over sites (p) is approximately a Gaussian random variable. The variance and the mean of these Gaussian variables appear as functions of edge length, edge span, mutation rate of edge/site, and effect size of sites. The exact formula can be obtained using Lyapunov/Lindeberg-Feller style conditions. Furthermore, these random variables are mutually independent due to the infinite-sites assumption. I will share the proofs as soon as I figure out how to write TeX in this place (or maybe just link arXiv when I'm done).
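Since the screenshot is not reproduced here, the rewriting of the site sum as an edge sum can be sketched as follows. The notation is assumed for illustration and is not taken from the original formula: y_i is the additive trait value of sample i, \beta_p the effect size of site p, G_{ip} the genotype, A_{ie} = 1 if sample i descends from edge e (0 otherwise), and M_{ep} = 1 if the mutation at site p fell on edge e. Under infinite sites each site carries at most one mutation, so G_{ip} = \sum_e A_{ie} M_{ep}.

```latex
y_i \;=\; \sum_{p} \beta_p \, G_{ip}
    \;=\; \sum_{p} \beta_p \sum_{e} A_{ie} \, M_{ep}
    \;=\; \sum_{e} A_{ie} \, u_e ,
\qquad
u_e \;=\; \sum_{p \in \operatorname{span}(e)} \beta_p \, M_{ep} .
```

Here u_e plays the role of the inner summation over sites that is described above as approximately Gaussian.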
The bottom line is that the sample-edge matrix A behaves like the genotype matrix, and the random variables described above behave like random effects coefficients in GCTA-like models. Since SNPs and (bricked) edges can be identified, this seems to be a more general theory that subsumes previous SNP-random effects models. After centering the random effects to have mean zero, it also includes a fixed effects portion, which is present only if the mutation rate varies within a site. I think this has some profound connection to non-neutral variants that warrants further discussion.

Given a bricked tree sequence (using the ldgm package by @awohns), I have a proposal for performing certain matrix multiplications over all edges and samples. Below is an example of counting the number of descendants of each (bricked) edge.

I guess the time complexity is something like O(num_edges * log(num_samples) + num_edges).
Maybe these calculations don't really need bricks and can be done on vanilla tree sequences?
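The descendant-counting idea can be illustrated on a single tree with plain NumPy. This is a simplified sketch, not the example from the post and not the ldgm bricking. It assumes a hypothetical `parent` array in which every child has a smaller index than its parent, the root has parent -1, samples are nodes 0..num_samples-1, and each edge is identified by its child node. Counting descendants is the product A^T 1, and the companion function computes A v, both without forming the sample-edge matrix A.

```python
import numpy as np

def edge_descendant_counts(parent, num_samples):
    """Number of sample descendants below each node (A^T applied to all ones).

    counts[u] is the count for the edge above node u; the root entry is the
    total number of samples and corresponds to no edge.
    """
    counts = np.zeros(len(parent), dtype=np.int64)
    counts[:num_samples] = 1
    # Children have smaller indices, so counts[u] is final when we reach u.
    for u in range(len(parent)):
        if parent[u] != -1:
            counts[parent[u]] += counts[u]
    return counts

def sample_edge_matvec(parent, num_samples, v):
    """A @ v: for each sample, sum v over the edges on its path to the root."""
    acc = np.zeros(len(parent))
    # Walk from the root downwards, accumulating edge values along each path.
    for u in range(len(parent) - 1, -1, -1):
        if parent[u] != -1:
            acc[u] = v[u] + acc[parent[u]]
    return acc[:num_samples]

# Four samples (0-3), internal nodes 4 and 5, root 6.
parent = np.array([4, 4, 5, 5, 6, 6, -1])
counts = edge_descendant_counts(parent, num_samples=4)
path_sums = sample_edge_matvec(parent, 4, np.array([1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 0.0]))
```

Each function is a single pass over the nodes of one tree. Extending this across a tree sequence, where edges come and go between trees, is the part the proposal above is about.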
On the practical side, my proposal is that testing all edges, instead of observed SNPs, is the right way to do GWAS. This is similar to Zhang et al. (https://www.nature.com/articles/s41588-023-01379-x), but more exhaustive, because Zhang et al. test only a subset of edges, sampled by putting mutations on edges. Using graph-based linear algebra would be much faster. The model also argues that the principal component analysis used for population-structure adjustment should be performed on the matrix A, not the genotype matrix G.