randomized svd draft #3008
base: main
Conversation
@petrelharp Here's the code.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #3008      +/-   ##
==========================================
- Coverage   89.82%   87.07%    -2.75%
==========================================
  Files          29       11       -18
  Lines       31986    24666     -7320
  Branches     6192     4556     -1636
==========================================
- Hits        28730    21478     -7252
+ Misses       1859     1824       -35
+ Partials     1397     1364       -33
```

Flags with carried forward coverage won't be shown.
python/tskit/trees.py
Outdated
```python
x = individual_idx_sparray(ts.num_individuals, cols).dot(x)
x = sample_individual_sparray(ts).dot(x)
x = ts.genetic_relatedness_vector(W=x, windows=windows, mode="branch", centre=False)
x = sample_individual_sparray(ts).T.dot(x)
x = individual_idx_sparray(ts.num_individuals, rows).T.dot(x)
```
I think this assumes that all individuals' nodes are samples. Note that we can use the `nodes` argument to `genetic_relatedness_vector` to get an arbitrary list of (possibly non-sample) nodes; why not just use that? So, I think we can do something like this:

```python
ij = np.vstack([[n, k] for k, i in enumerate(individuals) for n in self.individual(i).nodes])
sample_list = ij[:, 0]
indiv_index = ij[:, 1]
x = ts.genetic_relatedness_vector(W=x, ..., nodes=sample_list)
x = np.bincount(indiv_index, x)
```

This also gets rid of the `scipy.sparse` usage.
```python
x = ts.genetic_relatedness_vector(W=x, ..., nodes=sample_list)
```

should probably be

```python
x = ts.genetic_relatedness_vector(W=x[indiv_index], ..., nodes=sample_list)
```

to expand the array of individuals to an array of nodes, I think?
hah, yes - good catch!
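To make the node/individual bookkeeping in the exchange above concrete, here is a self-contained numpy sketch. The individual-to-node map and `grm_nodes` (a random symmetric stand-in for what `genetic_relatedness_vector` computes) are illustrative inventions, not the PR's code: weights are expanded from individuals to their nodes via fancy indexing, the node-level operator is applied, and `np.bincount` collapses the result back to individuals.

```python
import numpy as np

# Hypothetical setup: 3 diploid individuals owning nodes 0..5.
rng = np.random.default_rng(0)
nodes_per_individual = [[0, 1], [2, 3], [4, 5]]

ij = np.vstack(
    [[n, k] for k, nodes in enumerate(nodes_per_individual) for n in nodes]
)
sample_list = ij[:, 0]   # node ids, one row per node
indiv_index = ij[:, 1]   # which individual each node belongs to

G = rng.normal(size=(6, 6))
grm_nodes = G @ G.T      # symmetric stand-in for the node-level GRM operator

x = rng.normal(size=3)                                # one weight per individual
y_nodes = grm_nodes @ x[indiv_index]                  # expand to nodes, apply operator
y_indiv = np.bincount(indiv_index, weights=y_nodes)   # collapse back to individuals
```

The round trip is equivalent to sandwiching the node GRM between a node-by-individual indicator matrix and its transpose, which is exactly what the `scipy.sparse` version was doing.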
This looks great! Very elegant. I think probably we ought to include a … So, how about the signature is like …

and: …

Note that we could be getting PCs for non-sample nodes (since an individual's nodes need not be samples); I haven't thought through whether the values you get are correct or informative. My guess is that maybe they are? But we need a "user beware" note for this.
python/tskit/trees.py
Outdated
```python
x = individual_idx_sparray(ts.num_individuals, rows).T.dot(x)
x = self.genetic_relatedness_vector(W=x[sample_individuals], windows=windows, mode="branch", centre=False, nodes=samples)
bincount_fn = lambda w: np.bincount(sample_individuals, w)
x = np.apply_along_axis(bincount_fn, axis=0, arr=x)  # I think it should be axis=1, but axis=0 gives the correct values -- why?
```
The matvec is sometimes GRM * matrix, so `x` is often a matrix rather than a vector. `np.bincount` only works for 1-dimensional weights, so I used `np.apply_along_axis` and a `lambda` to vectorize `np.bincount`.
The comment I left after `#` looks like mostly a convention issue in that function. When `axis=0`, the columns are separately retrieved from the array; when `axis=1`, the rows are retrieved.
Agree, that seems confusing but maybe makes sense after all?
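A tiny self-contained demo of the `axis=0` convention discussed above (the node-to-individual map and values here are made up for illustration): with `axis=0`, `np.apply_along_axis` hands each *column* of the array to the function, which is what we want when each column is a separate weight vector.

```python
import numpy as np

sample_individuals = np.array([0, 0, 1, 1, 2, 2])  # node -> individual map
x = np.arange(12, dtype=float).reshape(6, 2)       # 6 nodes, 2 weight columns

# np.bincount only accepts 1-D weights, so apply it column-by-column.
bincount_fn = lambda w: np.bincount(sample_individuals, weights=w)
out = np.apply_along_axis(bincount_fn, axis=0, arr=x)
# out has one row per individual; each column is the per-individual sum
# of the matching input column.
```

So despite the name, `axis=0` here means "collapse along the node axis", i.e. operate on columns, matching the behaviour observed in the review comment.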
python/tskit/trees.py
Outdated
```python
individuals: np.ndarray = None,
centre: bool = True,
windows: list = None,
random_state: np.random.Generator = None,
```
Usually we just pass in a `seed`; any objections to doing that, instead?
I changed the option from `random_state` to `random_seed`, following msprime.
Ah, sorry - one more thing - does this work with windows? I think the way to do the windows would be something like …

Basically - get it to work in the case where …
A simple test case for the … Because of the randomness of the algorithm, the correlation is not exactly 1, although it's nearly 1, like 0.99995623-ish.
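A minimal numpy-only illustration of the kind of check described above, independent of tskit: run a randomized SVD (Gaussian sketch plus power iterations) on a synthetic symmetric matrix with a known spectrum, and verify the result is close to the truth. All names, sizes, and the spectrum here are illustrative, not taken from the PR.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k, p, q = 200, 5, 10, 3                 # size, rank, oversampling, power iterations
U_true, _ = np.linalg.qr(rng.normal(size=(n, n)))
s = 1.0 / np.arange(1, n + 1)              # known singular values
A = (U_true * s) @ U_true.T                # symmetric test matrix with spectrum s

Q = rng.normal(size=(n, k + p))
for _ in range(q + 1):
    Q, _ = np.linalg.qr(A @ Q)             # power-iterated range finder
U_hat, D, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
U = (Q @ U_hat)[:, :k]
A_approx = U @ np.diag(D[:k]) @ Vt[:k]

err = np.linalg.norm(A - A_approx, 2)      # close to s[k], the optimal rank-k error
```

Because the sketch is random, the recovered quantities are only approximately equal to the truth, which is why tolerance-based comparisons (rather than exact equality) are the right shape for such tests.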
I just noticed that …
Check results for two windows.
python/tskit/trees.py
Outdated
```
@@ -8593,138 +8593,188 @@ def genetic_relatedness_vector(
        return out

    def pca(
        self,
```
I rearranged these to better match other methods (e.g., `windows` always comes first, so I had it first after `n_components`).
python/tskit/trees.py
Outdated
```python
API partially adopted from `scikit-learn`:
https://scikit-learn.org/dev/modules/generated/sklearn.decomposition.PCA.html
self,
n_components: int = 10,
```
perhaps we should not have a default?
python/tskit/trees.py
Outdated
```python
def _rand_pow_range_finder(
    operator: Callable,
```
linting complains about `Callable` for some reason
```python
C = operator(Q).T
U_hat, D, V = np.linalg.svd(C, full_matrices=False)
U = Q @ U_hat
return U[:, :rank], D[:rank], V[:rank]

def _genetic_relatedness_vector_individual(
```
these changes are all automatic linting
```python
x = np.apply_along_axis(bincount_fn, axis=0, arr=x)
x = x - x.mean(axis=0) if centre else x  # centering within index in cols

return x

def _genetic_relatedness_vector_node(
```
same: automatic linting
Okay; here I've made a good start at the tests. I think everything is working fine; the tests are not passing because (I think) of numerical tolerance. I could just set the numerical tolerance to like 1e-4 and they'd pass, but I think this is flagging a bigger issue - how do we tell we're getting good answers? I ask because currently the tests pass (at default tolerances) for small numbers of samples but not for 30 samples; if I increase …

We'd like this to be not a can of worms; I think our goal is to have something that is good enough, and forwards-compatible for an improved method in the future. Notes: …

TODO: …
Just a quick note that I'd be very much in favour of returning a dataclass here rather than a tuple, so that the option of returning more information about convergence etc. is open.
There's an adaptive rangefinder algorithm described in Halko et al. (https://arxiv.org/pdf/0909.4061, Algo 4.2). I don't see it implemented in scikit-learn (https://scikit-learn.org/dev/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD). I like Jerome's idea to return a class instead of the result. There's an intermediate matrix …
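For reference, here is a rough sketch of that adaptive rangefinder (Halko et al. 2011, Algorithm 4.2): grow an orthonormal basis `Q` one Gaussian sample at a time, stopping once the last `r` residual vectors all fall below a tolerance-derived threshold. This is an illustrative reading of the paper, not the PR's implementation; the function name and defaults are invented.

```python
import numpy as np

def adaptive_range_finder(A, tol, r=10, seed=None):
    """Grow an orthonormal basis for the range of A until the last r
    residual sample vectors are all below the tolerance threshold."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Y = [A @ rng.normal(size=n) for _ in range(r)]   # initial residual draws
    Q = np.zeros((m, 0))
    threshold = tol / (10 * np.sqrt(2 / np.pi))      # paper's failure-prob. scaling
    j = 0
    while max(np.linalg.norm(y) for y in Y[j:j + r]) > threshold:
        y = Y[j] - Q @ (Q.T @ Y[j])      # re-orthogonalise for stability
        q = y / np.linalg.norm(y)
        Q = np.column_stack([Q, q])
        y_new = A @ rng.normal(size=n)   # fresh draw, orthogonalised to Q
        Y.append(y_new - Q @ (Q.T @ y_new))
        for i in range(j + 1, j + r):    # update the pending residuals
            Y[i] = Y[i] - q * (q @ Y[i])
        j += 1
    return Q
```

The appeal for a `pca` API is that the caller specifies an error tolerance instead of guessing `n_components` up front, at the cost of one operator application per accepted basis vector.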
Not a class object yet, but I added `random_sketch` to the input/output.
Description

A draft of randomized principal component analysis (PCA) using `TreeSequence.genetic_relatedness_vector`. The implementation contains `scipy.sparse`, which should eventually be removed. This part of the code is only used when collapsing a `#sample * #sample` GRM into a `#individual * #individual` matrix, so it will not be difficult to replace with pure numpy.

The API was partially taken from scikit-learn.
To add some details, `iterated_power` is the number of power iterations in the range finder of the randomized algorithm. The error of the SVD decreases exponentially as a function of this number. The effect of power iteration is profound when the eigenspectrum of the matrix decays slowly, which seems to be the case for tree sequence GRMs in my experience.

`indices` specifies the individuals to be included in the PCA, although decreasing the number of individuals does not meaningfully reduce the amount of computation.
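The claim about power iteration can be seen in a small numpy experiment (sizes and the spectrum below are illustrative, not measured on tree sequences): on a matrix whose singular values decay slowly, extra power iterations sharply reduce the range finder's error toward the optimal low-rank error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 300, 10
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
s = 1.0 / np.sqrt(np.arange(1, n + 1))   # slowly decaying spectrum
A = (U * s) @ U.T                        # symmetric matrix with spectrum s

def range_finder_error(iterated_power):
    """Spectral-norm error of the power-iterated randomized range finder."""
    Q = rng.normal(size=(n, k + 5))      # small oversampling
    for _ in range(iterated_power + 1):
        Q, _ = np.linalg.qr(A @ Q)
    return np.linalg.norm(A - Q @ (Q.T @ A), 2)

errs = {q: range_finder_error(q) for q in (0, 2, 8)}
# errs shrinks as iterated_power grows, approaching the optimal
# rank-(k+5) error s[k + 5].
```

With a fast-decaying spectrum the `iterated_power=0` sketch is already accurate, which is why the parameter matters mainly in the slow-decay regime described above.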