randomized svd draft #3008
@petrelharp Here's the code. |
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #3008 +/- ##
==========================================
- Coverage 89.82% 87.07% -2.75%
==========================================
Files 29 11 -18
Lines 31986 24666 -7320
Branches 6192 4556 -1636
==========================================
- Hits 28730 21478 -7252
+ Misses 1859 1824 -35
+ Partials 1397 1364 -33
This looks great! Very elegant. I think probably we ought to include a So, how about the signature is like
and:
Note that we could be getting PCs for non-sample nodes (since individual's nodes need not be samples); I haven't thought through whether the values you get are correct or informative. My guess is that maybe they are? But we need a "user beware" note for this? |
Ah, sorry - one more thing - does this work with I think the way to do the windows would be something like
Basically - get it to work in the case where |
A simple test case for the
Because the algorithm is randomized, the correlation is not exactly 1, but it is very close (around 0.99995623). |
I just noticed that |
Check results for two windows.
|
I made a pass through the docs. We need to add |
Co-authored-by: Peter Ralph <[email protected]>
Looks good to me. I think we need to tidy up the lint and get tests passing next so we can see how coverage is doing?
python/tskit/trees.py
Outdated
samples, sample_individuals = (
    ij[:, 0],
    ij[:, 1],
)  # sample node index, individual of those nodes
Putting comments at the end of lines is causing them to get broken by Black. Better to put the comments on the line immediately above.
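For illustration, one way the assignment could be restructured so that each comment sits on its own line above the code it describes. This is only a sketch: the `ij` array here is a made-up stand-in for the (sample node, individual) pairs in the real code.

```python
import numpy as np

# Hypothetical stand-in for the (sample node, individual) index pairs.
ij = np.array([[0, 0], [1, 0], [2, 1], [3, 1]])

# sample node index
samples = ij[:, 0]
# individual of those nodes
sample_individuals = ij[:, 1]
```

Written this way, Black has no trailing comment to move when it reflows the statement.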
python/tskit/trees.py
Outdated
    The principal component factors. Columns are orthogonal, with one entry per sample
    or individual (see :meth:`pca <.TreeSequence.pca>`).
    """
    eigen_values: np.ndarray
eigenvalues is one word, isn't it?
if np.allclose(x, 0):
    r = 1.0
else:
    r = np.mean(x / y)
This is not right, as here we want r to be +/-1, I think?
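A possible way to get a sign of exactly +/-1, sketched under the assumption that `x` and `y` are meant to be the same vector up to orientation. The helper name `relative_sign` is hypothetical and not part of tskit.

```python
import numpy as np


def relative_sign(x, y):
    """Return +1.0 or -1.0 so that x is approximately r * y (hypothetical helper)."""
    if np.allclose(x, 0):
        # A zero vector has no orientation; pick +1 by convention.
        return 1.0
    # The sign of the inner product says whether the vectors point the same
    # way, and unlike x / y it does not divide by zero entries of y.
    return 1.0 if np.dot(x, y) >= 0 else -1.0


x = np.array([0.6, -0.8, 0.0])
y = -x
r = relative_sign(x, y)
```

Using the inner product rather than `np.mean(x / y)` also avoids a division by zero when `y` has zero entries.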
It looks like the things to do here are:
|
Bumping this one - we want to get ts.pca implemented and released as soon as we can. What's left to do here @hanbin973? Can we help with anything to get it over the line? |
It's just the test codes that are missing. We have to make the tests pass. I told @petrelharp that I will work on it, but well it didn't go as planned :( |
I don't think this works if we use it on a tree sequence where the samples aren't 0,...,n. Trying this out on the SARS-CoV-2 data, I got:
but if I first simplify (ts = ts.simplify()) so that the samples are 0 to n it runs fine. Just needs a simple test for this case, I'd imagine. |
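A minimal sketch of one way to avoid assuming that sample IDs are contiguous: build a lookup from node ID to row index. The sample IDs, `num_nodes`, and `query_nodes` below are invented for illustration, and this is not necessarily how the PR ends up handling it.

```python
import numpy as np

# Hypothetical sample node IDs that are not 0, ..., n - 1.
samples = np.array([3, 7, 8, 12])
num_nodes = 15

# Lookup table from node ID to row index; -1 marks non-sample nodes.
node_to_row = np.full(num_nodes, -1, dtype=np.int64)
node_to_row[samples] = np.arange(len(samples))

# Node IDs as they might appear in, say, an individual's node list.
query_nodes = np.array([8, 3, 12])
rows = node_to_row[query_nodes]
```

A test along these lines would run the method on an unsimplified tree sequence and compare against the simplified one.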
Few minor comments on return value interface
python/tskit/trees.py
Outdated
:param np.ndarray range_sketch: Sketch matrix for each window. Default is None.
:return: A :class:`.PCAResult` object, containing estimated principal components,
    eigenvalues, and other information.
    The principal component loadings are in U
This last bit is out of date now.
I'm always confused by what's a loading and what's a loading score (or factor). U should be the score. Will work on it.
    """
    factors: np.ndarray
We need to define the dimensions here. Currently the first dimension is samples/individuals and the second is num_components. Is there a strong reason for doing it this way? I expected it to be (num_components, num_samples), as you usually want to access all the values for a given component together?
Aha, it's following scikit learn. That's an excellent guide to follow - let's just do what scikit learn does and make our API as compatible with them as possible? We should document this as a stated goal.
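To make the layout concrete, a small sketch with a made-up `factors` array of shape (num_samples, num_components), which matches the shape scikit-learn returns from `PCA.fit_transform`. The values are placeholders, not real PCA output.

```python
import numpy as np

num_samples, num_components = 6, 2
# Hypothetical factors array laid out as (num_samples, num_components).
factors = np.arange(num_samples * num_components, dtype=float).reshape(
    num_samples, num_components
)

# All values for the first component: one entry per sample.
pc1 = factors[:, 0]
# Both component values for a single sample.
sample0 = factors[0, :]
# Transposing gives the (num_components, num_samples) view if preferred.
by_component = factors.T
```

So pulling out a whole component is a column slice, and the other orientation is a transpose away.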
I successfully ran this on a 2.5M sample SARS-CoV-2 ARG. Took about 30 seconds, and seemed to converge OK (at least from my cursory check using the recommended approach). A summary of the results is here: jeromekelleher/sc2ts-paper#372 |
python/tskit/trees.py
Outdated
@@ -8779,17 +8804,16 @@ def _rand_pow_range_finder(
         """
         Algorithm 9 in https://arxiv.org/pdf/2002.01387
         """
-        assert num_vectors >= rank > 0, "num_vectors should be larger than rank"
+        assert num_vectors >= rank > 0, "num_vectors should not be smaller than rank"
I changed the words to match the math; is this right?
Good point. geq/leq are tricky.
         else:
             Q = range_sketch
         for _ in range(depth):
             Q = np.linalg.qr(Q).Q
             Q = operator(Q)
         Q = np.linalg.qr(Q).Q
-        return Q[:, :rank]
+        return Q
To pass Q back into the method (losslessly) we need to pass the whole thing, not just the top rank columns.
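As a rough standalone illustration of why the full Q matters, here is a sketch of a power-iteration range finder that returns every column, so the result can be fed back in as `range_sketch` to continue iterating. The function mirrors the names in the snippet above but is not the tskit implementation; the dense test matrix `A`, its spectrum, and the sizes are assumptions made for the example.

```python
import numpy as np


def rand_pow_range_finder(
    operator, num_rows, rank, depth, num_vectors, rng, range_sketch=None
):
    # Hypothetical standalone version; names mirror the snippet under review.
    assert num_vectors >= rank > 0, "num_vectors should not be smaller than rank"
    if range_sketch is None:
        Q = rng.normal(size=(num_rows, num_vectors))
    else:
        Q = range_sketch
    for _ in range(depth):
        Q = np.linalg.qr(Q)[0]
        Q = operator(Q)
    Q = np.linalg.qr(Q)[0]
    # Return every column so the result can be fed back in as range_sketch.
    return Q


rng = np.random.default_rng(42)
n, rank, num_vectors = 50, 3, 8
# Symmetric positive definite test matrix with a slowly decaying spectrum.
V = np.linalg.qr(rng.normal(size=(n, n)))[0]
eigenvalues = 1.0 / np.sqrt(np.arange(1, n + 1))
A = (V * eigenvalues) @ V.T


def operator(X):
    return A @ X


Q1 = rand_pow_range_finder(operator, n, rank, 2, num_vectors, rng)
# Continue iterating from the previous basis instead of starting over.
Q2 = rand_pow_range_finder(operator, n, rank, 2, num_vectors, rng, range_sketch=Q1)
# Rayleigh-Ritz step: eigenvalue estimates from the small projected matrix.
estimates = np.linalg.eigvalsh(Q2.T @ A @ Q2)[::-1][:rank]
```

Truncating to `Q[:, :rank]` before returning would throw away the oversampling columns, so a restart from that basis would no longer be equivalent to simply running with a larger `depth`.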
Okay - I've put in a bunch of the testing code for correct arguments, etcetera. Still TODO:
One issue I've turned up along the way is that the Have we convinced ourselves that the default I also refactored the code to randomly generate I added an |
Also - I wonder if a better name for Oh, I see that scikit-learn says |
I'd vote for compatible with scikit second, compatible with our other APIs first. |
What is the exact failure about? Is it the eigenvalues not matching or the factor scores not matching? In terms of the factor scores, I think it's better to compare |
The failure is that the two answers are Definitely Not the Same. Clicking on the "Tests / ..." link above, or running locally, we get
To run this locally, do
I hear what you're saying about comparing But - I don't think that's the problem - inserting a
|
Description
A draft of randomized principal component analysis (PCA) using TreeSequence.genetic_relatedness_vector. The implementation contains scipy.sparse, which should eventually be removed. This part of the code is only used when collapsing a #sample * #sample GRM into a #individual * #individual matrix. Therefore, it will not be difficult to replace with pure numpy.
The API was partially taken from scikit-learn.
To add some details, iterated_power is the number of power iterations in the range finder in the randomized algorithm. The error of SVD decreases exponentially as a function of this number. The effect of power iteration is profound when the eigenspectrum of the matrix decays slowly, which seems to be the case for tree sequence GRMs in my experience.
indices specifies the individuals to be included in the PCA, although decreasing the number of individuals does not meaningfully reduce the amount of computation.
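For what it's worth, a sketch of how the sample-to-individual collapse might look in pure numpy, assuming the collapse sums over each individual's sample nodes (the real code may weight or average differently). The mapping `sample_individuals`, the helper names `expand` and `collapse`, and the dense stand-in matrix `G` are all made up for the example; in practice the GRM is never formed explicitly.

```python
import numpy as np

# Hypothetical mapping: sample k belongs to individual sample_individuals[k].
sample_individuals = np.array([0, 0, 1, 1, 2, 2])
num_individuals = 3


def expand(x):
    # Individuals -> samples: copy each individual's row to its sample nodes.
    return x[sample_individuals]


def collapse(y):
    # Samples -> individuals: sum rows belonging to the same individual.
    out = np.zeros((num_individuals,) + y.shape[1:])
    np.add.at(out, sample_individuals, y)
    return out


# Check against the explicit indicator-matrix product on a small dense example.
rng = np.random.default_rng(1)
B = rng.normal(size=(6, 6))
G = B @ B.T  # stand-in for a sample-by-sample relatedness matrix
Z = np.zeros((6, num_individuals))
Z[np.arange(6), sample_individuals] = 1.0
x = rng.normal(size=(num_individuals, 2))
result = collapse(G @ expand(x))
expected = Z.T @ G @ Z @ x
```

Here `expand` and `collapse` play the roles of multiplying by the sparse indicator matrix and its transpose, which is what the sparse dependency was presumably being used for.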