
randomized svd draft #3008

Draft · wants to merge 37 commits into base: main
Conversation

hanbin973
Contributor

Description

A draft of randomized principal component analysis (PCA) using TreeSequence.genetic_relatedness_vector. The implementation uses scipy.sparse, which should eventually be removed.
That part of the code is only used when collapsing a (num samples x num samples) GRM into a (num individuals x num individuals) matrix,
so it will not be difficult to replace with pure numpy.

The API was partially taken from scikit-learn.

To add some details, iterated_power is the number of power iterations in the range finder of the randomized algorithm. The error of the SVD decreases exponentially as a function of this number.
The effect of power iteration is most pronounced when the eigenspectrum of the matrix decays slowly, which seems to be the case for tree sequence GRMs in my experience.

indices specifies the individuals to be included in the PCA, although decreasing the number of individuals does not meaningfully reduce the amount of computation.
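To illustrate the idea (this is a sketch, not the PR's actual code): the core of the randomized algorithm is a power-iterated range finder followed by an exact eigendecomposition in the small subspace. The names `iterated_power` and `num_oversamples` below echo the draft API, but the helper itself is hypothetical, written for a symmetric PSD operator (like a GRM) given only as a mat-vec product:

```python
import numpy as np

def rand_eigh(matvec, dim, rank, num_oversamples=10, iterated_power=3, random_seed=None):
    # Sketch of a randomized eigendecomposition for a symmetric PSD
    # operator available only as a mat-vec product (e.g. a GRM).
    # Hypothetical helper; names mirror the draft API.
    rng = np.random.default_rng(random_seed)
    k = rank + num_oversamples
    # Range finder: extra power iterations sharpen the subspace
    # when the eigenspectrum decays slowly.
    Q = rng.standard_normal((dim, k))
    for _ in range(iterated_power):
        Q = np.linalg.qr(matvec(Q))[0]
    Q = np.linalg.qr(matvec(Q))[0]
    # Exact solve in the small k-dimensional subspace.
    B = Q.T @ matvec(Q)
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1][:rank]
    return Q @ V[:, order], w[order]
```

For an exactly low-rank operator this recovers the leading eigenpairs essentially exactly; for slowly decaying spectra, each extra power iteration shrinks the error geometrically, which is why iterated_power matters for tree sequence GRMs.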

@hanbin973
Contributor Author

@petrelharp Here's the code.


codecov bot commented Oct 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.07%. Comparing base (76ab046) to head (64f07a4).
Report is 77 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3008      +/-   ##
==========================================
- Coverage   89.82%   87.07%   -2.75%     
==========================================
  Files          29       11      -18     
  Lines       31986    24666    -7320     
  Branches     6192     4556    -1636     
==========================================
- Hits        28730    21478    -7252     
+ Misses       1859     1824      -35     
+ Partials     1397     1364      -33     
Flag Coverage Δ
c-tests 86.69% <ø> (ø)
lwt-tests 80.78% <ø> (ø)
python-c-tests 89.05% <ø> (ø)
python-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

see 18 files with indirect coverage changes


@petrelharp
Contributor

This looks great! Very elegant. I think probably we ought to include a samples argument, though? For consistency, but also since the tree sequence represents phased data, and so it's actually informative to look at the PCs of maternally- and paternally-inherited chromosomes separately.

So, how about the signature is like

def pca(samples=None, individuals=None, ...)

and:

  • the default is equivalent to samples=ts.samples(), individuals=None
  • you can't have both samples and individuals specified
  • if individuals is a list of individual IDs then it does as in the code currently
  • otherwise, it just skips the "sum over individuals" step

Note that we could be getting PCs for non-sample nodes (since an individual's nodes need not be samples); I haven't thought through whether the values you get are correct or informative. My guess is that maybe they are? But we need a "user beware" note for this?

@petrelharp petrelharp marked this pull request as draft October 4, 2024 01:46
@petrelharp
Contributor

Ah, sorry - one more thing - does this work with windows? (It looks like not?)

I think the way to do the windows would be something like

drop_windows = windows is None
if drop_windows:
    windows = [0, self.sequence_length]

# then do stuff; with these windows genetic_relatedness will always return an array where the first dimension is "window";
# so you can operate on each slice separately

if drop_windows:
    # get rid of the first dimension in the output

Basically - get it to work in the case where windows are specified (ie not None) and then we can get it to have the right behavior.
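The pattern above, fleshed out as a self-contained numpy sketch (with a stand-in per-window function rather than genetic_relatedness_vector, and a made-up helper name):

```python
import numpy as np

def windowed_stat(per_window, sequence_length, windows=None):
    # Sketch of the drop_windows pattern: always compute with an
    # explicit window list so the per-window code only handles one
    # shape, then drop the leading "window" dimension at the end if
    # the caller passed windows=None.
    drop_windows = windows is None
    if drop_windows:
        windows = [0, sequence_length]
    out = np.stack(
        [per_window(left, right) for left, right in zip(windows[:-1], windows[1:])]
    )
    return out[0] if drop_windows else out
```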

@hanbin973
Contributor Author

A simple test case for the windows feature.

demography = msprime.Demography()
demography.add_population(name="A", initial_size=5_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="C")
ts = msprime.sim_ancestry(
    samples={"A": 500, "B": 500},
    sequence_length=1e6,
    recombination_rate=3e-8,
    demography=demography, 
    random_seed=12)
seq_length = ts.sequence_length

U, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2, seq_length])
U0, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2])
U1, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[seq_length/2, seq_length])

idx = 0 # idx is the idx-th principal component
# correlation instead of allclose because PCA is rotation symmetric
np.corrcoef(U[0][:,idx], U0[:,idx]), np.corrcoef(U[1][:,idx], U1[:,idx])

Because of the randomness of the algorithm, the correlation is not exactly 1, but it's very close - around 0.99996.

@hanbin973
Contributor Author

I just noticed that centre doesn't work with the nodes option. The new commit fixes this problem.

@hanbin973
Contributor Author

Check results for two windows.

demography = msprime.Demography()
demography.add_population(name="A", initial_size=5_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="C")
seq_length = 1e6
ts = msprime.sim_ancestry(
    samples={"A": 500, "B": 500},
    sequence_length=seq_length,
    recombination_rate=3e-8,
    demography=demography, 
    random_seed=12)

# for individuals
U, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2, seq_length])
U0, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2])
U1, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[seq_length/2, seq_length])

idx = 0 # idx is the idx-th principal component
# correlation instead of allclose because PCA is rotation symmetric
np.corrcoef(U[0][:,idx], U0[0][:,idx]), np.corrcoef(U[1][:,idx], U1[0][:,idx])

# for nodes
U, _ = ts.pca(iterated_power=5, random_seed=1, windows=[0, seq_length/2, seq_length])
U0, _ = ts.pca(iterated_power=5, random_seed=1, windows=[0, seq_length/2])
U1, _ = ts.pca(iterated_power=5, random_seed=1, windows=[seq_length/2, seq_length])

idx = 0 # idx is the idx-th principal component
# correlation instead of allclose because PCA is rotation symmetric
np.corrcoef(U[0][:,idx], U0[0][:,idx]), np.corrcoef(U[1][:,idx], U1[0][:,idx])

@petrelharp
Contributor

I made a pass through the docs. We need to add time_windows to the tests still, and see what's going on with the CI.

hanbin973 and others added 4 commits November 16, 2024 21:04
@jeromekelleher (Member) left a comment

Looks good to me. I think we need to tidy up the lint and get tests passing next so we can see how coverage is doing?

samples, sample_individuals = (
    ij[:, 0],
    ij[:, 1],
)  # sample node index, individual of those nodes
Member:
Putting comments at the end of lines is causing them to get broken by Black. Better to put the comments on the line immediately above.

The principal component factors. Columns are orthogonal, with one entry per sample
or individual (see :meth:`pca <.TreeSequence.pca>`).
"""
eigen_values: np.ndarray
Member:
eigenvalues is one word, isn't it?

if np.allclose(x, 0):
    r = 1.0
else:
    r = np.mean(x / y)
Contributor:
This is not right, as here we want r to be +/-1, I think?

@petrelharp
Contributor

It looks like the things to do here are:

  • get the tests working (right now they fail with FAILED tests/test_relatedness_vector.py::TestPCA::test_bad_windows - TypeError: pca() got an unexpected keyword argument 'n_components'
  • either remove for now the time_windows argument or write tests for it
  • write tests that exercise the individuals argument (or remove it)
  • write a test that uses range_sketch
  • write tests that exercise iterated_power and num_oversamples: probably, just something that checks whether setting these to bigger numbers still gets us (nearly) the same answer

@jeromekelleher
Member

Bumping this one - we want to get ts.pca implemented and released as soon as we can. What's left to do here @hanbin973? Can we help with anything to get it over the line?

@hanbin973
Contributor Author

> Bumping this one - we want to get ts.pca implemented and released as soon as we can. What's left to do here @hanbin973? Can we help with anything to get it over the line?

It's just the test code that's missing; we have to make the tests pass. I told @petrelharp that I would work on it, but, well, it didn't go as planned :(

@jeromekelleher
Member

I don't think this works if we use it on a tree sequence where the samples aren't 0,...,n. Trying this out on the SARS-CoV-2 data, I got:

/tmp/ipykernel_620501/3297030094.py in ?()
----> 1 pca = ts.pca(4, samples=ts.samples())

~/.local/lib/python3.10/site-packages/tskit/trees.py in ?(self, num_components, windows, samples, individuals, time_window, mode, centre, iterated_power, num_oversamples, random_seed, range_sketch)
   8856                 else:
   8857                     low = _f_low(arr=x, indices=indices, mode=mode, centre=centre, windows=this_window)
   8858                     return high - low
   8859 
-> 8860             U[i], D[i], _, Q[i], E[i] = _rand_svd(
   8861                 operator=_G,
   8862                 operator_dim=dim,
   8863                 rank=num_components,

~/.local/lib/python3.10/site-packages/tskit/trees.py in ?(operator, operator_dim, rank, depth, num_vectors, rng, range_sketch)
   8803             """
   8804             Algorithm 8 in https://arxiv.org/pdf/2002.01387
   8805             """
   8806             assert num_vectors >= rank > 0
-> 8807             Q = _rand_pow_range_finder(
   8808                 operator, operator_dim, num_vectors, depth, num_vectors, rng, range_sketch
   8809             )
   8810             C = operator(Q).T

~/.local/lib/python3.10/site-packages/tskit/trees.py in ?(operator, operator_dim, rank, depth, num_vectors, rng, range_sketch)
   8786             else:
   8787                 Q = range_sketch
   8788             for _ in range(depth):
   8789                 Q = np.linalg.qr(Q).Q
-> 8790                 Q = operator(Q)
   8791             Q = np.linalg.qr(Q).Q
   8792             return Q[:, :rank]

~/.local/lib/python3.10/site-packages/tskit/trees.py in ?(x)
   8852             def _G(x):
-> 8853                 high = _f_high(arr=x, indices=indices, mode=mode, centre=centre, windows=this_window)
   8854                 if time_window is None:
   8855                     return high
   8856                 else:

~/.local/lib/python3.10/site-packages/tskit/trees.py in ?(self, arr, indices, mode, centre, windows)
   8610         centre: bool = True,
   8611         windows = None,
   8612         ) -> np.ndarray:
   8613         x = arr - arr.mean(axis=0) if centre else arr
-> 8614         x = self._expand_indices(x, indices)
   8615         x = self.genetic_relatedness_vector(
   8616             W=x, windows=windows, mode=mode, centre=False, nodes=indices,
   8617         )[0]

~/.local/lib/python3.10/site-packages/tskit/trees.py in ?(self, x, indices)
   8596         x: np.ndarray,
   8597         indices: np.ndarray
   8598         ) -> np.ndarray:
   8599         y = np.zeros((self.num_samples, x.shape[1]))
-> 8600         y[indices] = x
   8601 
   8602         return y

IndexError: index 2482157 is out of bounds for axis 0 with size 2482157

but if I first simplify (ts = ts.simplify()) so that the samples are 0 to n, it runs fine. Just needs a simple test for this case, I'd imagine.
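One way the fix might look (a hypothetical helper, not the PR's code, assuming ts.samples() returns sorted node IDs): map node IDs to their positions in the sample list rather than using them as raw array indices.

```python
import numpy as np

def sample_index_map(sample_ids, nodes):
    # Map node IDs to their positions within the (sorted) sample
    # list, so that a (num_samples, k) weight array can be indexed
    # safely even when sample IDs are not the contiguous range
    # 0..n-1. Hypothetical sketch.
    sample_ids = np.asarray(sample_ids)
    idx = np.searchsorted(sample_ids, nodes)
    if np.any(idx >= len(sample_ids)) or not np.array_equal(sample_ids[idx], nodes):
        raise ValueError("all nodes must be samples")
    return idx
```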

@jeromekelleher (Member) left a comment

Few minor comments on return value interface

:param np.ndarray range_sketch: Sketch matrix for each window. Default is None.
:return: A :class:`.PCAResult` object, containing estimated principal components,
eigenvalues, and other information.
The principal component loadings are in U
Member:
This last bit is out of date now.

Contributor Author:
I'm always confused by what's a loading and what's a loading score (or factor). U should be the score. Will work on it.



"""
factors: np.ndarray
Member:
We need to define the dimensions here. Currently the first dimension is samples/individuals and the second is num_components. Is there a strong reason for doing it this way? I expected it to be (num_components, num_samples), as you usually want to access all the values for a given component together?

Member:
Aha, it's following scikit-learn. That's an excellent guide to follow - let's just do what scikit-learn does and make our API as compatible with theirs as possible? We should document this as a stated goal.

@jeromekelleher
Member

I successfully ran this on a 2.5M sample SARS-CoV-2 ARG. Took about 30 seconds, and seemed to converge OK (at least from my cursory check using the recommended approach). A summary of the results is here: jeromekelleher/sc2ts-paper#372

@@ -8779,17 +8804,16 @@ def _rand_pow_range_finder(
"""
Algorithm 9 in https://arxiv.org/pdf/2002.01387
"""
-    assert num_vectors >= rank > 0, "num_vectors should be larger than rank"
+    assert num_vectors >= rank > 0, "num_vectors should not be smaller than rank"
Contributor:
I changed the words to match the math; is this right?

Contributor Author:
Good point. geq/leq are tricky.

    else:
        Q = range_sketch
    for _ in range(depth):
        Q = np.linalg.qr(Q).Q
        Q = operator(Q)
    Q = np.linalg.qr(Q).Q
-   return Q[:, :rank]
+   return Q
Contributor:
To pass Q back into the method (losslessly) we need to pass the whole thing, not just the top rank columns.
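A sketch of the resumable version (assumed names, not the PR's code; the point is that the full num_vectors-column sketch is returned, and slicing down to rank is left to the caller):

```python
import numpy as np

def rand_pow_range_finder(operator, dim, num_vectors, depth, rng=None, range_sketch=None):
    # Power-iterated range finder. Returns the full
    # num_vectors-column orthonormal sketch so it can be passed
    # back in as range_sketch and iterated further losslessly.
    if range_sketch is None:
        Q = np.random.default_rng(rng).standard_normal((dim, num_vectors))
    else:
        Q = range_sketch
    for _ in range(depth):
        Q = np.linalg.qr(Q)[0]
        Q = operator(Q)
    Q = np.linalg.qr(Q)[0]
    # caller takes Q[:, :rank] when it needs the truncation
    return Q
```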

@petrelharp
Contributor

Okay - I've put in a bunch of the testing code for correct arguments, etcetera. Still TODO:

  • testing for the error bounds
  • figure out why individuals is not passing tests (see below)

One issue I've turned up along the way is that the individuals mode I think was not agreeing with our definition of genetic_relatedness: as described in the relatedness paper, relatedness between two sample sets is the average of the relatedness between pairwise combinations of samples. This is just a factor of 4 for diploids - but that's a discrepancy that will confuse people - and actually makes a difference for mixed ploidy (eg sex chromosomes). I think I've sorted this out correctly.

Have we convinced ourselves that the default iterated_power is a good value? I have a bit of a hard time figuring out what "good" means - like, to what tolerance do we need these eigenvectors?

I also refactored the code to randomly generate range_sketch outside of the loop over windows, because this makes (a) the testing-for-correct-inputs code up top cleaner, and (b) the _rand_pow_range_finder and _rand_svd functions simpler. This has a memory downside of having to have range_sketch for all windows, not just one -- but, this is something we're returning anyhow, so is not a big deal.

I added an individuals argument to the testing implementation, but this isn't matching; I've stared at it a bunch and don't know why. Oddly, ts.pca(samples=x) and ts.pca(individuals=x) match for haploid individuals, but neither matches the testing code.

@petrelharp
Contributor

Also - I wonder if a better name for iterated_power would be power_iterations?

Oh, I see that scikit-learn says iterated_power (and also n_components instead of num_components). Hm. I think that being compatible with them is good, but copying their bad API choices is not required? (Well, n_components is a fine API, but it isn't consistent with the rest of our API, which is num_X for things?)

@jeromekelleher
Member

I'd vote for compatible with scikit second, compatible with our other APIs first.

@hanbin973
Contributor Author

hanbin973 commented Mar 10, 2025

> Okay - I've put in a bunch of the testing code for correct arguments, etcetera. Still TODO:
>
>   • testing for the error bounds
>   • figure out why individuals is not passing tests (see below)
>
> One issue I've turned up along the way is that the individuals mode I think was not agreeing with our definition of genetic_relatedness: as described in the relatedness paper, relatedness between two sample sets is the average of the relatedness between pairwise combinations of samples. This is just a factor of 4 for diploids - but that's a discrepancy that will confuse people - and actually makes a difference for mixed ploidy (eg sex chromosomes). I think I've sorted this out correctly.
>
> Have we convinced ourselves that the default iterated_power is a good value? I have a bit of a hard time figuring out what "good" means - like, what tolerance do we need these eigenvectors to?
>
> I also refactored the code to randomly generate range_sketch outside of the loop over windows, because this makes (a) the testing-for-correct-inputs code up top cleaner, and (b) the _rand_pow_range_finder and _rand_svd functions simpler. This has a memory downside of having to have range_sketch for all windows, not just one -- but, this is something we're returning anyhow, so is not a big deal.
>
> I added an individuals argument to the testing implementation, but this isn't matching; I've stared at it a bunch and don't know why. Oddly, ts.pca( samples=x) and ts.pca( individuals=x) match for haploid individuals, but these don't match to the testing code.

What is the exact failure about? Is it the eigenvalues not matching or the factor scores not matching? In terms of the factor scores, I think it's better to compare U.T @ U to each other so that we avoid rotation and sign problems. Shall I change the code?

@petrelharp
Contributor

The failure is that the two answers are Definitely Not the Same. Clicking on the "Tests / ..." link above, or running locally, we get

FAILED tests/test_relatedness_vector.py::TestPCA::test_individuals[1-0-True] - AssertionError: 
Not equal to tolerance rtol=1e-07, atol=1e-08

Mismatched elements: 5 / 5 (100%)
Max absolute difference among violations: 5733.50077407
Max relative difference among violations: 0.60127089
 ACTUAL: array([3802.135878, 3157.318008, 2779.806231, 2273.337602, 1637.086583])
 DESIRED: array([9535.636652, 5611.718265, 4424.8184  , 4081.358172, 2913.365737])

To run this locally, do

python -m pytest tests/test_relatedness_vector.py::TestPCA::test_individuals[1-0-True] 

I hear what you're saying about comparing U.T @ U, but I think we do want to actually test the individual eigenvectors, not just that the low-dimensional approximation is right. (Or, maybe I don't understand your suggestion?) The testing code is robust to ordering of eigen-things and sign switches; however, if two eigenvalues are very close, then you're right, we could have problems with nonidentifiability that way. Please do have a look at the testing code, though, to see if you have a better suggestion?

But - I don't think that's the problem - inserting a print into the testing code shows that these are the eigenvalues:

[3837.56623916, 3364.17632569, 2843.38691979, 2419.60371452, 655.31546225]
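For reference, a column-wise sign-robust check like the one the testing code needs might look like this (a sketch with an invented name; it assumes eigenvalues are well separated, so column ordering already agrees and whole subspaces can't rotate):

```python
import numpy as np

def eigvec_columns_match(U1, U2, **tol):
    # True iff each column of U1 equals the corresponding column of
    # U2 up to a sign flip. Deliberately does not handle
    # near-degenerate eigenvalues, where the comparison is
    # nonidentifiable column-by-column.
    return all(
        np.allclose(a, b, **tol) or np.allclose(a, -b, **tol)
        for a, b in zip(U1.T, U2.T)
    )
```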
