randomized svd draft #3008

Draft · wants to merge 17 commits into main
Conversation

hanbin973

Description

A draft of randomized principal component analysis (PCA) using TreeSequence.genetic_relatedness_vector. The implementation currently depends on scipy.sparse, which should eventually be removed.
That part of the code is only used when collapsing a (#samples x #samples) GRM into a (#individuals x #individuals) matrix.
Therefore, it will not be difficult to replace with pure numpy.

The API was partially taken from scikit-learn.

To add some details: iterated_power is the number of power iterations used by the range finder in the randomized algorithm. The error of the SVD decreases exponentially as a function of this number.
The effect of power iteration is profound when the eigenvalue spectrum of the matrix decays slowly, which in my experience seems to be the case for tree sequence GRMs.

indices specifies the individuals to include in the PCA, although reducing the number of individuals does not meaningfully reduce the amount of computation.
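The randomized scheme with power iterations described above can be sketched in plain numpy. This is a hedged illustration of the Halko et al. approach, not the code in this PR; `rand_svd`, `matvec`, and the parameter names are placeholders:

```python
import numpy as np

def rand_svd(matvec, n, rank, iterated_power=5, num_oversamples=10, seed=None):
    """Randomized SVD of an implicit symmetric n x n operator A,
    given only the block product matvec(Q) == A @ Q.

    A sketch of the Halko et al. (2011) scheme, not the tskit code.
    """
    rng = np.random.default_rng(seed)
    # Oversampled Gaussian test matrix, orthonormalized.
    Q, _ = np.linalg.qr(rng.standard_normal((n, rank + num_oversamples)))
    # Power iterations: each pass damps the trailing part of the
    # spectrum, which matters most when eigenvalues decay slowly.
    for _ in range(iterated_power):
        Q, _ = np.linalg.qr(matvec(Q))
    # Project onto the captured subspace and take a small exact SVD.
    C = matvec(Q).T                  # (A Q)^T == Q^T A for symmetric A
    U_hat, D, V = np.linalg.svd(C, full_matrices=False)
    U = Q @ U_hat                    # lift back to the original space
    return U[:, :rank], D[:rank], V[:rank]
```

With a slowly decaying spectrum, increasing `iterated_power` trades extra matvec passes for exponentially better accuracy in the leading components.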

@hanbin973 (Author)

@petrelharp Here's the code.


codecov bot commented Oct 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.07%. Comparing base (76ab046) to head (791c7da).
Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3008      +/-   ##
==========================================
- Coverage   89.82%   87.07%   -2.75%     
==========================================
  Files          29       11      -18     
  Lines       31986    24666    -7320     
  Branches     6192     4556    -1636     
==========================================
- Hits        28730    21478    -7252     
+ Misses       1859     1824      -35     
+ Partials     1397     1364      -33     
Flag             Coverage Δ
c-tests          86.69% <ø> (ø)
lwt-tests        80.78% <ø> (ø)
python-c-tests   89.05% <ø> (ø)
python-tests     ?

Flags with carried forward coverage won't be shown.

see 18 files with indirect coverage changes

Comment on lines 8711 to 8715
x = individual_idx_sparray(ts.num_individuals, cols).dot(x)
x = sample_individual_sparray(ts).dot(x)
x = ts.genetic_relatedness_vector(W=x, windows=windows, mode="branch", centre=False)
x = sample_individual_sparray(ts).T.dot(x)
x = individual_idx_sparray(ts.num_individuals, rows).T.dot(x)
@petrelharp (Contributor) commented Oct 4, 2024

I think this assumes that all individuals' nodes are samples. Note that we can use the nodes argument to genetic_relatedness_vector to get an arbitrary list of (possibly non-sample) nodes; why not just use that?

So, I think we can do something like this:

ij = np.vstack([[n, k] for k, i in enumerate(individuals) for n in self.individual(i).nodes])
sample_list = ij[:, 0]
indiv_index = ij[:, 1]
x = ts.genetic_relatedness_vector(W=x, ..., nodes=sample_list)
x = np.bincount(indiv_index, x)

This also gets rid of the scipy.sparse.
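The node-to-individual collapse suggested here can be shown with a toy standalone numpy example (illustrative data only, not tskit code):

```python
import numpy as np

# Toy setup: six nodes belonging to three diploid individuals,
# two nodes each. indiv_index[j] is the individual that node j
# belongs to, playing the role of indiv_index above.
indiv_index = np.array([0, 0, 1, 1, 2, 2])
node_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Sum node-level values within each individual, replacing the
# sparse indicator-matrix products.
indiv_values = np.bincount(indiv_index, weights=node_values)
print(indiv_values)        # [ 3.  7. 11.]

# Going the other way (individual weights -> node weights) is plain
# fancy indexing: each individual's value is repeated for its nodes,
# which is why W=x[indiv_index] works in the follow-up below.
expanded = indiv_values[indiv_index]
```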

@hanbin973 (Author)

x = ts.genetic_relatedness_vector(W=x, ..., nodes=sample_list)

should instead be

x = ts.genetic_relatedness_vector(W=x[indiv_index], ..., nodes=sample_list)

to expand the array over individuals to an array over nodes, I think?

@petrelharp (Contributor)

hah, yes - good catch!

@petrelharp (Contributor)

This looks great! Very elegant. I think probably we ought to include a samples argument, though? For consistency, but also since the tree sequence represents phased data, and so it's actually informative to look at the PCs of maternally- and paternally-inherited chromosomes separately.

So, how about the signature is like

def pca(samples=None, individuals=None, ...)

and:

  • the default is equivalent to samples=ts.samples(), individuals=None
  • you can't have both samples and individuals specified
  • if individuals is a list of individual IDs then it does as in the code currently
  • otherwise, it just skips the "sum over individuals" step

Note that we could be getting PCs for non-sample nodes (since individual's nodes need not be samples); I haven't thought through whether the values you get are correct or informative. My guess is that maybe they are? But we need a "user beware" note for this?

@petrelharp petrelharp marked this pull request as draft October 4, 2024 01:46
x = individual_idx_sparray(ts.num_individuals, rows).T.dot(x)
x = self.genetic_relatedness_vector(W=x[sample_individuals], windows=windows, mode="branch", centre=False, nodes=samples)
bincount_fn = lambda w: np.bincount(sample_individuals, w)
x = np.apply_along_axis(bincount_fn, axis=0, arr=x) # I think it should be axis=1, but axis=0 gives the correct values why?
@hanbin973 (Author)

The matvec is sometimes GRM * matrix, so x is often a matrix rather than a vector. np.bincount only works with 1-dimensional weights, so I used np.apply_along_axis with a lambda to vectorize np.bincount.
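The vectorization trick being discussed can be reproduced with toy data (a standalone illustration, not the PR's code):

```python
import numpy as np

# Node -> individual map: four nodes, two diploid individuals.
sample_individuals = np.array([0, 0, 1, 1])
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])            # one column per probe vector

# np.bincount only accepts 1-D weights. With axis=0,
# np.apply_along_axis passes each *column* x[:, j] to the function
# (the 1-D slices along axis 0 are the columns), so per-node rows
# are summed into per-individual rows, column by column.
collapsed = np.apply_along_axis(
    lambda w: np.bincount(sample_individuals, weights=w), axis=0, arr=x
)
print(collapsed)    # [[ 3. 30.]
                    #  [ 7. 70.]]
```

This also explains the axis question in the quoted diff: `axis=0` selects columns because the 1-D slices taken along axis 0 run down each column.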

@hanbin973 (Author)

The behaviour I noted in the # comment looks like a convention of that function: with axis=0 the columns are retrieved from the array one at a time, and with axis=1 the rows are.

@petrelharp (Contributor)

Agree, that seems confusing but maybe makes sense after all?

individuals: np.ndarray = None,
centre: bool = True,
windows: list = None,
random_state: np.random.Generator = None,
@petrelharp (Contributor)

Usually we just pass in a seed; any objections to doing that instead?

@hanbin973 (Author)

I changed the option from random_state to random_seed following msprime.

@petrelharp (Contributor)

Ah, sorry - one more thing - does this work with windows? (It looks like not?)

I think the way to do the windows would be something like

drop_windows = windows is None
if drop_windows:
    windows = [0, self.sequence_length]

# then do stuff; with these windows genetic_relatedness will always return an array where the first dimension is "window";
# so you can operate on each slice separately

if drop_windows:
    # get rid of the first dimension in the output

Basically - get it to work in the case where windows are specified (ie not None) and then we can get it to have the right behavior.
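The drop-the-window-axis convention sketched above can be illustrated with a toy standalone function (`windowed_stat` and its dummy computation are placeholders, not tskit code):

```python
import numpy as np

def windowed_stat(x, windows=None, sequence_length=100.0):
    """Illustration of the convention: always compute with an explicit
    window list internally, then drop the leading "window" axis when
    the caller passed windows=None."""
    drop_windows = windows is None
    if drop_windows:
        windows = [0.0, sequence_length]
    num_windows = len(windows) - 1
    # Stand-in per-window computation; the first output dimension
    # always indexes windows at this point.
    out = np.stack([x * (k + 1) for k in range(num_windows)])
    if drop_windows:
        out = out[0]    # unwindowed call: return a plain array
    return out
```

With this shape contract, the core algorithm only ever handles the explicit-windows case, and the `windows=None` convenience is a thin wrapper around it.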

@hanbin973 (Author)

A simple test case for the windows feature.

demography = msprime.Demography()
demography.add_population(name="A", initial_size=5_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="C")
ts = msprime.sim_ancestry(
    samples={"A": 500, "B": 500},
    sequence_length=1e6,
    recombination_rate=3e-8,
    demography=demography, 
    random_seed=12)
seq_length = ts.sequence_length

U, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2, seq_length])
U0, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2])
U1, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[seq_length/2, seq_length])

idx = 0 # idx is the idx-th principal component
# correlation instead of allclose because PCA is rotation symmetric
np.corrcoef(U[0][:,idx], U0[:,idx]), np.corrcoef(U[1][:,idx], U1[:,idx])

Because the algorithm is randomized, the correlation is not exactly 1, but it is very close (around 0.99996).

@hanbin973 (Author)

I just noticed that centre doesn't work with the nodes option. The new commit fixes this problem.

@hanbin973 (Author)

Check results for two windows.

demography = msprime.Demography()
demography.add_population(name="A", initial_size=5_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="C")
seq_length = 1e6
ts = msprime.sim_ancestry(
    samples={"A": 500, "B": 500},
    sequence_length=seq_length,
    recombination_rate=3e-8,
    demography=demography, 
    random_seed=12)

# for individuals
U, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2, seq_length])
U0, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2])
U1, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[seq_length/2, seq_length])

idx = 0 # idx is the idx-th principal component
# correlation instead of allclose because PCA is rotation symmetric
np.corrcoef(U[0][:,idx], U0[0][:,idx]), np.corrcoef(U[1][:,idx], U1[0][:,idx])

# for nodes
U, _ = ts.pca(iterated_power=5, random_seed=1, windows=[0, seq_length/2, seq_length])
U0, _ = ts.pca(iterated_power=5, random_seed=1, windows=[0, seq_length/2])
U1, _ = ts.pca(iterated_power=5, random_seed=1, windows=[seq_length/2, seq_length])

idx = 0 # idx is the idx-th principal component
# correlation instead of allclose because PCA is rotation symmetric
np.corrcoef(U[0][:,idx], U0[0][:,idx]), np.corrcoef(U[1][:,idx], U1[0][:,idx])

@@ -8593,138 +8593,188 @@ def genetic_relatedness_vector(
return out

def pca(
self,
@petrelharp (Contributor)

I rearranged these to better match other methods (e.g., windows always comes first, so I put it first after n_components).

API partially adopted from `scikit-learn`:
https://scikit-learn.org/dev/modules/generated/sklearn.decomposition.PCA.html
self,
n_components: int = 10,
@petrelharp (Contributor)

perhaps we should not have a default?

def _rand_pow_range_finder(
operator: Callable,
@petrelharp (Contributor)

linting complains about Callable for some reason

C = operator(Q).T
U_hat, D, V = np.linalg.svd(C, full_matrices=False)
U = Q @ U_hat
return U[:,:rank], D[:rank], V[:rank]
return U[:, :rank], D[:rank], V[:rank]

def _genetic_relatedness_vector_individual(
@petrelharp (Contributor)

these changes are all automatic linting

x = np.apply_along_axis(bincount_fn, axis=0, arr=x)
x = x - x.mean(axis=0) if centre else x  # centering within index in cols

return x

def _genetic_relatedness_vector_node(
@petrelharp (Contributor)

same: automatic linting

@petrelharp (Contributor)

Okay; here I've made a good start at the tests. I think everything is working fine; the tests are not passing because (I think) of numerical tolerance. I could just set the numerical tolerance to something like 1e-4 and they'd pass, but I think this is flagging a bigger issue: how do we tell we're getting good answers?

I ask because currently the tests pass (at default tolerances) for small numbers of samples but not for 30 samples; if I increase iterated_power then they do pass. So the current defaults seem okay for 30 samples, but I worry that at 10,000 samples they might be very bad. If (u, s) is one of the output vector/value pairs, can we compare s * u to W @ u and translate the result into some measure of tolerance? The options I can think of, none very good, are

  1. add a tolerance argument and we iterate until this tolerance is reached
  2. return an estimate of error; perhaps we should return some kind of structured output that contains information about convergence as well?

We'd like this to not be a can of worms; I think our goal is to have something that is good enough, and forwards-compatible with an improved method in the future.
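The suggested comparison of s * u against W @ u could be turned into a relative residual, something like the following hedged sketch (`pca_residuals` and `matvec` are illustrative names, not part of the PR):

```python
import numpy as np

def pca_residuals(matvec, U, S):
    """Relative residuals ||A u_i - s_i u_i|| / s_i for each returned
    singular pair, where matvec applies the (implicit) GRM to a block
    of column vectors. Exact pairs give residuals near zero."""
    AU = matvec(U)                        # A @ U, all columns at once
    return np.linalg.norm(AU - U * S, axis=0) / S
```

A large residual on any component would then signal that more power iterations (or a larger sketch) are needed, which is one possible basis for either a tolerance loop or a returned error estimate.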

Notes:

  • For testing, I use np.linalg.svd, and (of course) had to mess around so that we'd correctly deal with indeterminacy of sign and also indeterminacy of ordering when singular values are the same.
  • Perhaps we should call it num_components rather than n_components to match elsewhere?
  • I did some rearranging of how the operator in _rand_svd was defined to avoid lint complaining (but was unsuccessful; so inserted the NOQA)
  • Some of the changes were due to automatic linting; apologies for the noise there.

TODO:

  • add tests that include samples and individuals arguments
  • test on some weirder tree sequences (that's easy)

@jeromekelleher (Member)

Just a quick note that I'd be very much in favour of returning a dataclass here rather than a tuple so that the option of returning more information about convergence etc is open.
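A structured return value along these lines might look like the following sketch; the field names are hypothetical, not the final tskit API:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class PCAResult:
    """Hypothetical structured result for pca(); illustrative only."""
    factors: np.ndarray                   # U: one row per sample/individual
    eigenvalues: np.ndarray               # leading singular values
    range_sketch: np.ndarray              # Q: reusable to refine the answer
    error_bound: Optional[float] = None   # room for a convergence estimate
```

New fields (convergence diagnostics, number of iterations used, and so on) can then be added later without breaking existing callers that access attributes by name.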

@hanbin973 (Author)

There's an adaptive rangefinder algorithm described in Halko et al. (https://arxiv.org/pdf/0909.4061, Algo 4.2). I don't see it implemented in scikit-learn (https://scikit-learn.org/dev/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD).
Halko et al. also has an error estimation formula with an overhead similar to the main routine.

I like Jerome's idea of returning a class instead of a plain tuple. There's an intermediate matrix Q (an approximate range of the original matrix) that takes up most of the computation, and storing it is useful in many cases. For example, one can run additional power iterations on Q, rather than starting from scratch, to improve precision.
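The a posteriori error estimate mentioned here can be sketched with random probes, following the idea in Halko et al. (2011, sec. 4.3). This is a hedged illustration, not the exact algorithm from the paper; `range_error_estimate` and `matvec` are placeholder names:

```python
import numpy as np

def range_error_estimate(matvec, Q, num_probes=10, seed=None):
    """Probabilistic upper estimate of ||(I - Q Q^T) A||, the part of
    the operator A missed by the range sketch Q, using num_probes
    random Gaussian probe vectors. matvec applies A to a block."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((Q.shape[0], num_probes))
    y = matvec(omega)               # A applied to the probes
    resid = y - Q @ (Q.T @ y)       # component outside range(Q)
    norms = np.linalg.norm(resid, axis=0)
    # This bound holds with probability at least 1 - 10**(-num_probes).
    return 10.0 * np.sqrt(2.0 / np.pi) * norms.max()
```

The overhead is a handful of extra matvec passes, so it could plausibly be folded into the structured output discussed above as an `error_bound`-style field.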

@hanbin973 (Author)

Not a class object yet, but I added random_sketch to the input/output.
