Tree-based representation of quantitative traits #2883

hanbin973 · 2023-12-30T13:53:59Z

hanbin973
Dec 30, 2023
Collaborator

Hi everyone. Thank you for answering my question on #2882. Here, I will elaborate on the recent work that led to those questions. I'm not sure how to write TeX on GitHub, so I apologize for simply copying and pasting a screenshot.

I recently discovered an interesting representation of quantitative traits using ARGs. Assuming an additive model, the sum over sites can be written as a sum over edges. Here, edges are actually bricks in the sense that sample descendants of edges are constant along their span.
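For readers without the screenshot, here is a hedged sketch of the identity, in notation that is assumed here rather than taken from the original derivation. Let $g_{ip}$ be the genotype of sample $i$ at site $p$, $\beta_p$ the effect size of site $p$, $A_{ie}$ the indicator that sample $i$ descends from brick $e$, and $m_{ep}$ the indicator that the mutation at site $p$ falls on brick $e$. Under the infinite-sites assumption each site carries at most one mutation, so $g_{ip} = \sum_e A_{ie} m_{ep}$, and swapping the order of summation gives:

```latex
y_i \;=\; \sum_{p} g_{ip}\,\beta_p
    \;=\; \sum_{p} \Big( \sum_{e} A_{ie}\, m_{ep} \Big) \beta_p
    \;=\; \sum_{e} A_{ie}\, \underbrace{\sum_{p} m_{ep}\,\beta_p}_{u_e}
```

In this sketch $u_e$ is the per-brick quantity that plays the role of a random effect, and $A$ takes the place of the genotype matrix.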

The inner summation over sites (p) is approximately a Gaussian random variable. The mean and variance of these Gaussian variables are functions of edge length, edge span, the mutation rate of the edge/site, and the effect sizes of the sites. The exact formulas can be obtained using Lyapunov/Lindeberg-Feller style conditions. Furthermore, these random variables are mutually independent due to the infinite-sites assumption. I will share the proofs as soon as I figure out how to write TeX here (or maybe just link to arXiv when I'm done).

The bottom line is that the sample-edge matrix A behaves like the genotype matrix, and the random variables described above behave like random effects coefficients in GCTA-like models. Since SNPs and (bricked) edges can be identified, this seems to be a more general theory that subsumes previous SNP-random effects models. After centering the random effects to have mean zero, it also includes a fixed effects portion, which is present only if the mutation rate varies within a site. I think this has some profound connection to non-neutral variants that warrants further discussion.

Given a bricked tree sequence (using the ldgm package by @awohns), I have a proposal on performing certain matrix multiplications over all edges and samples. Below is an example of counting the number of descendants of each (bricked) edge.

import numba
import numpy as np

@numba.njit
def _node_ptr_parent(num_nodes, edges_parent):
    # CSR-style offsets: edges with parent n occupy node_ptr[n]:node_ptr[n+1].
    # This relies on edges being grouped by parent in increasing node id.
    nodes_num_echild = np.zeros(num_nodes+1, dtype=np.int32)
    for e_parent in edges_parent:
        nodes_num_echild[e_parent+1] += 1
    return np.cumsum(nodes_num_echild)

@numba.njit
def _bricks_num_samples(edges_child, edges_left, edges_right, num_samples, node_ptr):
    bricks_num_samples = np.zeros(edges_child.shape[0], dtype=np.int32)

    # bricks with samples as child (sample nodes are assumed to be 0..num_samples-1)
    for e, e_child in enumerate(edges_child):
        if e_child < num_samples:
            bricks_num_samples[e] += 1 # A * x (x is vector) can be done by replacing 1 with x[e_child]

    # weight propagation on bricks: child bricks come earlier in the edge table,
    # so their counts are final by the time brick e is visited
    for e, e_child in enumerate(edges_child):
        e_begin, e_end = node_ptr[e_child], node_ptr[e_child+1]
        for e_cc in range(e_begin, e_end):
            if (edges_right[e_cc] > edges_left[e]) and (edges_right[e] > edges_left[e_cc]): # filter overlapping bricks
                if edges_left[e_cc] <= edges_left[e]: # only count bricks that appeared earlier
                    bricks_num_samples[e] += bricks_num_samples[e_cc]

    return bricks_num_samples
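
To make the pointer array concrete, here is a minimal sketch in plain NumPy (no numba) on a toy single tree. The function name `node_ptr_by_parent` and the toy edge arrays are made up for illustration; the layout mirrors `_node_ptr_parent` above and assumes edges are grouped by parent in increasing node id.

```python
import numpy as np

def node_ptr_by_parent(num_nodes, edges_parent):
    # Count the edges below each parent, shifted by one slot, so that the
    # cumulative sum yields start offsets (like the indptr of a CSR matrix).
    counts = np.zeros(num_nodes + 1, dtype=np.int64)
    for parent in edges_parent:
        counts[parent + 1] += 1
    return np.cumsum(counts)

# Toy tree: samples 0..3, internal nodes 4 = (0, 1), 5 = (2, 3), root 6 = (4, 5).
edges_parent = np.array([4, 4, 5, 5, 6, 6])
edges_child = np.array([0, 1, 2, 3, 4, 5])

ptr = node_ptr_by_parent(7, edges_parent)

# Edges whose parent is node 5 occupy the slice ptr[5]:ptr[6].
children_of_5 = edges_child[ptr[5]:ptr[6]]
print(ptr, children_of_5)
```

Sample nodes have no child edges, so their slices are empty and the inner loop of the propagation is skipped for them.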

I guess the time complexity is something like O(num_edges * log(num_samples) + num_edges).

Maybe these calculations don't really need bricks and can be done on vanilla tree sequences?
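
One way to sanity-check the propagation, whichever table it runs on, is to compare it against a dense sample-by-edge matrix on a tiny example. The sketch below does this for a single toy tree, where every edge is trivially a brick. All names (`dense_sample_edge_matrix`, `propagate_up`) and the toy tree are hypothetical, and it assumes the convention that A has shape samples x edges, so the leaf-to-root pass computes A^T x.

```python
import numpy as np

# Toy single tree: samples 0..3, node 4 = (0, 1), node 5 = (2, 3), root 6 = (4, 5).
edges_parent = np.array([4, 4, 5, 5, 6, 6])
edges_child = np.array([0, 1, 2, 3, 4, 5])
num_samples, num_edges = 4, 6

def dense_sample_edge_matrix():
    # A[i, e] = 1 if sample i descends from the child node of edge e.
    parent_of = {int(c): int(p) for p, c in zip(edges_parent, edges_child)}
    edge_above = {int(c): e for e, c in enumerate(edges_child)}
    A = np.zeros((num_samples, num_edges))
    for i in range(num_samples):
        node = i
        while node in parent_of:  # walk from the sample up to the root
            A[i, edge_above[node]] = 1.0
            node = parent_of[node]
    return A

def propagate_up(x):
    # Leaf-to-root pass: an edge's value is the sum over the edges directly
    # below its child node. Child edges appear earlier in the table.
    out = np.zeros(num_edges)
    for e, c in enumerate(edges_child):
        if c < num_samples:
            out[e] = x[c]
        else:
            out[e] = out[edges_parent == c].sum()
    return out

A = dense_sample_edge_matrix()
x = np.array([1.0, 2.0, 3.0, 4.0])
by_propagation = propagate_up(x)
by_dense = A.T @ x
counts = propagate_up(np.ones(num_samples))  # number of sample descendants per edge
```

With recombination, the dense reference would have to be built per tree and aggregated over bricks, which this sketch does not attempt.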

On the practical side, my proposal is that testing all edges instead of observed SNPs is the right way to do GWAS. This is similar to Zhang et al. (https://www.nature.com/articles/s41588-023-01379-x), but more exhaustive because Zhang et al. test only a subset of edges, sampled by placing mutations on edges. Using graph-based linear algebra would be much faster. Also, the model argues that principal component analysis for population-structure adjustment should be performed on the matrix A, not the genotype matrix G.

hanbin973 · 2024-01-07T11:57:58Z

hanbin973
Jan 7, 2024
Collaborator Author

This one also performs matrix * vector multiplication similar to the previous code, but the matrix is transposed.

@numba.njit
def rev_enumerate(iterable, start=0):
    # like enumerate(), but yields (index, element) from the last element to the first
    count = len(iterable) - 1
    for elem in iterable[::-1]:
        yield count, elem
        count -= 1

@numba.njit
def _samples_num_bricks(edges_child, edges_left, edges_right, num_samples, node_ptr):
    edges_num_ancestors = np.zeros(edges_child.shape[0], dtype=np.int32)

    # push counts downwards from roots to leaves (reverse edge order)
    for e, e_child in rev_enumerate(edges_child):
        e_begin, e_end = node_ptr[e_child], node_ptr[e_child+1]
        for e_cc in range(e_begin, e_end):
            if (edges_right[e_cc] > edges_left[e]) and (edges_right[e] > edges_left[e_cc]):
                if edges_left[e_cc] <= edges_left[e]:
                    edges_num_ancestors[e_cc] += edges_num_ancestors[e] + 1 # replace 1 with x[e]

    # collect the totals on the bricks whose child is a sample
    # (the original sample_count early exit never fired, so it is dropped)
    samples_num_bricks = np.zeros(num_samples)
    for e, e_child in enumerate(edges_child):
        if e_child < num_samples:
            samples_num_bricks[e_child] += edges_num_ancestors[e] + 1 # replace 1 with x[e]

    return samples_num_bricks

Since we can perform both Av and A^t v, the next step is to do PCA using modern SVD algorithms (e.g. https://www.jstatsoft.org/article/view/v089i11). Are there naming conventions for tskit? I would like to adjust the variable names before proceeding much further.
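
Before plugging two such routines into an SVD solver, it may be worth confirming that they really are transposes of each other. A common check is the dot-product test: for random v and w, w . (A v) should equal (A^t w) . v. The sketch below uses a small dense matrix as a stand-in; `check_adjoint` is a hypothetical helper, not part of tskit or of the code in this thread.

```python
import numpy as np

def check_adjoint(matvec, rmatvec, n_rows, n_cols, n_trials=5, seed=0):
    # Dot-product test: <w, A v> must equal <A^T w, v> for any v, w.
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        v = rng.normal(size=n_cols)
        w = rng.normal(size=n_rows)
        if not np.isclose(w @ matvec(v), rmatvec(w) @ v):
            return False
    return True

# Dense stand-in: 4 samples x 6 edges of a toy balanced tree.
A = np.array([
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

ok = check_adjoint(lambda v: A @ v, lambda w: A.T @ w, 4, 6)
# A deliberately wrong "transpose" should fail the test.
bad = check_adjoint(lambda v: A @ v, lambda w: 2.0 * (A.T @ w), 4, 6)
print(ok, bad)
```

To apply this to the tree-sequence routines, one would pass the edges-to-samples product as `matvec` and the samples-to-edges product as `rmatvec`, with `n_rows` the number of samples and `n_cols` the number of edges.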


hanbin973 · 2024-01-08T07:49:35Z

hanbin973
Jan 8, 2024
Collaborator Author

TL;DR: PCA on tree branches successfully recovers population structure and is very fast.
See: [figure from the original post, not preserved here]

I implemented randomized SVD on the individual-edge design matrix. Randomized SVD only requires the products v |-> Av and v |-> A^t v; the full matrix A is never formed. Therefore, as long as these products are fast, the randomized SVD is efficient as well.
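
As a minimal sketch of why only products are needed, here is a generic randomized SVD with subspace iteration that takes two callables in place of the matrix. The function name, its parameters, and the dense low-rank test matrix are assumptions for illustration; this is not the implementation used in this thread.

```python
import numpy as np

def randomized_svd(matmat, rmatmat, n_cols, k, p=5, n_iter=3, seed=0):
    # matmat(X) returns A @ X and rmatmat(X) returns A.T @ X; A is never formed here.
    rng = np.random.default_rng(seed)
    # Sketch the range of A with a Gaussian test matrix (k + p columns).
    Y = matmat(rng.normal(size=(n_cols, k + p)))
    # Subspace iteration, re-orthonormalizing after every product for stability.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Y)
        Z, _ = np.linalg.qr(rmatmat(Q))
        Y = matmat(Z)
    Q, _ = np.linalg.qr(Y)
    # Project A onto the basis: B = Q^T A has only k + p rows.
    B = rmatmat(Q).T
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Dense rank-8 stand-in for the design matrix.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 8)) @ rng.normal(size=(8, 200))

U, s, Vt = randomized_svd(lambda X: A @ X, lambda X: A.T @ X, n_cols=200, k=5)
s_exact = np.linalg.svd(A, compute_uv=False)[:5]
```

Because the stand-in has rank 8 and the sketch uses k + p = 10 columns, the leading singular values should agree with the exact ones up to floating-point error. For a full-rank matrix the agreement depends on how fast the spectrum decays and on `n_iter`.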

The functions are:

import numpy as np
import numpy.linalg as linalg
import scipy.sparse as sparse
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import demes, msprime, demesdraw
import tskit
import numba 
import ldgm

@numba.njit
def rev_enumerate(iterable, start=0):
    count = len(iterable) - 1
    for elem in iterable[::-1]:
        yield count, elem
        count -= 1

@numba.njit
def _node_ptr_parent(num_nodes, edges_parent):
    nodes_num_echild = np.zeros(num_nodes+1, dtype=np.int32)
    for e_parent in edges_parent:
        nodes_num_echild[e_parent+1] += 1
    return np.cumsum(nodes_num_echild)

@numba.njit
def _collapse_samples_1d(edges_child, edges_left, edges_right, num_samples, node_ptr, samples_val):
    """
    Perform sample-side matrix-vector multiplication between the sample-brick incidence matrix and a vector.

    :param np.ndarray edges_child: An array of child nodes of edges.
    :param np.ndarray edges_left: An array of left coordinates of edges.
    :param np.ndarray edges_right: An array of right coordinates of edges.
    :param int num_samples: Number of samples of tree sequence.
    :param np.ndarray node_ptr: Pointer to edge indices ordered by parent nodes.
    :param np.ndarray samples_val: Vector of sample values. This is the multiplied vector.

    :return: Result vector of the product.
    :rtype: np.ndarray
    """

    # output vector
    edges_out = np.zeros(edges_child.shape[0])

    # initialize edges whose child is a sample; a sample node can be the child
    # of several edges along the genome, so every edge is scanned (no early exit)
    for e, e_child in enumerate(edges_child):
        if e_child < num_samples:
            edges_out[e] += samples_val[e_child]

    # propagate sample values upwards from leaves to roots
    for e, e_child in enumerate(edges_child): 
        e_begin, e_end = node_ptr[e_child], node_ptr[e_child+1]
        for e_cc in range(e_begin, e_end):
            if (edges_right[e_cc] > edges_left[e]) and (edges_right[e] > edges_left[e_cc]):
                if edges_left[e_cc] <= edges_left[e]:
                    edges_out[e] += edges_out[e_cc]

    return edges_out
    
@numba.njit
def _collapse_bricks_1d(edges_child, edges_left, edges_right, num_samples, node_ptr, edges_val):
    """
    Perform brick-side matrix-vector multiplication between the sample-brick incidence matrix and a vector.

    :param np.ndarray edges_child: An array of child nodes of edges.
    :param np.ndarray edges_left: An array of left coordinates of edges.
    :param np.ndarray edges_right: An array of right coordinates of edges.
    :param int num_samples: Number of samples of tree sequence.
    :param np.ndarray node_ptr: Pointer to edge indices ordered by parent nodes.
    :param np.ndarray edges_val: Vector of brick values. This is the multiplied vector.

    :return: Result vector of the product.
    :rtype: np.ndarray
    """
    
    # stores values inheriting from parent edges 
    edges_lazy = np.zeros(edges_child.shape[0])
    
    # propagate brick values downwards from roots to leaves
    for e, e_child in rev_enumerate(edges_child): 
        e_begin, e_end = node_ptr[e_child], node_ptr[e_child+1]
        for e_cc in range(e_begin, e_end):
            if (edges_right[e_cc] > edges_left[e]) and (edges_right[e] > edges_left[e_cc]):
                if edges_left[e_cc] <= edges_left[e]:
                    edges_lazy[e_cc] += edges_lazy[e] + edges_val[e] 

    # output vector: each sample collects the values of all bricks above it
    # (the original sample_count early exit never fired, so it is dropped)
    samples_out = np.zeros(num_samples)
    for e, e_child in enumerate(edges_child):
        if e_child < num_samples:
            samples_out[e_child] += edges_lazy[e] + edges_val[e]

    return samples_out

def ts_brick_prod_samples(ts, node_ptr, mat):
    num_row, num_col = mat.shape

    out = np.empty((ts.num_edges, num_col))
    for j in range(num_col):
        out[:,j] = _collapse_samples_1d(
            ts.edges_child,
            ts.edges_left,
            ts.edges_right,
            ts.num_samples,
            node_ptr,
            mat[:,j]
        )
    
    return out

def ts_brick_prod_bricks(ts, node_ptr, mat):
    num_row, num_col = mat.shape

    out = np.empty((ts.num_samples, num_col))
    for j in range(num_col):
        out[:,j] = _collapse_bricks_1d(
            ts.edges_child,
            ts.edges_left,
            ts.edges_right,
            ts.num_samples,
            node_ptr,
            mat[:,j]
        )
    
    return out

def _sp_individuals_samples(ts):

    out = sparse.csc_matrix(
        (
            np.ones(ts.num_samples), 
            ts.nodes_individual[ts.samples()], 
            np.arange(ts.num_samples+1)
        ), 
        shape=(ts.num_individuals, ts.num_samples)
    )

    return out

def ts_prod_samples(ts, node_ptr, mat):

    out = ts_brick_prod_samples(
        ts, 
        node_ptr, 
        _sp_individuals_samples(ts).T.dot(mat)
    )  

    return out

def ts_prod_bricks(ts, node_ptr, mat):
    
    out = _sp_individuals_samples(ts).dot(ts_brick_prod_bricks(ts, node_ptr, mat))

    return out

def subspace_iter(ts, node_ptr, sketch_mat, n_iter):
    # tuple unpacking of qr() also works on NumPy versions without the .Q attribute
    for j in range(n_iter):
        q, _ = linalg.qr(sketch_mat)
        q, _ = linalg.qr(ts_brick_prod_samples(ts, node_ptr, q))
        sketch_mat = ts_brick_prod_bricks(ts, node_ptr, q)
    return sketch_mat

def subspace_iter_individuals(ts, node_ptr, sketch_mat, n_iter):
    for j in range(n_iter):
        q, _ = linalg.qr(sketch_mat)
        q, _ = linalg.qr(ts_prod_samples(ts, node_ptr, q))
        sketch_mat = ts_prod_bricks(ts, node_ptr, q)
    return sketch_mat

def svd_samples_edges(ts, k=10, p=5, q=3):
    # k: target rank, p: oversampling, q: number of subspace iterations

    node_ptr = _node_ptr_parent(ts.num_nodes, ts.edges_parent)

    # construct sketch matrix
    random_test = np.random.normal(size=(ts.num_edges, k+p))
    sketch = ts_prod_bricks(ts, node_ptr, random_test)

    # subspace iteration
    sketch = subspace_iter_individuals(ts, node_ptr, sketch, q)

    # obtain basis
    basis, _ = linalg.qr(sketch)
    proj = ts_prod_samples(ts, node_ptr, basis).T

    # exact SVD on the small projected matrix (linalg.svd returns V transposed)
    U, S, Vt = linalg.svd(proj, full_matrices=False)

    # left singular vectors in individual space (all k+p columns are returned)
    return basis @ U

The model is copied from the demes tutorial.

model = demes.load('test.yml')

The simulation code is:

demography = msprime.Demography.from_demes(model)
samples = {'A':2000, 'B':2000}
seq_length = 1e6
ts = msprime.sim_ancestry(
    [msprime.SampleSet(n, ploidy=2, population=p) for p, n in samples.items()],
    demography=demography,
    model=[
        msprime.StandardCoalescent(),
    ],
    recombination_rate=1e-7,
    sequence_length=seq_length,
    record_migrations=True,
    random_seed=1
)
bts = ldgm.brick_ts(ts)
samples_pc = svd_samples_edges(bts)

This takes about 5.08 s on my machine (a laptop).

1 reply

gregorgorjanc Jan 8, 2024
Collaborator

@hanbin973 you are a machine! @brieuclehmann @petrelharp @jeromekelleher have been doing something similar to do PCA directly from a tree sequence, by leveraging the relatedness code/approach. We should compare and discuss…

jeromekelleher · 2024-01-08T10:17:54Z

jeromekelleher
Jan 8, 2024
Maintainer

Wow, this is amazing @hanbin973! I would love to hear more about this.


hyanwong · 2024-12-06T21:02:04Z

hyanwong
Dec 6, 2024
Maintainer

Just to note that I understand this is being worked up into the PR at #3008
