Bootstrapping to estimate uncertainty in subpopulation pi #2727

MeaghanClark · 2023-03-24T17:19:57Z

Hi all,

I'm trying to get an estimate of the uncertainty associated with an estimate of pi in a subpopulation. My current strategy is to bootstrap nodes, sampling with replacement. However, I can't sample the same node twice because diversity() throws the "Duplicate sample value" error. Is there a way to bypass this check? Also open to suggestions of other methods to estimate uncertainty.

Thanks!

nspope (Collaborator) · 2023-03-25T14:47:22Z

Hi! If you're after an estimate of Var(pi) over a large interval, I'd suggest calculating pi in windows without span normalization, resampling these windows with replacement, summing pi over the resampled windows, summing window length over the resampled windows (to get the resampled sequence length), and then normalizing. Because pi is a linear statistic, you only need to traverse the tree sequence once. So, something like:

import tskit
import msprime
import numpy as np

# fake data
ts = msprime.sim_ancestry(samples=100, sequence_length=10e6, recombination_rate=1e-8, population_size=10000)

# calculate pi in windows (will work for mode='site' too)
num_windows = 100
num_boot = 50
windows = np.linspace(0, ts.sequence_length, num_windows+1)
window_length = np.diff(windows)
pi_raw = ts.diversity(windows=windows, mode='branch', span_normalise=False)

# resampling with replacement is equivalent to multinomial reweighting
# e.g. matrix-vector product where each matrix row is a draw from a uniform multinomial
weights = np.random.multinomial(num_windows, [1/num_windows]*num_windows, size=num_boot)

# matrix multiplication b/w weights and windows gives vector of bootstrap replicates;
# do this for numerator (sum of pi) and denominator (sum of window length)
pi_boot = (weights @ pi_raw) / (weights @ window_length)
np.std(pi_boot) # 691.04

# compare against repeated simulations from "true" model
ts_gen = msprime.sim_ancestry(samples=100, sequence_length=10e6, recombination_rate=1e-8, population_size=10000, num_replicates=50)
pi_sim = [ts.diversity(mode='branch') for ts in ts_gen]
np.std(pi_sim) # 771.44

In this example, windows are the same length, so all the span normalization stuff is unnecessary. However, in practice you might want to calculate the number of accessible sites per window, and use this as "window_length" -- e.g. for each bootstrap replicate, there'd be a different amount of accessible sequence, and it'd be important to correctly adjust for this.
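To make the accessible-sites point concrete, here is a minimal sketch with made-up numbers. The arrays `accessible` and `pi_raw` are synthetic stand-ins (assumptions, not real data); in practice `pi_raw` would come from `ts.diversity(..., span_normalise=False)` as above and `accessible` from whatever accessibility mask you use.

```python
import numpy as np

rng = np.random.default_rng(1)

num_windows = 100
num_boot = 200

# Hypothetical inputs: accessible bases per window and un-normalised pi per window.
accessible = rng.integers(20_000, 100_000, size=num_windows).astype(float)
pi_raw = accessible * rng.normal(1e-3, 1e-4, size=num_windows)

# Point estimate: total pi divided by total accessible sequence.
pi_hat = pi_raw.sum() / accessible.sum()

# Each row of `weights` counts how often each window is drawn in one replicate.
weights = rng.multinomial(
    num_windows, np.full(num_windows, 1 / num_windows), size=num_boot
)

# Numerator and denominator are resampled together, so every replicate is
# normalised by its own amount of accessible sequence.
pi_boot = (weights @ pi_raw) / (weights @ accessible)
se_boot = pi_boot.std(ddof=1)
```

Using the same `weights` matrix for both sums is the important part: it keeps each window's pi paired with that window's accessible length.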

If you're after uncertainty estimates for pi on a per-tree level, then I think you'd need to jack-knife samples rather than bootstrap -- to me, resampling sample nodes with replacement seems a bit dubious in this genealogical context. There's probably a way to jack-knife pi efficiently with tree sequences, but I haven't worked it out.
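For reference, a brute-force delete-one jackknife over samples could look roughly like the sketch below. It works on a plain 0/1 haplotype matrix (samples by sites) rather than on a tree sequence, so it is not the efficient tree-based version alluded to above; the function names and the random test data are hypothetical. With a tree sequence, the analogous loop would pass each leave-one-out subset of nodes to `ts.diversity(sample_sets=...)`. Note that the jackknife treats samples as exchangeable draws, which is itself an assumption when they share a genealogy.

```python
import numpy as np

def pairwise_diversity(haps):
    """Mean pairwise differences per site for a 0/1 matrix (samples x sites)."""
    n, num_sites = haps.shape
    derived = haps.sum(axis=0)
    # A site with k derived copies contributes k * (n - k) differing pairs.
    return (derived * (n - derived)).sum() / (n * (n - 1) / 2) / num_sites

def jackknife_pi(haps):
    """Delete-one jackknife over samples (hypothetical helper)."""
    n = haps.shape[0]
    pi_full = pairwise_diversity(haps)
    keep = ~np.eye(n, dtype=bool)  # row i is a mask that drops sample i
    pi_loo = np.array([pairwise_diversity(haps[keep[i]]) for i in range(n)])
    # Standard jackknife variance of the leave-one-out estimates.
    var_jack = (n - 1) / n * ((pi_loo - pi_loo.mean()) ** 2).sum()
    return pi_full, pi_loo, np.sqrt(var_jack)

# Synthetic haplotypes, for illustration only.
rng = np.random.default_rng(7)
haps = (rng.random((20, 500)) < 0.2).astype(int)
pi_full, pi_loo, se_jack = jackknife_pi(haps)
```

Because pi is an average over pairs, the mean of the leave-one-out estimates equals the full-sample estimate, which is a handy sanity check on any implementation.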

3 replies

MeaghanClark Mar 29, 2023
Author

Thanks! I really appreciate this reply and your help. In this case, I’m interested in estimating the variance in pi among the nodes in my subsample rather than variance along the genome, so I could try implementing a jackknife approach. I was wondering if you could explain more about why resampling nodes with replacement is dubious. For more context: I’m trying to estimate pi and variance around pi from a subset of individuals alive at a specific time point and compare that to pi from other subsets of individuals at the same time point as well as from past time points. I’m working with tree sequences generated in SLiM, where I’ve remembered individuals at the timepoints I’m interested in.

nspope Mar 29, 2023
Collaborator

I see! What I meant by "dubious" is that it's not clear to me what the genealogical relationships should be in the resampled data. In other words, I don't know how to sample "pieces" of a genealogy with replacement, and piece them back together in a way that is meaningful. Whereas, when subsetting nodes (e.g. jackknifing) the genealogical relationships are maintained; as is also the case with resampling windows (e.g. stitching different blocks of trees together). So it's not clear to me that sampling nodes with replacement would produce "simulations" that resemble the actual stochastic process that generated the data (although I'd love to be convinced otherwise!)
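Separately from the genealogical concern, there is a purely arithmetic symptom of resampling nodes with replacement: two copies of the same node differ at zero sites, so pi computed on a resampled set is pulled downwards. For n nodes drawn with replacement from n originals, the expected value works out to (1 - 1/n) times the original estimate. The sketch below checks this on a synthetic 0/1 haplotype matrix; the data and the function name are hypothetical, and it does not use tskit.

```python
import numpy as np
from itertools import combinations

def mean_pairwise_diff(haps):
    """Average number of differing sites over all pairs of rows."""
    n = haps.shape[0]
    total = sum(np.sum(haps[a] != haps[b]) for a, b in combinations(range(n), 2))
    return total / (n * (n - 1) / 2)

# Synthetic haplotypes, for illustration only.
rng = np.random.default_rng(3)
n = 10
haps = (rng.random((n, 200)) < 0.3).astype(int)
pi_hat = mean_pairwise_diff(haps)

# Resample rows with replacement; duplicated rows form zero-distance pairs.
boot = np.array([
    mean_pairwise_diff(haps[rng.integers(0, n, size=n)])
    for _ in range(2000)
])

# Expected value of the resampled statistic under sampling with replacement.
expected = (1 - 1 / n) * pi_hat
```

Comparing `boot.mean()` against `expected` and `pi_hat` shows the downward shift, so even if the duplicate check were disabled, the replicates would need a correction before being used as a bootstrap distribution.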

But your use case is interesting! Basically, you have some subset of individuals of interest in generation t (measuring forwards in time) that are recorded in the tree sequence. You want to calculate Var(pi) using these individuals. Do I have this right? In that case, let's say you were to terminate the simulation at time t and output a tree sequence, and then resample windows to get a sequence-wide estimate of Var(pi) for your subpopulation of interest. Would this give you what you want?

If so, then the trick would be to figure out if (and how) it'd be possible to replicate this with a tree sequence from the present day (e.g. where your various subsets of interest are stored as collections of internal nodes or non-contemporary sample nodes). Or, it may be possible to use an estimator that doesn't involve resampling: for example, I suspect it's possible to write Var(pi) in terms of the site frequency spectrum for the samples, akin to what is done in Ragsdale and Gravel for their empirical D^2/pi_2 statistics. That would involve measuring the length of shared branches on the subtrees that relate the samples, which is definitely doable. Let me think about it a bit further!
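As a starting point for the site-frequency-spectrum route, the sketch below shows only the well-known first step: pi itself is a linear function of the SFS, since a site with k derived copies among n samples contributes k(n - k) differing pairs. It does not derive the variance expression speculated about above. The function names and the toy matrix are hypothetical, and the input is a 0/1 haplotype matrix rather than a tree sequence.

```python
import numpy as np

def sfs_from_haps(haps):
    """Unfolded SFS: entry k counts sites with k derived copies."""
    n = haps.shape[0]
    derived = haps.sum(axis=0)
    return np.bincount(derived, minlength=n + 1)

def pi_from_sfs(sfs):
    """Total (not per-site) pairwise diversity from an unfolded SFS."""
    n = len(sfs) - 1
    k = np.arange(n + 1)
    return (k * (n - k) * sfs).sum() / (n * (n - 1) / 2)

# Toy data: 3 samples, 2 sites.
haps = np.array([[0, 1],
                 [1, 1],
                 [0, 0]])
sfs = sfs_from_haps(haps)
pi_total = pi_from_sfs(sfs)
```

In this toy case the three pairs differ at 1, 1 and 2 sites, so the direct average and the SFS-based value agree.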

MeaghanClark Mar 31, 2023
Author

Thank you! That makes sense re: resampling the tree and genealogical relationships.

I think I'm not being super clear about what I'm trying to do, so let me clarify a bit: I’ve simulated a non-WF population with overlapping generations, enacted a severe bottleneck, and run the simulation for a few hundred cycles afterwards. I’ve remembered 100 individuals at a number of different time points before and after the bottleneck. I want to compare diversity between subpopulations between different time points to see in which simulations I see a "significant" decline in diversity (for varying measures of diversity). Because of the bottleneck, I am limited in the number of individuals I can sample. If I estimate Var(pi) using windows, I think I would end up with the uncertainty associated with pi of the specific individuals I sampled, which isn’t exactly what I’m looking for. Instead, I would like to estimate Var(pi) for the subpopulation (of which I only remembered a random sample), which feels like it should be across nodes rather than genomic windows, which is why I was resampling nodes.

I ran into similar issues with estimating the uncertainty around estimates of Watterson’s theta, but I think I was able to circumvent them by excluding the duplicated node when calculating segregating sites, but including the total number of nodes (including duplicates) in the correction factor in the denominator.
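A minimal sketch of the workaround described in the previous paragraph, written against a 0/1 haplotype matrix rather than a tree sequence. The function name and the toy data are hypothetical, and this only illustrates the bookkeeping; it is not an endorsed estimator.

```python
import numpy as np

def watterson_theta_resampled(haps, draws):
    """Theta from resampled row indices `draws` (duplicates allowed).

    Segregating sites are counted among the unique rows only, while the
    harmonic-number denominator uses the total number of draws.
    """
    draws = np.asarray(draws)
    unique_nodes = np.unique(draws)
    derived = haps[unique_nodes].sum(axis=0)
    seg_sites = np.count_nonzero((derived > 0) & (derived < len(unique_nodes)))
    a_n = (1.0 / np.arange(1, len(draws))).sum()  # sum of 1/i for i = 1..n-1
    return seg_sites / a_n

# Toy data: 4 samples, 5 sites.
haps = np.array([[0, 1, 0, 0, 1],
                 [0, 1, 1, 0, 0],
                 [0, 0, 0, 0, 1],
                 [0, 1, 0, 0, 0]])

theta_all = watterson_theta_resampled(haps, [0, 1, 2, 3])  # no duplicates
theta_dup = watterson_theta_resampled(haps, [0, 0, 2, 3])  # node 0 drawn twice
```

With no duplicates this reduces to the usual estimator. Once a node is drawn twice, the count of segregating sites can only stay the same or drop while the denominator is unchanged, so the replicates sit at or below the full-sample value.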

jeromekelleher (Maintainer) · 2023-03-30T08:23:29Z

> I can't sample the same node twice because diversity() throws the "Duplicate sample value" error. Is there a way to bypass this check?

Just to say here that I don't think there's any deep reason for making this check in the code, and we could easily disable it if it turns out there's a good reason to.
