Bootstrapping to estimate uncertainty in subpopulation pi #2727
Replies: 2 comments 3 replies
-
Hi! If you're after an estimate of Var(pi) over a large interval, then I'd suggest: calculating (not-span-normalized) pi in windows, resampling these windows with replacement, summing pi over resampled windows, summing window length over resampled windows (to get resampled sequence length), then normalizing. Because pi is a linear statistic, you only need to traverse the tree sequence once. So, something like, import tskit
import msprime
import numpy as np
# fake data
ts = msprime.sim_ancestry(samples=100, sequence_length=10e6, recombination_rate=1e-8, population_size=10000)
# calculate pi in windows (will work for mode='site' too)
num_windows = 100
num_boot = 50
windows = np.linspace(0, ts.sequence_length, num_windows+1)
window_length = np.diff(windows)
pi_raw = ts.diversity(windows=windows, mode='branch', span_normalise=False)
# resampling with replacement is equivalent to multinomial reweighting
# e.g. matrix-vector product where each matrix row is a draw from a uniform multinomial
weights = np.random.multinomial(num_windows, [1/num_windows]*num_windows, size=num_boot)
# matrix multiplication b/w weights and windows gives vector of bootstrap replicates;
# do this for numerator (sum of pi) and denominator (sum of window length)
pi_boot = (weights @ pi_raw) / (weights @ window_length)
np.std(pi_boot) # 691.04
# compare against repeated simulations from "true" model
ts_gen = msprime.sim_ancestry(samples=100, sequence_length=10e6, recombination_rate=1e-8, population_size=10000, num_replicates=50)
pi_sim = [ts.diversity(mode='branch') for ts in ts_gen]
np.std(pi_sim) # 771.44 In this example, windows are the same length, so all the span normalization stuff is unnecessary. However, in practice you might want to calculate the number of accessible sites per window, and use this as "window_length" -- e.g. for each bootstrap replicate, there'd be a different amount of accessible sequence, and it'd be important to correctly adjust for this. If you're after uncertainty estimates for pi on a per-tree level, then I think you'd need to jack-knife samples rather than bootstrap -- to me, resampling sample nodes with replacement seems a bit dubious in this genealogical context. There's probably a way to jack-knife pi efficiently with tree sequences, but I haven't worked it out. |
Beta Was this translation helpful? Give feedback.
-
Just to say here that I don't think there's any deep reason for making this check in the code, and we could easily disable it if it turns out there's a good reason to. |
Beta Was this translation helpful? Give feedback.
-
Hi all,
I'm trying to get an estimate of the uncertainty associated with an estimate of pi in a subpopulation. My current strategy is to bootstrap nodes, sampling with replacement. However, I can't sample the same node twice because diversity() throws the "Duplicate sample value" error. Is there a way to bypass this check? Also open to suggestions of other methods to estimate uncertainty.
Thanks!
Beta Was this translation helpful? Give feedback.
All reactions