ts.pair_coalescence_rates can give large or infinite rates #3048

Open
hyanwong opened this issue Nov 6, 2024 · 2 comments

hyanwong (Member) commented Nov 6, 2024

Here's an example giving a maximum value of infinity in the returned matrix:

import msprime
import numpy as np

L = 10e6  # 10 Mb
params = {"sequence_length": L, "population_size": 1e4, "recombination_rate": 1e-8, "random_seed": 6}
sweep_model = msprime.SweepGenicSelection(
    position=L / 2, start_frequency=0.0001, end_frequency=0.9999, s=0.25, dt=1e-6)
ts = msprime.sim_ancestry(20, model=[sweep_model, msprime.StandardCoalescent()], **params)
time_intervals = np.logspace(0, np.log10(ts.max_time), 20)
# time intervals need to go from 0 -> infinity
time_intervals = np.concatenate(([0], time_intervals, [np.inf]))
genome_intervals = np.linspace(0, L, 21)

rates = ts.pair_coalescence_rates(time_windows=time_intervals, windows=genome_intervals)
print(np.nanmax(rates))  # gives np.inf

If you change random_seed to 3, you get a maximum rate of 134217728.0, which also seems wrong. These large rates are all in the last time slice. Presumably we are hitting a division-by-zero or division-by-epsilon error.
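A quick way to confirm that the infinite values sit in the last time slice (this assumes the returned array has shape (num_genome_windows, num_time_windows)):

rows, cols = np.nonzero(np.isinf(rates))
print(set(cols))  # only the last time-window index appears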

hyanwong (Member, Author) commented Nov 6, 2024

As discussed previously, @nspope, I've assigned you to this; I hope that's OK.

nspope (Contributor) commented Nov 6, 2024

The issue is that the last time bin starts at the oldest node in the tree sequence. In this case the last bin contains a single node, and the corresponding rate estimator is effectively 1/(E[pair time | events > epoch start] - epoch start), which blows up because only floating-point error keeps the denominator away from 0. This should be easy enough to catch in practice.
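To make the blow-up concrete, here is a minimal sketch of the failure mode as described above (not tskit's actual implementation), reusing ts from the reproducer:

# When the last time bin starts at the oldest node, the conditional mean pair
# coalescence time within that bin equals the bin's start, so the denominator
# of the rate estimator is zero up to floating-point error.
epoch_start = ts.max_time     # last bin begins at the oldest node
mean_pair_time = ts.max_time  # all remaining pairs coalesce exactly there
denom = mean_pair_time - epoch_start  # exactly 0.0 here; tiny FP error in general
rate = np.divide(1.0, denom)          # -> inf (or a huge value when denom is epsilon)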

Changing time_intervals = np.logspace(0, np.log10(ts.max_time), 20) to time_intervals = np.logspace(0, np.log10(ts.max_time + 1), 20) gives something much more reasonable. Starting the last epoch at a fixed time just before the oldest node gives rate estimates that are bogus (i.e. entirely dependent on the choice of fixed timepoint), but at least numerically stable.
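For concreteness, here is that suggested tweak applied to the reproducer above (same variable names):

# Extend the top of the log-spaced grid just past the oldest node, so the
# last finite bin boundary exceeds ts.max_time.
time_intervals = np.logspace(0, np.log10(ts.max_time + 1), 20)
time_intervals = np.concatenate(([0], time_intervals, [np.inf]))
rates = ts.pair_coalescence_rates(time_windows=time_intervals, windows=genome_intervals)
print(np.nanmax(rates))  # now finite and of a plausible magnitude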

nspope added the bug label on Nov 6, 2024