Illogical Lineage Counts in algorithms.py #2242

agushin101 · 2023-12-12T21:08:33Z

sims.ipynb.zip

Above is a Notebook describing the issue in more detail. The summary is that: when running basic selective sweeps using algorithms.py's framework, there is some illogical output. The sweep_pop_sizes array, which I've interpreted to represent the lineage counts for wild type and mutant respectively, does not always converge. When the --population-sizes flag is not specified, it never converges to 1 lineage. When it is specified, it sometimes converges and sometimes does not, getting stuck at 2 mutant lineages, even when recombination is set to 0. I was wondering if someone could clear up why this is happening.

Thanks!

jeromekelleher · 2023-12-13T10:31:16Z

Thanks @agushin101, I think @GertjanBisschop is taking a look. This is possibly related to #2217 - we didn't do a very good job of keeping the Python prototype up to date with the C implementation as bugs got identified and fixed, and there are inconsistencies between the two.

GertjanBisschop · 2023-12-14T15:00:39Z

Thanks @agushin101, this is insightful! This definitely exposes a couple of issues.
The impact of the sweep is clearly not as strong as it should be when the population size is specified. This can be rectified. I will post a draft pull-request that fixes two issues: 1. the expected duration of the sweep in algorithms.py, 2. modifications to the rate of coalescence vs recombination logic.
Do pay attention though to the recombination rate you specify, this is a per base per generation rate, not the population-scaled mutation rate. So in a way it makes sense that the number of lineages actually increases. The actual rate should probably be about 20 000 times smaller.
I am not sure yet whether this fixes all discrepancies between both implementations, but I am sure we'll get there in the end.

YuanboSong · 2023-12-14T18:05:27Z

We adjust the recombination rate by dividing it by the product of the population size and the segment length, i.e., 0.1 / (population size * segment length). However, there are no recombination events observed during the sweep. If everything runs as expected, what should the sweep population size be at the beginning? In some runs, I observed it became [0,0]. Shouldn't it be [0,1], considering there should be a beneficial ancestor at the start of the sweep? Thank you for your time.

GertjanBisschop · 2023-12-14T20:32:50Z

The simulation is a structured coalescent based on a pre-simulated allele frequency trajectory. This does not guarantee that at the end of the simulated frequency trajectory everything will have coalesced. Similarly, all samples might have coalesced before the end of the simulated trajectory. Simply by chance you might have drawn some samples that coalesced right end the end of the sweep (so in the beginning when looked backwards in time).

YuanboSong · 2023-12-15T00:14:37Z

Thank you for the insights. I understand the simulation is based on a structured coalescent model with a pre-simulated allele frequency trajectory. However, my concern is specifically about the initial conditions of the sweep. We deliberately start the sweep with only one beneficial allele, implying that all lineages should eventually coalesce into this single ancestor. If the simulation sometimes ends with more than one lineage remaining, doesn't this indicate a potential issue in the coalescent function or the way it's handling these initial conditions? It seems like the simulation should consistently converge to a single lineage in this scenario, given the starting premise of a single beneficial allele.

jeromekelleher · 2023-12-15T09:26:03Z

If the simulation sometimes ends with more than one lineage remaining, doesn't this indicate a potential issue in the coalescent function or the way it's handling these initial conditions? It seems like the simulation should consistently converge to a single lineage in this scenario, given the starting premise of a single beneficial allele.

This is something I've been confused about too, FWIW!

GertjanBisschop · 2023-12-15T10:02:24Z

Yup, I get your point, this is somewhat confusing. The original Braverman paper on which this implementation is based gives some insight though: "The deterministic frequency trajectory starts at frequency 1 - E and ends at E . KAPLAN et al. (1989) chose E= 5 / (2Nes) based on their calculation of the point at which the probability of extinction of the new mutant due to stochastic effects becomes approximately zero". This means that if the start and end frequency we provide become too close to 0 (and 1 respectively), then we ignore the stochastic effects of the sweep. This is where our approximation no longer holds and so we shouldn't interpret the start frequency as a strict bound.
From a more practical point of view, we do expect the distribution of the number of neutral lineages with the favoured allele to be very centred around 1 in case of a hard sweep, but it will never be always exactly one.

GertjanBisschop mentioned this issue Dec 14, 2023

Syncing sweeps implementation in algorithms.py #2247

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Illogical Lineage Counts in algorithms.py #2242

Illogical Lineage Counts in algorithms.py #2242

agushin101 commented Dec 12, 2023

jeromekelleher commented Dec 13, 2023

GertjanBisschop commented Dec 14, 2023

YuanboSong commented Dec 14, 2023

GertjanBisschop commented Dec 14, 2023

YuanboSong commented Dec 15, 2023

jeromekelleher commented Dec 15, 2023

GertjanBisschop commented Dec 15, 2023

Illogical Lineage Counts in algorithms.py #2242

Illogical Lineage Counts in algorithms.py #2242

Comments

agushin101 commented Dec 12, 2023

jeromekelleher commented Dec 13, 2023

GertjanBisschop commented Dec 14, 2023

YuanboSong commented Dec 14, 2023

GertjanBisschop commented Dec 14, 2023

YuanboSong commented Dec 15, 2023

jeromekelleher commented Dec 15, 2023

GertjanBisschop commented Dec 15, 2023