should post-SLiM simplify have `keep_input_roots=True`? #1579

petrelharp · 2024-07-25T23:07:24Z

After we run SLiM, we simplify down to the requested samples, here:
https://github.com/popsim-consortium/stdpopsim/blob/main/stdpopsim/slim_engine.py#L1764

We do not pass in the keep_input_roots argument. This means that:

any neutral substitutions that happened during the SLiM portion of the simulation will not be present (as these are added by msprime post-simplification)
the tree sequence is maybe less good for some downstream applications? this is not so much of an issue because we already recapitate, so I can't actually think of any concrete applications really

Numbers of substitions are important to know sometimes, but these would be substitutions relative to the time we started running SLiM from which might not be the right reference time. So, it's not so clear to me.

Cons: this will make the resulting tree sequence bigger, and confuse people (since now the "roots" of the resulting trees won't actually be the roots-where-everything-coalesces).

Given the last point I'm inclined to say "no", or maybe make it optional.

The text was updated successfully, but these errors were encountered:

andrewkern · 2024-07-25T23:09:54Z

for context: @silastittes and I ran in to this trying to debug DFE inference results which required the number of neutral and non-neutral substitutions

silastittes · 2024-07-25T23:15:19Z

This is for my own understanding, but does that make the root length = total sim time minus each tree's TMRCA?

nspope · 2024-07-26T00:06:15Z

This is for my own understanding, but does that make the root length = total sim time minus each tree's TMRCA?

Yup, that should be the case. So you could retroactively calculate the neutral substitution rate given the total simulation time + simplified trees. This would be something like,

exp_neutral_subs = 0.0
for t in ts.trees():
    if t.num_edges > 0:
       exp_neutral_subs += max(0, (slim_simulation_time - t.root) * t.span * neutral_mutation_rate)

though this would need to be modified slightly if the neutral mutation rate varied across the sequence, I guess. Assuming that the mutation rates are in an msprime.RateMap, you'd replace t.span * neutral_mutation_rate with neutral_ratemap.get_cumulative_mass(t.interval.right) - neutral_ratemap.get_cumulative_mass(t.interval.left)

nspope · 2024-07-26T00:09:18Z

Given the last point I'm inclined to say "no"

That's my vote, too -- for one-off debugging it seems easier to modify a fork to get whatever's needed, rather than introduce another option + even more SLiM tests ...

andrewkern · 2024-07-26T04:08:48Z

it's probably easiest for @silastittes and I to just calculate the expected value within the analysis2 pipeline presently, but I do think keep_input_roots would be a good option to have.

nspope · 2024-07-30T19:12:16Z

Now that I understand the application, I think it's a good idea to be able to get number of subs on branch to an arbitrarily distant outgroup somehow. But, I'm not sure I think that a keep_input_roots argument is the way to do this, because it couples SLiM burnin and the divergence time of the outgroup. For example, if there's selection, then we'd want some burnin before the separation with the outgroup because otherwise the stochastic process generating substitutions will take some time to reach stationarity.

The ideal way would be to have the outgroup as a population in the model. But, that's not possible to enforce generally and seems really restrictive-- the desired divergence time to the outgroup might change depending on the application. An alternative would be to have an option to add an outgroup at an arbitrary time (e.g. a population with no migration to the other populations, a fixed size, and a divergence time that exceeds any of the demographic events in the model).

So for example,

model = HomSap.get_demographic_model("whatever")
model.add_outgroup(name="some_hominid", time=long_long_ago, size=not_very_big)
samples = {..., "some_hominid" : 1}
engine.simulate(demographic_model=model, samples=samples, ...)

The downside is that this increases computational overhead? But that could be largely avoided by setting the outgroup population size to 1.

andrewkern · 2024-07-30T22:45:00Z

this is probably the way to go, but would need a pretty major refactor for the code. I think the thing to do is to shelve this for the coming paper/release and revisit at a later time.

nspope · 2024-07-30T22:58:39Z

I think we'd just need to add a method to DemographicModel which wouldn't have that big a footprint, but it'd definitely require a lot of testing and back-and-forth (and I'm sure there's issues I'm not thinking of); so yeah not going to happen quickly.

silastittes · 2024-07-30T23:06:11Z

My naive (and almost certainly wrong) take on this is that the arbitrary decision of where to place the root time is all that matters. The length of that branch is what dictates the number of substitutions right? Does there need to be any information about the outgroup samples at all for this purpose? I agree that being forced to use the slim burn in time doesn’t make sense, and is demonstrably too short in some cases.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

should post-SLiM simplify have `keep_input_roots=True`? #1579

should post-SLiM simplify have `keep_input_roots=True`? #1579

petrelharp commented Jul 25, 2024

andrewkern commented Jul 25, 2024

silastittes commented Jul 25, 2024 •

edited

Loading

nspope commented Jul 26, 2024 •

edited

Loading

nspope commented Jul 26, 2024

andrewkern commented Jul 26, 2024

nspope commented Jul 30, 2024 •

edited

Loading

andrewkern commented Jul 30, 2024

nspope commented Jul 30, 2024

silastittes commented Jul 30, 2024

should post-SLiM simplify have keep_input_roots=True? #1579

should post-SLiM simplify have keep_input_roots=True? #1579

Comments

petrelharp commented Jul 25, 2024

andrewkern commented Jul 25, 2024

silastittes commented Jul 25, 2024 • edited Loading

nspope commented Jul 26, 2024 • edited Loading

nspope commented Jul 26, 2024

andrewkern commented Jul 26, 2024

nspope commented Jul 30, 2024 • edited Loading

andrewkern commented Jul 30, 2024

nspope commented Jul 30, 2024

silastittes commented Jul 30, 2024

should post-SLiM simplify have `keep_input_roots=True`? #1579

should post-SLiM simplify have `keep_input_roots=True`? #1579

silastittes commented Jul 25, 2024 •

edited

Loading

nspope commented Jul 26, 2024 •

edited

Loading

nspope commented Jul 30, 2024 •

edited

Loading