Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

should post-SLiM simplify have keep_input_roots=True? #1579

Open
petrelharp opened this issue Jul 25, 2024 · 9 comments
Open

should post-SLiM simplify have keep_input_roots=True? #1579

petrelharp opened this issue Jul 25, 2024 · 9 comments

Comments

@petrelharp
Copy link
Contributor

After we run SLiM, we simplify down to the requested samples, here:
https://github.com/popsim-consortium/stdpopsim/blob/main/stdpopsim/slim_engine.py#L1764

We do not pass in the keep_input_roots argument. This means that:

  • any neutral substitutions that happened during the SLiM portion of the simulation will not be present (as these are added by msprime post-simplification)
  • the tree sequence is maybe less good for some downstream applications? this is not so much of an issue because we already recapitate, so I can't actually think of any concrete applications really

Numbers of substitions are important to know sometimes, but these would be substitutions relative to the time we started running SLiM from which might not be the right reference time. So, it's not so clear to me.

Cons: this will make the resulting tree sequence bigger, and confuse people (since now the "roots" of the resulting trees won't actually be the roots-where-everything-coalesces).

Given the last point I'm inclined to say "no", or maybe make it optional.

@andrewkern
Copy link
Member

for context: @silastittes and I ran in to this trying to debug DFE inference results which required the number of neutral and non-neutral substitutions

@silastittes
Copy link
Contributor

silastittes commented Jul 25, 2024

This is for my own understanding, but does that make the root length = total sim time minus each tree's TMRCA?

@nspope
Copy link
Collaborator

nspope commented Jul 26, 2024

This is for my own understanding, but does that make the root length = total sim time minus each tree's TMRCA?

Yup, that should be the case. So you could retroactively calculate the neutral substitution rate given the total simulation time + simplified trees. This would be something like,

exp_neutral_subs = 0.0
for t in ts.trees():
    if t.num_edges > 0:
       exp_neutral_subs += max(0, (slim_simulation_time - t.root) * t.span * neutral_mutation_rate)

though this would need to be modified slightly if the neutral mutation rate varied across the sequence, I guess. Assuming that the mutation rates are in an msprime.RateMap, you'd replace t.span * neutral_mutation_rate with neutral_ratemap.get_cumulative_mass(t.interval.right) - neutral_ratemap.get_cumulative_mass(t.interval.left)

@nspope
Copy link
Collaborator

nspope commented Jul 26, 2024

Given the last point I'm inclined to say "no"

That's my vote, too -- for one-off debugging it seems easier to modify a fork to get whatever's needed, rather than introduce another option + even more SLiM tests ...

@andrewkern
Copy link
Member

it's probably easiest for @silastittes and I to just calculate the expected value within the analysis2 pipeline presently, but I do think keep_input_roots would be a good option to have.

@nspope
Copy link
Collaborator

nspope commented Jul 30, 2024

Now that I understand the application, I think it's a good idea to be able to get number of subs on branch to an arbitrarily distant outgroup somehow. But, I'm not sure I think that a keep_input_roots argument is the way to do this, because it couples SLiM burnin and the divergence time of the outgroup. For example, if there's selection, then we'd want some burnin before the separation with the outgroup because otherwise the stochastic process generating substitutions will take some time to reach stationarity.

The ideal way would be to have the outgroup as a population in the model. But, that's not possible to enforce generally and seems really restrictive-- the desired divergence time to the outgroup might change depending on the application. An alternative would be to have an option to add an outgroup at an arbitrary time (e.g. a population with no migration to the other populations, a fixed size, and a divergence time that exceeds any of the demographic events in the model).

So for example,

model = HomSap.get_demographic_model("whatever")
model.add_outgroup(name="some_hominid", time=long_long_ago, size=not_very_big)
samples = {..., "some_hominid" : 1}
engine.simulate(demographic_model=model, samples=samples, ...)

The downside is that this increases computational overhead? But that could be largely avoided by setting the outgroup population size to 1.

@andrewkern
Copy link
Member

this is probably the way to go, but would need a pretty major refactor for the code. I think the thing to do is to shelve this for the coming paper/release and revisit at a later time.

@nspope
Copy link
Collaborator

nspope commented Jul 30, 2024

I think we'd just need to add a method to DemographicModel which wouldn't have that big a footprint, but it'd definitely require a lot of testing and back-and-forth (and I'm sure there's issues I'm not thinking of); so yeah not going to happen quickly.

@silastittes
Copy link
Contributor

My naive (and almost certainly wrong) take on this is that the arbitrary decision of where to place the root time is all that matters. The length of that branch is what dictates the number of substitutions right? Does there need to be any information about the outgroup samples at all for this purpose? I agree that being forced to use the slim burn in time doesn’t make sense, and is demonstrably too short in some cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants