Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sim top level function #1540

Open
jeromekelleher opened this issue Mar 16, 2021 · 12 comments
Open

Add sim top level function #1540

jeromekelleher opened this issue Mar 16, 2021 · 12 comments
Milestone

Comments

@jeromekelleher
Copy link
Member

jeromekelleher commented Mar 16, 2021

It would be handy to have a top-level function that does simple simulations of ancestry and mutations. How about we do something like

def sim(
    samples, 
    *,  
    sequence_length=None, 
    recombination_rate=None, 
    mutation_rate=None, 
    demography=None,  
    discrete_genome=None, 
    random_seed=None
):
    seeds = generate_seeds(random_seed)
    ts = sim_ancestry(
        samples, sequence_length=sequence_length, 
        recombination_rate=recombination_rate, 
        demography=demography, 
        discrete_genome=discrete_genome,
        random_seed=seeds[0])
    return sim_mutations(ts, rate=mutation_rate, random_seed=seeds[1], discrete_genome=discrete_genome)

We can add things like ancestry_model and mutation_model etc as time goes by, but this much would satisfy 99% of the use cases I think (which is to do quick, simple simulations).

If you want to do complex stuff like specify the initial_state etc, then you need to use sim_ancestry directly.

What do we think?

@hyanwong, you've been asking for this for a while - can you link to the issues involved please?

@benjeffery
Copy link
Member

I'm not sure what the gain is here - if there were more shared arguments it would make sense. To me it feels like you're hiding the key concept of mutations being placed after simulation of ancestry to just lose one line of code?

@jeromekelleher
Copy link
Member Author

Fair point. Just seems like a handy way of avoiding a bit of typing for common tasks, but maybe it would cause confusion.

@petrelharp
Copy link
Contributor

I'm in favor of this - it'd be nice for the casual/new user to not have to understand the point about mutations coming after ancestry. But as Ben says, it's just one line of code...

@grahamgower
Copy link
Member

I'm in favor of this - it'd be nice for the casual/new user to not have to understand the point about mutations coming after ancestry.

I think forcing the user to understand this point is actually beneficial. The separation of sim_ancestry()/sim_mutations() makes it more explicit to users that the simulated tree(s) are necessarily independent of the mutations.

Plus, the zen of Python says:

There should be one-- and preferably only one --obvious way to do it.

@jeromekelleher
Copy link
Member Author

Well, two dissenting opinions is enough for me to not be motivated enough to do it for 1.0. Let's see how we go without it, and put it in later if it's something we miss having.

@hyanwong
Copy link
Member

Well, two dissenting opinions is enough for me to not be motivated enough to do it for 1.0. Let's see how we go without it, and put it in later if it's something we miss having.

Sure. We can add it later if others prefer. My main reason is that I often use msprime for generating a test tree sequence, e.g.
msprime.simulate(10, mutation_rate=1, random_seed=1) to mess about with. It's more of a hassle to do with 2 calls, and (more worryingly) if you want a deterministic result, it's easy to accidentally specify the random_seed value for only one of the functions.

But I can continue to use .simulate() for this. Normally when I'm doing it, I don't actually care what the mutation model is anyway, or where recombinations occur. So I'm happy with what others think best.

I think if you call it sim_ancestry_and_mutations it partially negates @grahamgower 's point. But it makes it long-winded to type, which isn't so good for my quick-and-dirty-use-case example.

@jeromekelleher
Copy link
Member Author

(@hyanwong - do you have links to the other threads where this has come up? I couldn't find them)

@hyanwong
Copy link
Member

(@hyanwong - do you have links to the other threads where this has come up? I couldn't find them)

I think it was almost entirely discussed on Slack. It's mentioned in passing at #1119 (comment) but I couldn't find other refs.

@hyanwong
Copy link
Member

hyanwong commented Mar 24, 2021

I'm in the middle of doing something that illustrates a use-case for a top-level simXXX function, so thought I should post it here:

import itertools
import msprime
times = []
start = time.time()
for seed in itertools.count(1):
    ts = msprime.sim_mutations(
        msprime.sim_ancestry(1, sequence_length=100, population_size=1000, random_seed=seed),
        rate=1e-5, random_seed=seed)
    # rejection sampling: only accept those with a single mutation above the RH node
    if ts.num_mutations == 1 and ts.mutation(0).node == 1:
        times.append(ts.node(2).time)
    if seed % 1000 == 0:
        print(seed, len(times), "in", time.time()-start, "seconds")
    if len(times) > 1e4:
        break

You can see the repeat of random_seed is a bit annoying. But perhaps I should just use simulate() here anyway, as the combining sim_ancestry and sim_mutations results in a loop that runs at about half the speed of a simple simulate call (perhaps that's because the new version is also creating an individual each time).

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Mar 24, 2021

Interesting. I wouldn't see much point in actually setting seeds here, though. Running replicates via num_replicates might be a bit more efficient too, although this is such a tiny simulation that it's all going to be overheads anyway.

@hyanwong
Copy link
Member

Yeah, thanks. It is a lot quicker using num_replicates. True about setting the seed, although it's nice to be able to make it replicable.

@jeromekelleher
Copy link
Member Author

I think there's probably less reason for this if we have #1786?

@jeromekelleher jeromekelleher modified the milestones: 1.2.0, 1.3.0 Feb 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants