Add `sim` top level function #1540

jeromekelleher · 2021-03-16T10:06:14Z

It would be handy to have a top-level function that does simple simulations of ancestry and mutations. How about we do something like

def sim(
    samples,
    *,
    sequence_length=None,
    recombination_rate=None,
    mutation_rate=None,
    demography=None,
    discrete_genome=None,
    random_seed=None,
):
    seeds = generate_seeds(random_seed)
    ts = sim_ancestry(
        samples,
        sequence_length=sequence_length,
        recombination_rate=recombination_rate,
        demography=demography,
        discrete_genome=discrete_genome,
        random_seed=seeds[0],
    )
    return sim_mutations(
        ts,
        rate=mutation_rate,
        random_seed=seeds[1],
        discrete_genome=discrete_genome,
    )

We can add things like ancestry_model and mutation_model etc as time goes by, but this much would satisfy 99% of the use cases I think (which is to do quick, simple simulations).

If you want to do complex stuff like specify the initial_state etc, then you need to use sim_ancestry directly.
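
The `generate_seeds` helper used in the sketch above is not defined anywhere in this thread. A minimal guess at what it might look like, assuming it only needs to turn one top-level seed into two reproducible child seeds in the range msprime accepts (1 to 2**32 - 1); the name, the `num_seeds` parameter and the use of the standard-library `random` module are all assumptions made for illustration:

```python
import random

# msprime requires seeds in the range 1..2**32 - 1.
MAX_SEED = 2**32 - 1


def generate_seeds(random_seed=None, num_seeds=2):
    """Hypothetical helper: derive child seeds from one top-level seed.

    With random_seed=None the child seeds come from system entropy,
    so the result is not reproducible.
    """
    rng = random.Random(random_seed)
    return [rng.randint(1, MAX_SEED) for _ in range(num_seeds)]


# The same top-level seed always gives the same pair of child seeds.
ancestry_seed, mutation_seed = generate_seeds(42)
```

Deriving both seeds from one value means the caller only has to pass `random_seed` once, which is the main convenience being asked for here.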

What do we think?

@hyanwong, you've been asking for this for a while - can you link to the issues involved please?

benjeffery · 2021-03-16T11:18:55Z

I'm not sure what the gain is here - if there were more shared arguments it would make sense. To me it feels like you're hiding the key concept of mutations being placed after simulation of ancestry to just lose one line of code?

jeromekelleher · 2021-03-16T12:40:16Z

Fair point. Just seems like a handy way of avoiding a bit of typing for common tasks, but maybe it would cause confusion.

petrelharp · 2021-03-16T12:51:23Z

I'm in favor of this - it'd be nice for the casual/new user to not have to understand the point about mutations coming after ancestry. But as Ben says, it's just one line of code...

grahamgower · 2021-03-16T13:07:30Z

> I'm in favor of this - it'd be nice for the casual/new user to not have to understand the point about mutations coming after ancestry.

I think forcing the user to understand this point is actually beneficial. The separation of sim_ancestry()/sim_mutations() makes it more explicit to users that the simulated tree(s) are necessarily independent of the mutations.

Plus, the zen of Python says:

> There should be one-- and preferably only one --obvious way to do it.

jeromekelleher · 2021-03-16T13:12:57Z

Well, two dissenting opinions is enough for me to not be motivated enough to do it for 1.0. Let's see how we go without it, and put it in later if it's something we miss having.

hyanwong · 2021-03-16T14:54:59Z

> Well, two dissenting opinions is enough for me to not be motivated enough to do it for 1.0. Let's see how we go without it, and put it in later if it's something we miss having.

Sure. We can add it later if others prefer. My main reason is that I often use msprime to generate a test tree sequence to mess about with, e.g. msprime.simulate(10, mutation_rate=1, random_seed=1). It's more of a hassle with two calls, and (more worryingly) if you want a deterministic result, it's easy to accidentally specify random_seed for only one of the functions.

But I can continue to use .simulate() for this. Normally when I'm doing it, I don't actually care what the mutation model is anyway, or where recombinations occur. So I'm happy with what others think best.

I think calling it sim_ancestry_and_mutations would partially answer @grahamgower's point. But that makes it long-winded to type, which isn't so good for my quick-and-dirty use case.

jeromekelleher · 2021-03-16T16:36:41Z

(@hyanwong - do you have links to the other threads where this has come up? I couldn't find them)

hyanwong · 2021-03-16T17:24:28Z

> (@hyanwong - do you have links to the other threads where this has come up? I couldn't find them)

I think it was almost entirely discussed on Slack. It's mentioned in passing at #1119 (comment) but I couldn't find other refs.

hyanwong · 2021-03-24T15:23:30Z

I'm in the middle of doing something that illustrates a use-case for a top-level simXXX function, so thought I should post it here:

import itertools
import time
import msprime
times = []
start = time.time()
for seed in itertools.count(1):
    ts = msprime.sim_mutations(
        msprime.sim_ancestry(1, sequence_length=100, population_size=1000, random_seed=seed),
        rate=1e-5, random_seed=seed)
    # rejection sampling: only accept those with a single mutation above the RH node
    if ts.num_mutations == 1 and ts.mutation(0).node == 1:
        times.append(ts.node(2).time)
    if seed % 1000 == 0:
        print(seed, len(times), "in", time.time()-start, "seconds")
    if len(times) > 1e4:
        break

You can see that repeating random_seed is a bit annoying. But perhaps I should just use simulate() here anyway, as combining sim_ancestry and sim_mutations results in a loop that runs at about half the speed of a simple simulate call (perhaps because the new version also creates an individual each time).

jeromekelleher · 2021-03-24T16:55:59Z

Interesting. I wouldn't see much point in actually setting seeds here, though. Running replicates via num_replicates might be a bit more efficient too, although this is such a tiny simulation that it's all going to be overheads anyway.

hyanwong · 2021-03-24T17:09:35Z

Yeah, thanks. It is a lot quicker using num_replicates. True about setting the seed, although it's nice to be able to make it replicable.

jeromekelleher · 2021-07-28T07:46:16Z

I think there's probably less reason for this if we have #1786?

jeromekelleher added this to the 1.1 milestone Apr 13, 2021

jeromekelleher mentioned this issue Apr 13, 2021

Give idiomatic example of replicates with mutations #1648

Closed

jeromekelleher mentioned this issue Jul 28, 2021

Accept iterators as input in sim_mutations #1786

Open

jeromekelleher modified the milestones: 1.2.0, 1.3.0 Feb 23, 2022
