Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move split_upwards() to tskit? #27

Open
gregorgorjanc opened this issue Oct 7, 2024 · 15 comments
Open

Move split_upwards() to tskit? #27

gregorgorjanc opened this issue Oct 7, 2024 · 15 comments

Comments

@gregorgorjanc
Copy link
Collaborator

We don't necessarily need split_upwards() anymore with the new matvec (tree-seq-based-GRM-times-a-vec) algorithm. However, it's good to have this functionality for testing etc.

I am working with @jeromekelleher on adding edge effect, edge value, and node value functionality to tstrait (see tskit-dev/tstrait#155). As part of testing this work, I am finding it handy to have the split_upwards(), which makes me wonder if split_upwards() should be in tskit (instead in tslmm).

Thoughts?

@jeromekelleher
Copy link
Collaborator

It sounds like something that should be in tskit, I guess the only reason we put it here is so we wouldn't need to do so much testing and worrying about corner cases

@jeromekelleher
Copy link
Collaborator

Should probably rename to split_pastwards or something also?

@petrelharp
Copy link
Collaborator

I think it's going to be totally used by people - it's shown up in three or four different people's work independently.

In testing, the thing to check is whether for each edge, the subtree below that edge doesn't change.

@jeromekelleher
Copy link
Collaborator

Do we have a tskit ready implementation?

@nspope
Copy link
Collaborator

nspope commented Oct 7, 2024

Seems like a good idea to me.

@jeromekelleher -- the closest we have is the numba version you wrote (using the tree position jitclass)

def split_upwards_numba(ts):

so this would need to be moved to C I suppose

@jeromekelleher
Copy link
Collaborator

That would be a fair bit of work to port to C

@nspope
Copy link
Collaborator

nspope commented Oct 7, 2024

I can port it over, will be next month at the earliest though. We could put in a pure-python method if it seems convenient enough to have around in the meantime

@jeromekelleher
Copy link
Collaborator

I'm in no hurry! Pure Python would be too slow to be useful.

@gregorgorjanc
Copy link
Collaborator Author

The existing numba version would not be considered for tskit because you aim for both C and Python API @jeromekelleher?

@jeromekelleher
Copy link
Collaborator

Tskit is quite strict about adding dependencies (for various reasons) and we don't currently have a dependency on numba. In practise, numba is quite a tricky dependency, so we would need very good reason indeed to add it to tskit.

@hanbin973
Copy link
Owner

Another issue is that the current numba implementation isn't faster than the python version.

The code:

def make_ts(num_samples, recombination_rate=1e-8, sequence_length=1e6, population_size=50_000):
    ts = msprime.sim_ancestry(
        samples=num_samples,
        recombination_rate=recombination_rate,
        sequence_length=sequence_length,
        model=msprime.StandardCoalescent(),
        population_size=population_size,
        ploidy=1
    )
    return ts

def measure_time(ts, num_iter):
    times = []
    for i in range(num_iter):
        t1 = time.time()
        operations.split_upwards(ts)
        t2 = time.time()
        times.append(t2-t1)
    return times

def measure_time_numba(ts, num_iter):
    times = []
    for i in range(num_iter):
        t1 = time.time()
        operations.split_upwards_numba(ts)
        t2 = time.time()
        times.append(t2-t1)
    return times

num_iter = 10
ns = np.linspace(2, 10002, 21).astype(int)
times = np.zeros((len(ns), num_iter))
times_numba = np.zeros((len(ns), num_iter))
for i, n in enumerate(ns):
    ts = make_ts(n)
    times[i,:] = measure_time(ts, num_iter)
    times_numba[i,:] = measure_time_numba(ts, num_iter)


The result:
image

@nspope
Copy link
Collaborator

nspope commented Oct 8, 2024

Hm, probably njit can't compile away all the python bits so it's spending most of its time in the python interpreter? Another argument in favor of getting it into C, I guess

@jeromekelleher
Copy link
Collaborator

This is going to be significant work to add to tskit - I'd vote for focusing on other bigger impact stuff.

@hanbin973
Copy link
Owner

hanbin973 commented Oct 8, 2024

I agree with Jerome. In small tree sequences for testing, python ver. is already pretty fast unless the population structure is complicated. For big sequences, we're not gonna use it anyway because of the memory demand.
The Q. is that would be helpful to have the python version on tskit? I can imagine using the function to demonstrate why sharing adjacent edges and not splitting them helps computation.

@petrelharp
Copy link
Collaborator

One possibility would be to write a short thing for the docs about the operation and give the python code there?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants