Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better divmat site mode #2813

Merged

Conversation

jeromekelleher
Copy link
Member

@jeromekelleher jeromekelleher commented Aug 4, 2023

An improved mutation-by-mutation algorithm for site mode divmat. See #2779 for discussion

Benchmarks:

mode n       trees    muts    stats_api   divmat
site    10      4232    5532    0.00    0.00
site    100     9206    10564   0.25    0.02
site    250     10981   12369   1.85    0.08
site    500     12348   13814   7.75    0.29
site    1000    13347   14731   38.30   1.01

I tried running it for n=5000 also, but ran out of patience. I think it's clear that the new algorithm is a lot better for this regime anyway.

Code:

def lib_divergence_matrix(ts, mode="site"):
    upper_tri = [
        (i, j) for i in range(ts.num_samples) for j in range(ts.num_samples) if j > i
    ]
    divs = ts.divergence(
        [[j] for j in ts.samples()], indexes=upper_tri, mode=mode, span_normalise=False
    )
    out = np.zeros((ts.num_samples, ts.num_samples))
    for (i, j), div in zip(upper_tri, divs):
        out[i, j] = out[j, i] = div
    return out

def compare_perf():
    seed = 1234
    for n in [10, 100, 250, 500, 1000, 5000]:
        for mode in ["site"]:
            ts = msprime.sim_ancestry(
                n,
                ploidy=1,
                population_size=10**4,
                sequence_length=1e7,
                recombination_rate=1e-8,
                random_seed=seed,
            )
            ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=seed)
            before = time.perf_counter()
            D1 = lib_divergence_matrix(ts, mode=mode)
            time_lib = time.perf_counter() - before
            before = time.perf_counter()
            D2 = ts.divergence_matrix(mode=mode)
            time_nb = time.perf_counter() - before
            # assert np.allclose(D1, D2)

            print(
                mode,
                n,
                ts.num_trees,
                ts.num_mutations,
                f"{time_lib:.2f}",
                f"{time_nb:.2f}",
                sep="\t",
            )

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Aug 4, 2023

I tried running this on the unified genealogy trees for chr5 q (hgdp_tgp_sgdp_high_cov_ancients_chr5_q.dated.trees). Details:

╔═════════════════════════╗
║TreeSequence             ║
╠═══════════════╤═════════╣
║Trees          │   235014║
╟───────────────┼─────────╢
║Sequence Length│181538261║
╟───────────────┼─────────╢
║Time Units     │  unknown║
╟───────────────┼─────────╢
║Sample Nodes   │     7524║
╟───────────────┼─────────╢
║Total Size     │  1.1 GiB║
╚═══════════════╧═════════╝
╔═══════════╤════════╤═════════╤════════════╗
║Table      │Rows    │Size     │Has Metadata║
╠═══════════╪════════╪═════════╪════════════╣
║Edges      │ 9921165│302.8 MiB│          No║
╟───────────┼────────┼─────────┼────────────╢
║Individuals│    3762│  1.1 MiB│         Yes║
╟───────────┼────────┼─────────┼────────────╢
║Migrations │       0│  8 Bytes│          No║
╟───────────┼────────┼─────────┼────────────╢
║Mutations  │11000819│388.2 MiB│          No║
╟───────────┼────────┼─────────┼────────────╢
║Nodes      │ 1217256│ 62.2 MiB│         Yes║
╟───────────┼────────┼─────────┼────────────╢
║Populations│     215│ 11.9 KiB│         Yes║
╟───────────┼────────┼─────────┼────────────╢
║Provenances│      18│ 12.6 KiB│          No║
╟───────────┼────────┼─────────┼────────────╢
║Sites      │ 4325444│337.4 MiB│         Yes║
╚═══════════╧════════╧═════════╧════════════╝

For one thread it needs 2.7G of RAM. Running with 8 threads needed 6G of RAM (reasonable, since we're creating 8 n*n matrices). The old stats API based code quickly ran out of memory on my machine (32G RAM).

Finished up after about 3 hours:

47812.58user 208.43system 3:01:39elapsed

@jeromekelleher jeromekelleher marked this pull request as ready for review August 4, 2023 09:11
@jeromekelleher
Copy link
Member Author

Ugggghhhh, looks like there's some memory access problem. Digging.

@jeromekelleher
Copy link
Member Author

OK, fixed the memory error, ready for review!

@jeromekelleher jeromekelleher marked this pull request as draft August 5, 2023 10:11
@jeromekelleher
Copy link
Member Author

And here's an experimental variant-based method. Has basically the same performance (some wrinkles to be worked out with multi-allelics)

@jeromekelleher
Copy link
Member Author

I've updated this to include the correct algorithm for dealing with multi-allelic sites, to give the same answer as the stats API version. This will be easy to implement in C, if we decide to make the change.

What do you think @petrelharp? Unless we can see a clear way forward for making an efficient version counting mutations on the paths, I'm not sure there's much point in creating confusion here by using the number-of-mutations version for site-mode.

@jeromekelleher
Copy link
Member Author

Just had a chat with @petrelharp and we decided to change the definition to exactly match the current stats API.

@jeromekelleher jeromekelleher force-pushed the better-divmat-site-mode branch 2 times, most recently from 8c2861e to ea6a53f Compare August 14, 2023 14:00
@jeromekelleher jeromekelleher marked this pull request as ready for review August 14, 2023 14:00
@jeromekelleher
Copy link
Member Author

I think this is basically ready to go @petrelharp (although I need to rerun the benchmarks)

@petrelharp
Copy link
Contributor

LGTM! Things I have noticed are pending are

  • test when first window is not at 0 (at least codecov thinks this isn't)
  • fix up the docstring (see the old version)

Also, I think we should add the span_normalise argument (and probably have it default to True, as usual). (Here's how we do it elsewhere.) The reason is that, if I've got this right, currently the code is computing the non-span-normalized version (ie just counting mutations);l however, "divergence" usually means "number of mutations divided by sequence length", i.e., span normalized. But we'll have need for the "just count mutations" version, so rather than do error-prone post-hoc multiplying by window lengths to get it out, we should just add the argument.

@jeromekelleher
Copy link
Member Author

Great, thanks @petrelharp! Good spots on all counts. I think we can merge this much for now, and follow up with more PRs.

Looks like this version is a small bit slower than the last version (~20%) but I don't think that's worth worrying about.

I logged issues to track the remaining stuff I think:

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Aug 15, 2023
@jeromekelleher
Copy link
Member Author

Manually merging because we're missing a formal "approval"

@jeromekelleher jeromekelleher merged commit 7190c2d into tskit-dev:main Aug 15, 2023
18 of 19 checks passed
@jeromekelleher jeromekelleher deleted the better-divmat-site-mode branch August 15, 2023 08:46
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Aug 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants