Better divmat site mode #2813

jeromekelleher · 2023-08-04T08:34:41Z

An improved mutation-by-mutation algorithm for site mode divmat. See #2779 for discussion

Benchmarks:

mode n       trees    muts    stats_api   divmat
site    10      4232    5532    0.00    0.00
site    100     9206    10564   0.25    0.02
site    250     10981   12369   1.85    0.08
site    500     12348   13814   7.75    0.29
site    1000    13347   14731   38.30   1.01

I tried running it for n=5000 also, but ran out of patience. I think it's clear that the new algorithm is a lot better for this regime anyway.

Code:

def lib_divergence_matrix(ts, mode="site"):
    upper_tri = [
        (i, j) for i in range(ts.num_samples) for j in range(ts.num_samples) if j > i
    ]
    divs = ts.divergence(
        [[j] for j in ts.samples()], indexes=upper_tri, mode=mode, span_normalise=False
    )
    out = np.zeros((ts.num_samples, ts.num_samples))
    for (i, j), div in zip(upper_tri, divs):
        out[i, j] = out[j, i] = div
    return out

def compare_perf():
    seed = 1234
    for n in [10, 100, 250, 500, 1000, 5000]:
        for mode in ["site"]:
            ts = msprime.sim_ancestry(
                n,
                ploidy=1,
                population_size=10**4,
                sequence_length=1e7,
                recombination_rate=1e-8,
                random_seed=seed,
            )
            ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=seed)
            before = time.perf_counter()
            D1 = lib_divergence_matrix(ts, mode=mode)
            time_lib = time.perf_counter() - before
            before = time.perf_counter()
            D2 = ts.divergence_matrix(mode=mode)
            time_nb = time.perf_counter() - before
            # assert np.allclose(D1, D2)

            print(
                mode,
                n,
                ts.num_trees,
                ts.num_mutations,
                f"{time_lib:.2f}",
                f"{time_nb:.2f}",
                sep="\t",
            )

jeromekelleher · 2023-08-04T08:44:11Z

I tried running this on the unified genealogy trees for chr5 q (hgdp_tgp_sgdp_high_cov_ancients_chr5_q.dated.trees). Details:

╔═════════════════════════╗
║TreeSequence             ║
╠═══════════════╤═════════╣
║Trees          │   235014║
╟───────────────┼─────────╢
║Sequence Length│181538261║
╟───────────────┼─────────╢
║Time Units     │  unknown║
╟───────────────┼─────────╢
║Sample Nodes   │     7524║
╟───────────────┼─────────╢
║Total Size     │  1.1 GiB║
╚═══════════════╧═════════╝
╔═══════════╤════════╤═════════╤════════════╗
║Table      │Rows    │Size     │Has Metadata║
╠═══════════╪════════╪═════════╪════════════╣
║Edges      │ 9921165│302.8 MiB│          No║
╟───────────┼────────┼─────────┼────────────╢
║Individuals│    3762│  1.1 MiB│         Yes║
╟───────────┼────────┼─────────┼────────────╢
║Migrations │       0│  8 Bytes│          No║
╟───────────┼────────┼─────────┼────────────╢
║Mutations  │11000819│388.2 MiB│          No║
╟───────────┼────────┼─────────┼────────────╢
║Nodes      │ 1217256│ 62.2 MiB│         Yes║
╟───────────┼────────┼─────────┼────────────╢
║Populations│     215│ 11.9 KiB│         Yes║
╟───────────┼────────┼─────────┼────────────╢
║Provenances│      18│ 12.6 KiB│          No║
╟───────────┼────────┼─────────┼────────────╢
║Sites      │ 4325444│337.4 MiB│         Yes║
╚═══════════╧════════╧═════════╧════════════╝

For one thread it needs 2.7G of RAM. Running with 8 threads needed 6G of RAM (reasonable, since we're creating 8 n*n matrices). The old stats API based code quickly ran out of memory on my machine (32G RAM).

Finished up after about 3 hours:

47812.58user 208.43system 3:01:39elapsed

jeromekelleher · 2023-08-04T09:44:45Z

Ugggghhhh, looks like there's some memory access problem. Digging.

jeromekelleher · 2023-08-04T13:05:28Z

OK, fixed the memory error, ready for review!

jeromekelleher · 2023-08-05T10:11:52Z

And here's an experimental variant-based method. Has basically the same performance (some wrinkles to be worked out with multi-allelics)

jeromekelleher · 2023-08-07T11:56:00Z

I've updated this to include the correct algorithm for dealing with multi-allelic sites, to give the same answer as the stats API version. This will be easy to implement in C, if we decide to make the change.

What do you think @petrelharp? Unless we can see a clear way forward for making an efficient version counting mutations on the paths, I'm not sure there's much point in creating confusion here by using the number-of-mutations version for site-mode.

jeromekelleher · 2023-08-11T16:13:05Z

Just had a chat with @petrelharp and we decided to change the definition to exactly match the current stats API.

jeromekelleher · 2023-08-14T14:01:07Z

I think this is basically ready to go @petrelharp (although I need to rerun the benchmarks)

python/tests/test_divmat.py

petrelharp · 2023-08-14T17:45:35Z

LGTM! Things I have noticed are pending are

test when first window is not at 0 (at least codecov thinks this isn't)
fix up the docstring (see the old version)

Also, I think we should add the span_normalise argument (and probably have it default to True, as usual). (Here's how we do it elsewhere.) The reason is that, if I've got this right, currently the code is computing the non-span-normalized version (ie just counting mutations);l however, "divergence" usually means "number of mutations divided by sequence length", i.e., span normalized. But we'll have need for the "just count mutations" version, so rather than do error-prone post-hoc multiplying by window lengths to get it out, we should just add the argument.

jeromekelleher · 2023-08-15T07:55:08Z

Great, thanks @petrelharp! Good spots on all counts. I think we can merge this much for now, and follow up with more PRs.

Looks like this version is a small bit slower than the last version (~20%) but I don't think that's worth worrying about.

I logged issues to track the remaining stuff I think:

Closes tskit-dev#2779

jeromekelleher · 2023-08-15T08:45:58Z

Manually merging because we're missing a formal "approval"

jeromekelleher force-pushed the better-divmat-site-mode branch from 73dc0cf to bba57c1 Compare August 4, 2023 09:11

jeromekelleher marked this pull request as ready for review August 4, 2023 09:11

jeromekelleher requested a review from petrelharp August 4, 2023 09:11

jeromekelleher force-pushed the better-divmat-site-mode branch from ebb68bc to 4945dd8 Compare August 4, 2023 13:05

jeromekelleher force-pushed the better-divmat-site-mode branch from 4945dd8 to d9810bd Compare August 5, 2023 10:10

jeromekelleher marked this pull request as draft August 5, 2023 10:11

jeromekelleher force-pushed the better-divmat-site-mode branch 2 times, most recently from 8c2861e to ea6a53f Compare August 14, 2023 14:00

jeromekelleher marked this pull request as ready for review August 14, 2023 14:00

jeromekelleher force-pushed the better-divmat-site-mode branch from ea6a53f to 86e2960 Compare August 14, 2023 15:07

petrelharp reviewed Aug 14, 2023

View reviewed changes

python/tests/test_divmat.py Outdated Show resolved Hide resolved

jeromekelleher mentioned this pull request Aug 15, 2023

Add span_normalise to divergence_matrix #2818

Closed

Site divmat based on genotype decoding

efc3313

Closes tskit-dev#2779

jeromekelleher force-pushed the better-divmat-site-mode branch from 86e2960 to efc3313 Compare August 15, 2023 07:55

jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Aug 15, 2023

jeromekelleher merged commit 7190c2d into tskit-dev:main Aug 15, 2023
18 of 19 checks passed

jeromekelleher deleted the better-divmat-site-mode branch August 15, 2023 08:46

mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Aug 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better divmat site mode #2813

Better divmat site mode #2813

jeromekelleher commented Aug 4, 2023 •

edited

Loading

jeromekelleher commented Aug 4, 2023 •

edited

Loading

jeromekelleher commented Aug 4, 2023

jeromekelleher commented Aug 4, 2023

jeromekelleher commented Aug 5, 2023

jeromekelleher commented Aug 7, 2023

jeromekelleher commented Aug 11, 2023

jeromekelleher commented Aug 14, 2023

petrelharp commented Aug 14, 2023

jeromekelleher commented Aug 15, 2023

jeromekelleher commented Aug 15, 2023

Better divmat site mode #2813

Better divmat site mode #2813

Conversation

jeromekelleher commented Aug 4, 2023 • edited Loading

jeromekelleher commented Aug 4, 2023 • edited Loading

jeromekelleher commented Aug 4, 2023

jeromekelleher commented Aug 4, 2023

jeromekelleher commented Aug 5, 2023

jeromekelleher commented Aug 7, 2023

jeromekelleher commented Aug 11, 2023

jeromekelleher commented Aug 14, 2023

petrelharp commented Aug 14, 2023

jeromekelleher commented Aug 15, 2023

jeromekelleher commented Aug 15, 2023

jeromekelleher commented Aug 4, 2023 •

edited

Loading

jeromekelleher commented Aug 4, 2023 •

edited

Loading