Add tests for genotype imputation #2815

szhan · 2023-08-07T17:24:36Z

Description

Add toy examples for testing imputation. Results obtained by running BEAGLE 4.1 are stored for comparison.

Fixes #2802

PR Checklist:

Implement BEAGLE's interpolation-style imputation algorithm in Python.
Implement a simpler version of the BEAGLE algorithm using functionalities in _tskit (tsimpute).
Develop toy datasets for comparing BEAGLE and tskit-based imputation.
Record BEAGLE results (including parameters; forward, backward, and state prob. matrices; allele probabilities).
Implement functions allowing for convenient access to the toy datasets and BEAGLE results.
Do test runs to profile performance.

codecov · 2023-08-07T17:55:20Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.97%. Comparing base (8b1be4f) to head (24d7a49).

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #2815       +/-   ##
===========================================
- Coverage   89.79%   76.97%   -12.82%     
===========================================
  Files          30       30               
  Lines       30399    30399               
  Branches     5909     5643      -266     
===========================================
- Hits        27296    23400     -3896     
- Misses       1778     5614     +3836     
- Partials     1325     1385       +60

Flag	Coverage Δ
c-tests	`86.21% <ø> (ø)`
lwt-tests	`80.78% <ø> (ø)`
python-c-tests	`67.71% <ø> (ø)`
python-tests	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

see 13 files with indirect coverage changes

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8b1be4f...24d7a49. Read the comment docs.

jeromekelleher

Excellent!

And the $10M question - do we compute the same thing with tskit???

python/tests/test_imputation.py

szhan · 2023-08-08T08:54:57Z

I think the tricky is getting the normalization factors right.

jeromekelleher · 2023-08-08T11:51:42Z

So we might not get exactly the same values, but I guess would expect the output matrices to be proportional (since normalisation factors can be arbitrary)

jeromekelleher · 2023-08-09T09:21:48Z

I've updated this to get the tskit code to run @szhan, and to remove the dependence on pandas for CSV parsing (which would have been a pain to update all the requirements.txt files). We can do the same thing (more or less) with numpy.

szhan · 2023-08-09T12:27:01Z

Here is the first stab at the comparison.

Rows = sites
Columns = ref. haplotypes

Forward probability matrix
tskit

[[0.23333333 0.03333333 0.23333333 0.23333333 0.03333333 0.23333333]
 [0.24637681 0.00724638 0.24637681 0.24637681 0.00724638 0.24637681]
 [0.2384058  0.02318841 0.2384058  0.2384058  0.02318841 0.2384058 ]
 [0.15942246 0.18115508 0.15942246 0.15942246 0.18115508 0.15942246]
 [0.05073599 0.39852802 0.05073599 0.05073599 0.39852802 0.05073599]]

beagle

[[9.9990e-01 1.0000e-04 9.9990e-01 9.9990e-01 1.0000e-04 9.9990e-01]
 [1.6665e-01 1.7000e-05 1.6665e-01 1.6665e-01 1.7000e-05 1.6665e-01]
 [1.7000e-05 1.6665e-01 1.7000e-05 1.7000e-05 1.6665e-01 1.7000e-05]
 [1.7000e-05 1.6665e-01 1.7000e-05 1.7000e-05 1.6665e-01 1.7000e-05]]

Backward probability matrix
tskit

[[0.97527473 1.34615385 0.97527473 0.97527473 1.34615385 0.97527473]
 [0.88461538 8.84615385 0.88461538 0.88461538 8.84615385 0.88461538]
 [0.58974359 9.43589744 0.58974359 0.58974359 9.43589744 0.58974359]
 [0.38017094 2.09094017 0.38017094 0.38017094 2.09094017 0.38017094]
 [1.         1.         1.         1.         1.         1.        ]]

beagle

[[0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
 [0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
 [0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
 [0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]]

The forward values look pretty proportional to me. But I don't get why the values in the BEAGLE backward matrix is so uniform.

szhan · 2023-08-09T14:21:05Z

I suspect it has to do with my compiled copy of BEAGLE 4.1. BEAGLE's forward-backward algorithm only keeps track of the values for one site/marker in order to reduce memory requirements. For some reasons, I'm not seeing it being updated across the iterations.

szhan · 2023-08-09T14:23:53Z

Also, I'm seeing that BEAGLE is initialising the backward probability matrix to 1/n, where n = number of haplotypes in the reference panel.

See in LSHapBaum.java.

public HapAlleleProbs randomHapSample(int hap) {
        Arrays.fill(alleleProbs, 0f);
        int nMarkers = impData.nClusters();
        windowIndex = 0;
        arrayIndex = -1;
        setForwardValues(0, nMarkers, hap);
        Arrays.fill(bwdVal, 1.0f/n);
        setStateProbs(nMarkers-1, currentIndex());
        for (int m=nMarkers-2; m>=0; --m) {
            setBwdValue(m, hap);
            setStateProbs(m, previousIndex(hap));
        }
        setAlleleProbs(alleleProbs);
        return new LowMemHapAlleleProbs(refMarkers, impData.targetSamples(),
                hap, alleleProbs);
    }

See this line Arrays.fill(bwdVal, 1.0f/n);.

jeromekelleher · 2023-08-09T15:08:26Z

OK, I guess we'll need to figure this out. But, we should be able to check that the final imputed value (info scores?) are equal, easily enough?

szhan · 2023-08-09T15:29:23Z

There's also the matrix that combines the forward and backward values. I think BEAGLE gets the allele probabilities out of that.

szhan · 2023-08-09T20:30:14Z

I just discussed with Duncan. He spotted that the uniform backward values has to do with parametrization.

jeromekelleher

This looks really excellent @szhan! I'm not sure whether the Beagle code belongs here or in the tsimpute repo, but it's really good progress.

I guess ultimately what we want in tskit is some intermediate results that we compute using the Beagle algorithms that we can compare with what we compute in tskit. The actual job of imputation has a lot of moving parts, which we probably don't want to put together directly in tskit.

python/tests/test_beagle.py

szhan · 2023-08-31T09:21:20Z

I think we should just focus on the interpolated allele probabilities at all the imputed markers on query haplotypes. We can compare these allele probabilities from BEAGLE and our implementation here.

szhan · 2023-08-31T09:55:39Z

Just to summarise the outputs from BEAGLE we can probably use for comparison here:

Forward probability matrix at genotyped markers.
Backward probability matrix at genotyped markers.
HMM state probability matrix at genotyped markers.
Interpolated allele probabilities at imputed markers.

It is hard to replicate exactly what BEAGLE is doing, so at best we will be able to look at correlations between the values (as discussed above).

benjeffery · 2024-09-23T10:59:20Z

Hi @szhan - what's the plan with this work?

szhan · 2024-09-23T12:16:57Z

Will circle back to this. Please keep.

szhan force-pushed the test_imputation branch from 8a7bb97 to dfb4c54 Compare August 8, 2023 07:13

jeromekelleher reviewed Aug 8, 2023

View reviewed changes

python/tests/test_imputation.py Outdated Show resolved Hide resolved

szhan force-pushed the test_imputation branch 2 times, most recently from 200b6ab to b47db86 Compare August 8, 2023 13:52

szhan force-pushed the test_imputation branch 4 times, most recently from dc0d3e9 to b4628f8 Compare August 9, 2023 12:24

szhan force-pushed the test_imputation branch 6 times, most recently from 2dd1007 to 0395bf1 Compare August 27, 2023 10:22

szhan force-pushed the test_imputation branch from 6ca8371 to 5a5d711 Compare August 30, 2023 12:56

jeromekelleher reviewed Aug 31, 2023

View reviewed changes

python/tests/test_beagle.py Outdated Show resolved Hide resolved

python/tests/test_beagle.py Outdated Show resolved Hide resolved

szhan force-pushed the test_imputation branch from 62e8ed8 to ea1a574 Compare August 31, 2023 09:10

szhan force-pushed the test_imputation branch from 2511167 to edcb26e Compare August 31, 2023 09:53

szhan force-pushed the test_imputation branch 17 times, most recently from 9119d88 to fe70e44 Compare March 4, 2024 10:20

szhan force-pushed the test_imputation branch 8 times, most recently from cb323a0 to 9895525 Compare March 11, 2024 08:48

Add tests to check LS HMM of _tskit.lshmm compared to BEAGLE

24d7a49

szhan force-pushed the test_imputation branch from 1d4f241 to 24d7a49 Compare March 11, 2024 14:52

benjeffery added this to the Python 0.5.10 milestone Sep 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add tests for genotype imputation #2815

Add tests for genotype imputation #2815

szhan commented Aug 7, 2023 •

edited

Loading

codecov bot commented Aug 7, 2023 •

edited

Loading

jeromekelleher left a comment

szhan commented Aug 8, 2023

jeromekelleher commented Aug 8, 2023

jeromekelleher commented Aug 9, 2023

szhan commented Aug 9, 2023 •

edited

Loading

szhan commented Aug 9, 2023

szhan commented Aug 9, 2023

jeromekelleher commented Aug 9, 2023

szhan commented Aug 9, 2023

szhan commented Aug 9, 2023

jeromekelleher left a comment

szhan commented Aug 31, 2023

szhan commented Aug 31, 2023 •

edited

Loading

benjeffery commented Sep 23, 2024

szhan commented Sep 23, 2024

Add tests for genotype imputation #2815

Are you sure you want to change the base?

Add tests for genotype imputation #2815

Conversation

szhan commented Aug 7, 2023 • edited Loading

Description

PR Checklist:

codecov bot commented Aug 7, 2023 • edited Loading

Codecov Report

jeromekelleher left a comment

Choose a reason for hiding this comment

szhan commented Aug 8, 2023

jeromekelleher commented Aug 8, 2023

jeromekelleher commented Aug 9, 2023

szhan commented Aug 9, 2023 • edited Loading

szhan commented Aug 9, 2023

szhan commented Aug 9, 2023

jeromekelleher commented Aug 9, 2023

szhan commented Aug 9, 2023

szhan commented Aug 9, 2023

jeromekelleher left a comment

Choose a reason for hiding this comment

szhan commented Aug 31, 2023

szhan commented Aug 31, 2023 • edited Loading

benjeffery commented Sep 23, 2024

szhan commented Sep 23, 2024

szhan commented Aug 7, 2023 •

edited

Loading

codecov bot commented Aug 7, 2023 •

edited

Loading

szhan commented Aug 9, 2023 •

edited

Loading

szhan commented Aug 31, 2023 •

edited

Loading