Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests for genotype imputation #2815

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

szhan
Copy link
Member

@szhan szhan commented Aug 7, 2023

Description

Add toy examples for testing imputation. Results obtained by running BEAGLE 4.1 are stored for comparison.

Fixes #2802

PR Checklist:

  • Implement BEAGLE's interpolation-style imputation algorithm in Python.
  • Implement a simpler version of the BEAGLE algorithm using functionalities in _tskit (tsimpute).
  • Develop toy datasets for comparing BEAGLE and tskit-based imputation.
  • Record BEAGLE results (including parameters; forward, backward, and state prob. matrices; allele probabilities).
  • Implement functions allowing for convenient access to the toy datasets and BEAGLE results.
  • Do test runs to profile performance.

@codecov
Copy link

codecov bot commented Aug 7, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.97%. Comparing base (8b1be4f) to head (24d7a49).

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #2815       +/-   ##
===========================================
- Coverage   89.79%   76.97%   -12.82%     
===========================================
  Files          30       30               
  Lines       30399    30399               
  Branches     5909     5643      -266     
===========================================
- Hits        27296    23400     -3896     
- Misses       1778     5614     +3836     
- Partials     1325     1385       +60     
Flag Coverage Δ
c-tests 86.21% <ø> (ø)
lwt-tests 80.78% <ø> (ø)
python-c-tests 67.71% <ø> (ø)
python-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

see 13 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8b1be4f...24d7a49. Read the comment docs.

@szhan szhan force-pushed the test_imputation branch from 8a7bb97 to dfb4c54 Compare August 8, 2023 07:13
Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent!

And the $10M question - do we compute the same thing with tskit???

python/tests/test_imputation.py Outdated Show resolved Hide resolved
@szhan
Copy link
Member Author

szhan commented Aug 8, 2023

I think the tricky is getting the normalization factors right.

@jeromekelleher
Copy link
Member

So we might not get exactly the same values, but I guess would expect the output matrices to be proportional (since normalisation factors can be arbitrary)

@szhan szhan force-pushed the test_imputation branch 2 times, most recently from 200b6ab to b47db86 Compare August 8, 2023 13:52
@jeromekelleher
Copy link
Member

I've updated this to get the tskit code to run @szhan, and to remove the dependence on pandas for CSV parsing (which would have been a pain to update all the requirements.txt files). We can do the same thing (more or less) with numpy.

@szhan szhan force-pushed the test_imputation branch 4 times, most recently from dc0d3e9 to b4628f8 Compare August 9, 2023 12:24
@szhan
Copy link
Member Author

szhan commented Aug 9, 2023

Here is the first stab at the comparison.

Rows = sites
Columns = ref. haplotypes

Forward probability matrix
tskit

[[0.23333333 0.03333333 0.23333333 0.23333333 0.03333333 0.23333333]
 [0.24637681 0.00724638 0.24637681 0.24637681 0.00724638 0.24637681]
 [0.2384058  0.02318841 0.2384058  0.2384058  0.02318841 0.2384058 ]
 [0.15942246 0.18115508 0.15942246 0.15942246 0.18115508 0.15942246]
 [0.05073599 0.39852802 0.05073599 0.05073599 0.39852802 0.05073599]]

beagle

[[9.9990e-01 1.0000e-04 9.9990e-01 9.9990e-01 1.0000e-04 9.9990e-01]
 [1.6665e-01 1.7000e-05 1.6665e-01 1.6665e-01 1.7000e-05 1.6665e-01]
 [1.7000e-05 1.6665e-01 1.7000e-05 1.7000e-05 1.6665e-01 1.7000e-05]
 [1.7000e-05 1.6665e-01 1.7000e-05 1.7000e-05 1.6665e-01 1.7000e-05]]

Backward probability matrix
tskit

[[0.97527473 1.34615385 0.97527473 0.97527473 1.34615385 0.97527473]
 [0.88461538 8.84615385 0.88461538 0.88461538 8.84615385 0.88461538]
 [0.58974359 9.43589744 0.58974359 0.58974359 9.43589744 0.58974359]
 [0.38017094 2.09094017 0.38017094 0.38017094 2.09094017 0.38017094]
 [1.         1.         1.         1.         1.         1.        ]]

beagle

[[0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
 [0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
 [0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
 [0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]]

The forward values look pretty proportional to me. But I don't get why the values in the BEAGLE backward matrix is so uniform.

@szhan
Copy link
Member Author

szhan commented Aug 9, 2023

I suspect it has to do with my compiled copy of BEAGLE 4.1. BEAGLE's forward-backward algorithm only keeps track of the values for one site/marker in order to reduce memory requirements. For some reasons, I'm not seeing it being updated across the iterations.

@szhan
Copy link
Member Author

szhan commented Aug 9, 2023

Also, I'm seeing that BEAGLE is initialising the backward probability matrix to 1/n, where n = number of haplotypes in the reference panel.

See in LSHapBaum.java.

public HapAlleleProbs randomHapSample(int hap) {
        Arrays.fill(alleleProbs, 0f);
        int nMarkers = impData.nClusters();
        windowIndex = 0;
        arrayIndex = -1;
        setForwardValues(0, nMarkers, hap);
        Arrays.fill(bwdVal, 1.0f/n);
        setStateProbs(nMarkers-1, currentIndex());
        for (int m=nMarkers-2; m>=0; --m) {
            setBwdValue(m, hap);
            setStateProbs(m, previousIndex(hap));
        }
        setAlleleProbs(alleleProbs);
        return new LowMemHapAlleleProbs(refMarkers, impData.targetSamples(),
                hap, alleleProbs);
    }

See this line Arrays.fill(bwdVal, 1.0f/n);.

@jeromekelleher
Copy link
Member

OK, I guess we'll need to figure this out. But, we should be able to check that the final imputed value (info scores?) are equal, easily enough?

@szhan
Copy link
Member Author

szhan commented Aug 9, 2023

There's also the matrix that combines the forward and backward values. I think BEAGLE gets the allele probabilities out of that.

@szhan
Copy link
Member Author

szhan commented Aug 9, 2023

I just discussed with Duncan. He spotted that the uniform backward values has to do with parametrization.

@szhan szhan force-pushed the test_imputation branch 6 times, most recently from 2dd1007 to 0395bf1 Compare August 27, 2023 10:22
Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really excellent @szhan! I'm not sure whether the Beagle code belongs here or in the tsimpute repo, but it's really good progress.

I guess ultimately what we want in tskit is some intermediate results that we compute using the Beagle algorithms that we can compare with what we compute in tskit. The actual job of imputation has a lot of moving parts, which we probably don't want to put together directly in tskit.

python/tests/test_beagle.py Outdated Show resolved Hide resolved
python/tests/test_beagle.py Outdated Show resolved Hide resolved
@szhan
Copy link
Member Author

szhan commented Aug 31, 2023

I think we should just focus on the interpolated allele probabilities at all the imputed markers on query haplotypes. We can compare these allele probabilities from BEAGLE and our implementation here.

@szhan
Copy link
Member Author

szhan commented Aug 31, 2023

Just to summarise the outputs from BEAGLE we can probably use for comparison here:

  • Forward probability matrix at genotyped markers.
  • Backward probability matrix at genotyped markers.
  • HMM state probability matrix at genotyped markers.
  • Interpolated allele probabilities at imputed markers.

It is hard to replicate exactly what BEAGLE is doing, so at best we will be able to look at correlations between the values (as discussed above).

@szhan szhan force-pushed the test_imputation branch 17 times, most recently from 9119d88 to fe70e44 Compare March 4, 2024 10:20
@szhan szhan force-pushed the test_imputation branch 8 times, most recently from cb323a0 to 9895525 Compare March 11, 2024 08:48
@szhan szhan force-pushed the test_imputation branch from 1d4f241 to 24d7a49 Compare March 11, 2024 14:52
@benjeffery
Copy link
Member

Hi @szhan - what's the plan with this work?

@szhan
Copy link
Member Author

szhan commented Sep 23, 2024

Will circle back to this. Please keep.

@benjeffery benjeffery added this to the Python 0.5.10 milestone Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add test_imputation.py file
3 participants