
Quality assessment of imputation : r2, infoscore, K #111

@EliseGAY


Hi,
I recently posted this question in an old issue, but that issue may be too old to reopen, so I'm giving it a chance as a new issue (apologies for the duplication).

I would like to share our results in the hope that you could help us understand the patterns of our imputations.
A bit of context: we have 96 individuals of viviparous lizard (a wild population) sequenced with the haplotagging technique, which generates linked reads, at a sequencing depth of 0.5 to 1X. We tried to assess imputation quality by testing combinations of parameters: different gridWindowSize values (to speed things up, if possible) and different values of K. Since nGen does not have much impact, I won't present it. We then checked the info_score, the haplotype proportion patterns and the r2.

  1. First, we used the grid to speed up our imputation, but we are not sure to what extent it affects the results. We tested various grid sizes with a fixed K = 10. The info scores get much better as the grid size increases (top panel), but the grid has a huge impact on the haplotype frequency patterns (middle panel), and the r2 decreases as the grid size increases (bottom panel):

[Figures: info_score (top), haplotype frequency patterns (middle) and r2 (bottom) for the various grid sizes]

We are very curious about the impact of the grid size on the imputation, even though we probably won't use a grid in the end.
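For reference, the r2 we report is the standard squared Pearson correlation between the imputed allele dosages and the true genotypes at held-out high-confidence sites, computed per SNP across individuals (the input arrays below are hypothetical, not STITCH output as-is):

```python
import numpy as np

def dosage_r2(true_genotypes, imputed_dosages):
    """Per-SNP imputation r2: squared Pearson correlation between
    true genotypes (0/1/2 counts of the alternate allele) and the
    imputed allele dosages, across individuals at one site."""
    g = np.asarray(true_genotypes, dtype=float)
    d = np.asarray(imputed_dosages, dtype=float)
    r = np.corrcoef(g, d)[0, 1]
    return r ** 2
```

We then look at the distribution of these per-SNP values across sites, hence the large dispersion mentioned below.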

  2. We then chose a grid size of 1 kb in order to test the other parameters. We tested several values of K (10, 20, 30, 80) and assessed the info_score and the r2. In contrast to Nate's case (issue K optimization: r2 vs score #59), both the r2 (always with a very large dispersion) and the info_score increase continuously:

[Figures: info_score and r2 for K = 10, 20, 30, 80]

Based on those results, the pragmatic conclusion would be that the higher the r2 and info scores, the better the imputation, which would point to K = 80. On the other hand, we have also been told that the closer K gets to the number of individuals, the more the imputation will overfit the data and generate uncertainty.
A likely hypothesis is that, with our dataset, we are not able to properly assess imputation quality because of some kind of bias. Or maybe you have another way of looking at it to share with us?
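In case it helps the discussion: if the info_score here is the IMPUTE-style info measure (an assumption on our part, based on how it is usually defined), it quantifies the certainty of the posterior genotype probabilities rather than their accuracy, which might explain why it keeps rising with K even if the imputation overfits. A sketch of that measure for one biallelic SNP:

```python
import numpy as np

def impute_info(expected_dosage, expected_dosage_sq):
    """IMPUTE-style info score for one biallelic SNP (assumed form).

    expected_dosage:    e_i = E[g_i]   per individual, from posterior probs
    expected_dosage_sq: f_i = E[g_i^2] per individual
    """
    e = np.asarray(expected_dosage, dtype=float)
    f = np.asarray(expected_dosage_sq, dtype=float)
    n = e.size
    p = e.sum() / (2 * n)  # estimated allele frequency
    if p == 0.0 or p == 1.0:
        return 1.0  # monomorphic site: conventionally info = 1
    # Fully certain genotypes give f == e**2, hence info == 1;
    # posterior uncertainty inflates f - e**2 and lowers the score.
    return 1.0 - (f - e ** 2).sum() / (2 * n * p * (1 - p))
```

With fully certain genotype calls this returns 1 regardless of whether the calls are correct, so a rising info_score alone may not guarantee better imputation.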

Thanks a lot !
Have a good day

Elise
