
Quality assessment of imputation : r2, infoscore, K #111

@EliseGAY


Hi,
I recently posted this question in an old issue, but that issue may be too old to reopen, so I'm giving it a chance as a new issue (apologies for the duplication).

I would like to share our results in the hope that you could help us understand the patterns of our imputations.
A bit of context: we have 96 individuals of viviparous lizard (a wild population) sequenced with the haplotagging technique, which generates linked reads, at a sequencing depth of 0.5 to 1X. We tried to assess imputation quality by testing combinations of parameters: different gridWindowSize values (to speed things up, if possible) and different values of K. Since nGen does not have much impact, I won't present it. We then checked the info_score, the haplotype proportion patterns and the r2.

  1. First, we used the grid to speed up our imputation, but we are not sure to what extent it affects the results. We tested various grid sizes with a fixed K = 10. The info scores get much better as the grid size increases (top panel), but the grid has a huge impact on the haplotype frequency patterns (middle panel), and the r2 decreases as the grid size increases (bottom panel):

[Figures: info_score (top), haplotype frequency patterns (middle) and r2 (bottom) for the various grid sizes]

We are very curious about the impact of the grid size on the imputation, even though we probably won't use a grid in the end.
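For reference, the r2 we report is the standard squared Pearson correlation between the imputed allele dosages and the true genotypes at held-out high-confidence sites, computed per SNP across individuals (the input arrays below are hypothetical, not STITCH output as-is):

```python
import numpy as np

def dosage_r2(true_genotypes, imputed_dosages):
    """Per-SNP imputation r2: squared Pearson correlation between
    true genotypes (0/1/2 counts of the alternate allele) and the
    imputed allele dosages, across individuals at one site."""
    g = np.asarray(true_genotypes, dtype=float)
    d = np.asarray(imputed_dosages, dtype=float)
    r = np.corrcoef(g, d)[0, 1]
    return r ** 2
```

We then look at the distribution of these per-SNP values across sites, hence the large dispersion mentioned below.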

  2. We then chose a grid size of 1 kb in order to test the other parameters. We tested several values of K (10, 20, 30, 80) and assessed the info_score and the r2. In contrast to Nate's case (issue K optimization: r2 vs score #59), both the r2 (always with a very large dispersion) and the info_score increase continuously:

[Figures: info_score and r2 for K = 10, 20, 30, 80]

Based on those results, the pragmatic conclusion would be that the higher the r2 and info scores, the better the imputation, which would point to K = 80. On the other hand, we have also been told that the closer K gets to the number of individuals, the more the imputation will overfit the data and generate uncertainty.
A likely hypothesis is that, with our dataset, we are not able to properly assess imputation quality because of some kind of bias. Or maybe you have another way of looking at it to share with us?
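In case it helps the discussion: if the info_score here is the IMPUTE-style info measure (an assumption on our part, based on how it is usually defined), it quantifies the certainty of the posterior genotype probabilities rather than their accuracy, which might explain why it keeps rising with K even if the imputation overfits. A sketch of that measure for one biallelic SNP:

```python
import numpy as np

def impute_info(expected_dosage, expected_dosage_sq):
    """IMPUTE-style info score for one biallelic SNP (assumed form).

    expected_dosage:    e_i = E[g_i]   per individual, from posterior probs
    expected_dosage_sq: f_i = E[g_i^2] per individual
    """
    e = np.asarray(expected_dosage, dtype=float)
    f = np.asarray(expected_dosage_sq, dtype=float)
    n = e.size
    p = e.sum() / (2 * n)  # estimated allele frequency
    if p == 0.0 or p == 1.0:
        return 1.0  # monomorphic site: conventionally info = 1
    # Fully certain genotypes give f == e**2, hence info == 1;
    # posterior uncertainty inflates f - e**2 and lowers the score.
    return 1.0 - (f - e ** 2).sum() / (2 * n * p * (1 - p))
```

With fully certain genotype calls this returns 1 regardless of whether the calls are correct, so a rising info_score alone may not guarantee better imputation.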

Thanks a lot !
Have a good day

Elise
