Unusually High h2est Values: Exceeding 1 and Reaching up to 2 or 3 #420
Comments
Can you tell me more? |
There is something going on with this QC plot; you should not filter all these variants. Were the sumstats derived from a linear mixed model like BOLT-LMM? Could you color this plot by chromosome, and then by INFO score (if you have this info). |
I used deCODE summary statistics from the Icelandic population (linear mixed model implemented in BOLT-LMM). I don't have INFO scores. I would like to share my LDpred2-auto code with you. |
Could you provide the link where I can download these sumstats? |
Sure! Here is the link to the summary statistics file: |
Dear Privefl, have you had a chance to try the summary statistics? Please let me know if you found anything or need any further information. |
I've tried on the first one, and have the same problem. Given that these are mixed-model sumstats (which breaks some assumptions of polygenic score models), and that some effects are so large, I think it is quite difficult to estimate the heritability here. If you can get normal linear regression summary statistics, I think it would help. |
Thank you very much for your help! If I continue using the mixed-model summary statistics and do not have a validation set available, do you have any recommendations on how to proceed? |
Depends on what you want to do:
|
Getting a good estimate of h2 is not a guarantee for getting a good PGS. Do you have links for the FENLAND / AGES sumstats? |
a_trait.zip https://drive.google.com/file/d/1lmnpNDpPiwliMvd2Bb33hPpzzVJkLFp9/view?usp=share_link |
What is the sample size for the second one? |
5368 |
LD score reg results:
Large effects and small N -> complicated |
Indeed, all If you look at the estimated r2 for each variant separately I guess LDpred2 cannot cope with such large effects, especially with such a small GWAS sample size. |
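A minimal sketch of how a per-variant r2 can be approximated from summary statistics, assuming simple linear regression (which mixed-model summary statistics only approximate). The toy effect sizes and standard errors below are hypothetical and not taken from the shared files; only the sample size of 5368 comes from this thread.

```r
# Hypothetical toy summary statistics (illustrative values only).
sumstats <- data.frame(
  beta    = c(0.02, 0.50, -1.20),
  beta_se = c(0.02, 0.02,  0.02),
  n_eff   = 5368  # sample size quoted in this thread
)

# For a simple linear regression, t^2 = r^2 * (n - 2) / (1 - r^2),
# which rearranges to r^2 = t^2 / (t^2 + n - 2).
z  <- with(sumstats, beta / beta_se)
r2 <- z^2 / (z^2 + sumstats$n_eff - 2)

print(round(r2, 4))
```

With a small N, a single variant with a very large z-score can account for a substantial share of the phenotypic variance on its own, which is the situation described above.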
Running LDpred2 on COJO-residualized summary statistics is maybe not a very common idea, but it is a good one in this situation. Using COJO, you might have to account for a couple of variants. Also, when residualizing the GWAS summary statistics, it's a good idea to update even the distant (not in LD) effect sizes, as their relative effect sizes will increase when a large part of the phenotypic variance is explained. See e.g. https://www.medrxiv.org/content/10.1101/2022.11.09.22281216v1 as an example. |
I don't think effect sizes would change, just their SE would be smaller. |
Yes, true. It becomes more significant, and therefore the z-score increases. |
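The point made in the last two comments can be illustrated with a small simulation. This is only a sketch on individual-level toy data using `lm()`, as a stand-in for what COJO does from summary statistics and LD; the allele frequencies, effect sizes, and sample size are arbitrary assumptions.

```r
set.seed(1)
n  <- 5000
g1 <- rbinom(n, 2, 0.3)  # variant with a very large effect
g2 <- rbinom(n, 2, 0.3)  # independent variant (not in LD with g1), small effect
y  <- 1.0 * g1 + 0.05 * g2 + rnorm(n, sd = 0.5)

# Marginal estimate for g2, then the estimate after conditioning on g1.
marginal    <- summary(lm(y ~ g2))$coefficients["g2", ]
conditional <- summary(lm(y ~ g2 + g1))$coefficients["g2", ]

print(rbind(marginal, conditional)[, c("Estimate", "Std. Error", "t value")])
```

Because g1 and g2 are simulated independently, the g2 effect estimate stays roughly the same in both fits, while its standard error shrinks once the variance explained by g1 is removed from the residual.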
I am trying to reproduce this behavior in simulations, but this is a bit challenging. |
|
Hello. Yes, it is protein data. I forget which protein it is, but most proteomics data give abnormal h2 estimates. |
We used the same Icelandic data as you did, and h2 showed the same anomaly, along with periodic failures to converge. I could not find the AGES data you used; could you provide a link to the article? |
An update on the simulations: I can definitely reproduce the very large LDSC h2 estimates. |
I think training LDpred2 on the Ferkingstad et al. (Nat Genet 2021) data (currently) has some challenges. First, these are BOLT-LMM summary statistics, which differ from standard least-squares or logistic regression marginal effect sizes. The BOLT-LMM summary statistics might also be slightly inflated due to family structure (see e.g. Jiang et al., Nat Genet 2019). Second, the sample is rather strongly ascertained, which is known to bias heritability estimates. According to the paper, the participants are heavily enriched for cancer patients, and likely for other diagnoses as well. Third, assuming a point-Normal prior for some protein levels might simply be suboptimal? |
Dear Author,
I have been using LDpred2-auto for my analyses, and I encountered an issue where many of the h2est values I obtained exceeded 1, with some converging around 2 or even 3. I have checked the chains and confirmed that they have indeed converged. Out of the 42 outcomes I analyzed, only one provided a reasonable h2 estimate (h2 = 0.67, with R2 = 0.75 in my test set, which is a good result). For my analysis, I used the LD matrix you provide and applied it to GWAS data from the Icelandic population. The selected outcomes should all have high heritability (R2 > 0.7). I followed the QC procedure from your guidance:
"sd_ldref <- with(gwas_match, sqrt(2 * ImpMAF * (1 - ImpMAF)))
sd_ss <- with(gwas_match, 1 / sqrt(N * SE^2 + beta^2))
fail_qc <-
  sd_ss < (0.7 * sd_ldref) | sd_ss > (sd_ldref + 0.1) | sd_ss < 0.1 | sd_ldref < 0.05
"
I am wondering if you could provide any guidance or suggestions on how to address this issue, or if there is anything I might have overlooked in the process.
Thank you for your assistance.
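For reference, a minimal sketch of the same QC run on simulated data, assuming ordinary least-squares summary statistics and a standardized phenotype (so the numerator of `sd_ss` is 1). The genotypes, allele frequencies, and sample size are made up for illustration; only the QC expressions themselves are the ones quoted above.

```r
set.seed(42)
n   <- 2000
m   <- 50
maf <- runif(m, 0.05, 0.5)
G   <- sapply(maf, function(p) rbinom(n, 2, p))  # n x m genotype matrix
y   <- as.vector(scale(0.3 * G[, 1] + rnorm(n))) # standardized phenotype

# Marginal OLS effect size and standard error for each variant.
stats <- t(apply(G, 2, function(g) summary(lm(y ~ g))$coefficients["g", 1:2]))
gwas_match <- data.frame(ImpMAF = maf, N = n, beta = stats[, 1], SE = stats[, 2])

# Same QC expressions as in the post above.
sd_ldref <- with(gwas_match, sqrt(2 * ImpMAF * (1 - ImpMAF)))
sd_ss    <- with(gwas_match, 1 / sqrt(N * SE^2 + beta^2))
fail_qc  <- sd_ss < (0.7 * sd_ldref) | sd_ss > (sd_ldref + 0.1) |
  sd_ss < 0.1 | sd_ldref < 0.05

print(table(fail_qc))
```

Under these assumptions `sd_ss` tracks `sd_ldref` closely, even for the variant with the larger simulated effect, so variants failing this check in real data point to a mismatch between the summary statistics and the assumptions behind the formula (for example the regression model, the sample size used, or the phenotype scale).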