Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unusually High h2est Values: Exceeding 1 and Reaching up to 2 or 3 #420

Open
PEWilliamZhou opened this issue Apr 12, 2023 · 27 comments
Open

Comments

@PEWilliamZhou
Copy link

ldpred2_auto_chain_convergence
Dear Author,

I have been using LDpred2-auto for my analyses, and I encountered an issue where many of the h2est values I obtained exceeded 1, with some converging around 2 or even 3. I have checked the chains and confirmed that they have indeed converged. Out of the 42 outcomes I analyzed, only one provided a reasonable h2 estimate (h2=0.67 and in my test set R2=0.75, this is a good result). For my analysis, I used the LD matrix provided by you and applied it to GWAS data from the Icelandic population. The selected outcomes should all have high heritability (R2 > 0.7). I used the QC process according to your guidance: "sd_ldref <- with(gwas_match, sqrt(2 * ImpMAF * (1 - ImpMAF)))
sd_ss <- with(gwas_match, 1 / sqrt(N * SE^2+beta^2))
fail_qc <-
sd_ss < (0.7 * sd_ldref) | sd_ss > (sd_ldref + 0.1) | sd_ss < 0.1 | sd_ldref < 0.05
"

I am wondering if you could provide any guidance or suggestions on how to address this issue, or if there is anything I might have overlooked in the process.

Thank you for your assistance.

@privefl
Copy link
Owner

privefl commented Apr 13, 2023

Can you tell me more?
Are these continuous traits? Binary traits?
What is the sample size?
Can you show one of the QC plots comparing SDs?
What is the estimate you get from LD Score reg?

@PEWilliamZhou
Copy link
Author

PEWilliamZhou commented Apr 13, 2023

Thank you for your quick response!
These are all continuous traits, specifically proteomics traits. For instance, take the trait mentioned above (5810_25_TDGF1_Cripto), which has effective sample sizes (n_eff) of 34,527~35,449. The number of SNPs that passed QC is 636,782, while 750,760 did not pass (for most traits, more than 1,380,000 matched and over 1,000,000 passed QC, but they still have h2 >1). The LDSC results are as follows:
int 0.940522128
int_se 0.010463805
h2 1.620702647
h2_se 0.943200144
Here's the QC figure:
ldpred2_gwasqc_5810_25_TDGF1_Cripto

@privefl
Copy link
Owner

privefl commented Apr 14, 2023

There is something going on with this QC plot; you should not filter all these variants.

Were the sumstats derived from a linear mixed model like BOLT-LMM?
Do you have a mix of both simple linear reg and mixed model maybe?

Could you color this plot by chromosome, and then by INFO score (if you have this info).

@PEWilliamZhou
Copy link
Author

I used deCODE summary statistics from icelandic population (linear mixed model implemented in BOLT-LMM). I don't have INFO.
1681474925007
1681475321017
I also want to let you know that I tried using both ld <- c(ld, Matrix::colSums(corr_chr^2)) (calculated by selected SNP matrix) and the ld in your hm3+, and tried using your af_UKBB in map file instead of af from deCODE, and they all showed bad results with h2>1.
You mentioned that I should not filter all these variants. Can you tell me the correct way to perform QC?

I would like to share with you my code related to LDpred2-auto.
part of my code.txt

@privefl
Copy link
Owner

privefl commented Apr 14, 2023

Could you provide the link where I can download these sumstats?

@PEWilliamZhou
Copy link
Author

Sure! Here is the link to the summary statistics file:
list43.txt
I recommend starting with the first 6 lines. I could only obtain reasonable results for SIGLEC9_Siglec_9, while the other traits provided incorrect h2 estimates.
This is deCODE readme file.
proteomics_readme.txt

@PEWilliamZhou
Copy link
Author

Could you provide the link where I can download these sumstats?

Dear Privefl,

Have you tried with the summary statistics? Please let me know if you found something, or if you need any further information.

@privefl
Copy link
Owner

privefl commented Apr 19, 2023

I've tried on the first one, and have the same problem.

Given that these are mixed-model sumstats (which breaks some assumptions of polygenic score models), and that some effects are so large, I think it is quite difficult to estimate the heritability here.

If you can get normal linear regression summary statistics, I think it would help.

@PEWilliamZhou
Copy link
Author

Thank you very much for your help!

If I were to continue using the mixed-model summary statistics and do not have a validation set available, do you have any recommendations or suggestions on how to proceed?

@privefl
Copy link
Owner

privefl commented Apr 24, 2023

Depends on what you want to do:

  • if only compute PGS, maybe keep all variants (there is no clear outliers)
  • I'm not sure you can reliably estimate h2 from this (with any sumstats method whatsoever)

@PEWilliamZhou
Copy link
Author

Dear author,
Thank you for your help.

I tried LDpred2-auto and your HM3+ on two other sumstats with linear model: FENLAND study and AGES study (selecting several proteomics traits that have h2 ~0.7). I still got h2 >1.

I think my code is correct, since I can get normal result with your "public-data3-sumstats.txt".

Could this be because the external LD reference is being used instead of the LD from the summary statistics samples?

My aim is to compute those PGS without validation set. If I cannot estimate correct h2, is there a way to get a reliable PGS?
1682631827833

@privefl
Copy link
Owner

privefl commented Apr 28, 2023

Getting a good estimate of h2 is not a guarantee for getting a good PGS.
Similarly, a bad estimate does not necessarily mean that you get a bad PGS.
Prediction and inference are kind of two different things.
What are the range that you get?

Do you have links for the FENLAND / AGES sumstats?

@PEWilliamZhou
Copy link
Author

a_trait.zip
For this traits, range is around 0.97.

https://drive.google.com/file/d/1lmnpNDpPiwliMvd2Bb33hPpzzVJkLFp9/view?usp=share_link
Its ranges are around 1.6 or around 0.001

@privefl
Copy link
Owner

privefl commented Apr 28, 2023

What is the sample size for the second one?

@PEWilliamZhou
Copy link
Author

What is the sample size for the second one?

5368

@privefl
Copy link
Owner

privefl commented Apr 28, 2023

Using sd(y)=1, the QC plot is very good.
qcplot

I'll run LDpred2-auto now.

@privefl
Copy link
Owner

privefl commented Apr 28, 2023

LD score reg results:

       int     int_se         h2      h2_se 
1.17693343 0.01450483 2.83327279 2.82301985

Large effects and small N -> complicated

@privefl
Copy link
Owner

privefl commented Apr 28, 2023

Indeed, all range are 1.60-1.61 and the heritability estimate is 3.1 [2.6-3.6].

If you look at the estimated r2 for each variant separately
r2 <- with(sumstats, beta^2 / (beta^2 + n_eff * beta_se^2))
there is one variant that explains 82.5% of the phenotypic variance.

I guess LDpred2 cannot cope with such large effects, especially with such a small GWAS sample size.
A common strategy in this case is to use this effect as fixed with its GWAS sample size (or use a few variants with COJO), and make a predictor with LDpred2 based on the residualized sumstats (you can get this from COJO maybe, or just consider all other variants that are not correlated with the top one, e.g. from different LD blocks).

@bvilhjal
Copy link

Running LDpred2 on COJO residualized summary stats is maybe not a very common idea, but a good idea in this situation. Using COJO you might have to account for a couple of variants. Also, when residualizing the GWAS sum stats it's a good idea to update even the distant (not in LD) effect sizes, as their relative effect sizes will increase when a large part of the phenotypic variance is explained. See e.g. https://www.medrxiv.org/content/10.1101/2022.11.09.22281216v1 as an example.

@privefl
Copy link
Owner

privefl commented Apr 28, 2023

I don't think effect sizes would change, just their SE would be smaller.

@bvilhjal
Copy link

Yes, true. It becomes more significant, and therefore the z-score increases.

@privefl
Copy link
Owner

privefl commented Jun 30, 2023

I am trying to reproduce this behavior in simulations, but this is a bit challenging.

@xuxwen
Copy link

xuxwen commented Jun 30, 2023

a_trait.zip For this traits, range is around 0.97.

https://drive.google.com/file/d/1lmnpNDpPiwliMvd2Bb33hPpzzVJkLFp9/view?usp=share_link Its ranges are around 1.6 or around 0.001
Hello, I would like to know if this AGES data you provided is the QTL data of the protein, is this a protein data, I downloaded and looked at it and I don't know if it is a protein data, there are about 7.5 million SNP numbers in total.

@PEWilliamZhou
Copy link
Author

Hello. Yes it is protein data. I forget which one it is but most proteomics data get abnormal h2.

@xuxwen
Copy link

xuxwen commented Jun 30, 2023

Hello. Yes it is protein data. I forget which one it is but most proteomics data get abnormal h2.

We used the same data as you used for Iceland, and h2 had the same anomaly and periodic failure to converge. I did not find this AGES data you used, can you provide a link to the article.

@privefl
Copy link
Owner

privefl commented Jun 30, 2023

An update on the simulations. I can definitely reproduce the very large LDSc h2 estimates.
But then I get normal h2 with LDpred2-auto in simulations.
I also get normal LDSc intercept.
Anything going on with the 1.177 estimate we get for the intercept? Pop structure or relatedness maybe?

@bvilhjal
Copy link

bvilhjal commented Jun 30, 2023

I think training LDpred2 on the Ferkingstad et al. (Nat Genet 2021) data (currently) has some challenges. First, these are BOLT-LMM summary statistics, which are different from standard least square or logistic regression marginal effect sizes. The BOLT-LMM sum stats might also be slightly inflated due to family structure (see e.g. Jiang et al., Nat Genet 2019). Second, the sample has rather strong sample ascertainment, which has been known to bias heritability estimates. According to the paper, the participants are heavily enriched to be cancer patients and likely with for other diagnoses as well. Third, assuming a point-Normal prior for some protein levels might simply be suboptimal?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants