Unusually High h2est Values: Exceeding 1 and Reaching up to 2 or 3 #420

PEWilliamZhou · 2023-04-12T20:01:52Z

Dear Author,

I have been using LDpred2-auto for my analyses, and I encountered an issue where many of the h2est values I obtained exceeded 1, with some converging around 2 or even 3. I have checked the chains and confirmed that they have indeed converged. Out of the 42 outcomes I analyzed, only one provided a reasonable h2 estimate (h2=0.67 and in my test set R2=0.75, this is a good result). For my analysis, I used the LD matrix provided by you and applied it to GWAS data from the Icelandic population. The selected outcomes should all have high heritability (R2 > 0.7). I used the QC process according to your guidance: "sd_ldref <- with(gwas_match, sqrt(2 * ImpMAF * (1 - ImpMAF)))
sd_ss <- with(gwas_match, 1 / sqrt(N * SE^2+beta^2))
fail_qc <-
sd_ss < (0.7 * sd_ldref) | sd_ss > (sd_ldref + 0.1) | sd_ss < 0.1 | sd_ldref < 0.05
"

I am wondering if you could provide any guidance or suggestions on how to address this issue, or if there is anything I might have overlooked in the process.

Thank you for your assistance.

privefl · 2023-04-13T06:53:29Z

Can you tell me more?
Are these continuous traits? Binary traits?
What is the sample size?
Can you show one of the QC plots comparing SDs?
What is the estimate you get from LD Score reg?

PEWilliamZhou · 2023-04-13T13:23:46Z

Thank you for your quick response!
These are all continuous traits, specifically proteomics traits. For instance, take the trait mentioned above (5810_25_TDGF1_Cripto), which has effective sample sizes (n_eff) of 34,527~35,449. The number of SNPs that passed QC is 636,782, while 750,760 did not pass (for most traits, more than 1,380,000 matched and over 1,000,000 passed QC, but they still have h2 >1). The LDSC results are as follows:
int 0.940522128
int_se 0.010463805
h2 1.620702647
h2_se 0.943200144
Here's the QC figure:

privefl · 2023-04-14T06:53:41Z

There is something going on with this QC plot; you should not filter all these variants.

Were the sumstats derived from a linear mixed model like BOLT-LMM?
Do you have a mix of both simple linear reg and mixed model maybe?

Could you color this plot by chromosome, and then by INFO score (if you have this info).

PEWilliamZhou · 2023-04-14T12:49:56Z

I used deCODE summary statistics from icelandic population (linear mixed model implemented in BOLT-LMM). I don't have INFO.

I also want to let you know that I tried using both ld <- c(ld, Matrix::colSums(corr_chr^2)) (calculated by selected SNP matrix) and the ld in your hm3+, and tried using your af_UKBB in map file instead of af from deCODE, and they all showed bad results with h2>1.
You mentioned that I should not filter all these variants. Can you tell me the correct way to perform QC?

I would like to share with you my code related to LDpred2-auto.
part of my code.txt

privefl · 2023-04-14T13:48:46Z

Could you provide the link where I can download these sumstats?

PEWilliamZhou · 2023-04-14T13:58:03Z

Sure! Here is the link to the summary statistics file:
list43.txt
I recommend starting with the first 6 lines. I could only obtain reasonable results for SIGLEC9_Siglec_9, while the other traits provided incorrect h2 estimates.
This is deCODE readme file.
proteomics_readme.txt

PEWilliamZhou · 2023-04-18T14:01:07Z

Could you provide the link where I can download these sumstats?

Dear Privefl,

Have you tried with the summary statistics? Please let me know if you found something, or if you need any further information.

privefl · 2023-04-19T06:51:33Z

I've tried on the first one, and have the same problem.

Given that these are mixed-model sumstats (which breaks some assumptions of polygenic score models), and that some effects are so large, I think it is quite difficult to estimate the heritability here.

If you can get normal linear regression summary statistics, I think it would help.

PEWilliamZhou · 2023-04-24T10:47:38Z

Thank you very much for your help!

If I were to continue using the mixed-model summary statistics and do not have a validation set available, do you have any recommendations or suggestions on how to proceed?

privefl · 2023-04-24T12:07:08Z

Depends on what you want to do:

if only compute PGS, maybe keep all variants (there is no clear outliers)
I'm not sure you can reliably estimate h2 from this (with any sumstats method whatsoever)

PEWilliamZhou · 2023-04-27T21:44:21Z

Dear author,
Thank you for your help.

I tried LDpred2-auto and your HM3+ on two other sumstats with linear model: FENLAND study and AGES study (selecting several proteomics traits that have h2 ~0.7). I still got h2 >1.

I think my code is correct, since I can get normal result with your "public-data3-sumstats.txt".

Could this be because the external LD reference is being used instead of the LD from the summary statistics samples?

My aim is to compute those PGS without validation set. If I cannot estimate correct h2, is there a way to get a reliable PGS?

privefl · 2023-04-28T06:16:46Z

Getting a good estimate of h2 is not a guarantee for getting a good PGS.
Similarly, a bad estimate does not necessarily mean that you get a bad PGS.
Prediction and inference are kind of two different things.
What are the range that you get?

Do you have links for the FENLAND / AGES sumstats?

PEWilliamZhou · 2023-04-28T12:13:35Z

a_trait.zip
For this traits, range is around 0.97.

https://drive.google.com/file/d/1lmnpNDpPiwliMvd2Bb33hPpzzVJkLFp9/view?usp=share_link
Its ranges are around 1.6 or around 0.001

privefl · 2023-04-28T13:18:17Z

What is the sample size for the second one?

PEWilliamZhou · 2023-04-28T13:20:42Z

What is the sample size for the second one?

5368

privefl · 2023-04-28T13:50:07Z

Using sd(y)=1, the QC plot is very good.

I'll run LDpred2-auto now.

privefl · 2023-04-28T14:08:46Z

LD score reg results:

       int     int_se         h2      h2_se 
1.17693343 0.01450483 2.83327279 2.82301985

Large effects and small N -> complicated

privefl · 2023-04-28T17:04:36Z

Indeed, all range are 1.60-1.61 and the heritability estimate is 3.1 [2.6-3.6].

If you look at the estimated r2 for each variant separately
r2 <- with(sumstats, beta^2 / (beta^2 + n_eff * beta_se^2))
there is one variant that explains 82.5% of the phenotypic variance.

I guess LDpred2 cannot cope with such large effects, especially with such a small GWAS sample size.
A common strategy in this case is to use this effect as fixed with its GWAS sample size (or use a few variants with COJO), and make a predictor with LDpred2 based on the residualized sumstats (you can get this from COJO maybe, or just consider all other variants that are not correlated with the top one, e.g. from different LD blocks).

bvilhjal · 2023-04-28T18:20:35Z

Running LDpred2 on COJO residualized summary stats is maybe not a very common idea, but a good idea in this situation. Using COJO you might have to account for a couple of variants. Also, when residualizing the GWAS sum stats it's a good idea to update even the distant (not in LD) effect sizes, as their relative effect sizes will increase when a large part of the phenotypic variance is explained. See e.g. https://www.medrxiv.org/content/10.1101/2022.11.09.22281216v1 as an example.

privefl · 2023-04-28T19:37:30Z

I don't think effect sizes would change, just their SE would be smaller.

bvilhjal · 2023-04-28T20:15:01Z

Yes, true. It becomes more significant, and therefore the z-score increases.

privefl · 2023-06-30T07:32:05Z

I am trying to reproduce this behavior in simulations, but this is a bit challenging.

xuxwen · 2023-06-30T08:30:00Z

a_trait.zip For this traits, range is around 0.97.

https://drive.google.com/file/d/1lmnpNDpPiwliMvd2Bb33hPpzzVJkLFp9/view?usp=share_link Its ranges are around 1.6 or around 0.001
Hello, I would like to know if this AGES data you provided is the QTL data of the protein, is this a protein data, I downloaded and looked at it and I don't know if it is a protein data, there are about 7.5 million SNP numbers in total.

PEWilliamZhou · 2023-06-30T08:37:23Z

Hello. Yes it is protein data. I forget which one it is but most proteomics data get abnormal h2.

xuxwen · 2023-06-30T08:40:34Z

Hello. Yes it is protein data. I forget which one it is but most proteomics data get abnormal h2.

We used the same data as you used for Iceland, and h2 had the same anomaly and periodic failure to converge. I did not find this AGES data you used, can you provide a link to the article.

privefl · 2023-06-30T12:49:56Z

An update on the simulations. I can definitely reproduce the very large LDSc h2 estimates.
But then I get normal h2 with LDpred2-auto in simulations.
I also get normal LDSc intercept.
Anything going on with the 1.177 estimate we get for the intercept? Pop structure or relatedness maybe?

bvilhjal · 2023-06-30T13:05:30Z

I think training LDpred2 on the Ferkingstad et al. (Nat Genet 2021) data (currently) has some challenges. First, these are BOLT-LMM summary statistics, which are different from standard least square or logistic regression marginal effect sizes. The BOLT-LMM sum stats might also be slightly inflated due to family structure (see e.g. Jiang et al., Nat Genet 2019). Second, the sample has rather strong sample ascertainment, which has been known to bias heritability estimates. According to the paper, the participants are heavily enriched to be cancer patients and likely with for other diagnoses as well. Third, assuming a point-Normal prior for some protein levels might simply be suboptimal?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unusually High h2est Values: Exceeding 1 and Reaching up to 2 or 3 #420

Unusually High h2est Values: Exceeding 1 and Reaching up to 2 or 3 #420

PEWilliamZhou commented Apr 12, 2023

privefl commented Apr 13, 2023

PEWilliamZhou commented Apr 13, 2023 •

edited

Loading

privefl commented Apr 14, 2023

PEWilliamZhou commented Apr 14, 2023

privefl commented Apr 14, 2023

PEWilliamZhou commented Apr 14, 2023

PEWilliamZhou commented Apr 18, 2023

privefl commented Apr 19, 2023

PEWilliamZhou commented Apr 24, 2023

privefl commented Apr 24, 2023

PEWilliamZhou commented Apr 27, 2023

privefl commented Apr 28, 2023

PEWilliamZhou commented Apr 28, 2023

privefl commented Apr 28, 2023

PEWilliamZhou commented Apr 28, 2023

privefl commented Apr 28, 2023

privefl commented Apr 28, 2023

privefl commented Apr 28, 2023 •

edited

Loading

bvilhjal commented Apr 28, 2023

privefl commented Apr 28, 2023

bvilhjal commented Apr 28, 2023

privefl commented Jun 30, 2023

xuxwen commented Jun 30, 2023

PEWilliamZhou commented Jun 30, 2023

xuxwen commented Jun 30, 2023

privefl commented Jun 30, 2023

bvilhjal commented Jun 30, 2023 •

edited

Loading

Unusually High h2est Values: Exceeding 1 and Reaching up to 2 or 3 #420

Unusually High h2est Values: Exceeding 1 and Reaching up to 2 or 3 #420

Comments

PEWilliamZhou commented Apr 12, 2023

privefl commented Apr 13, 2023

PEWilliamZhou commented Apr 13, 2023 • edited Loading

privefl commented Apr 14, 2023

PEWilliamZhou commented Apr 14, 2023

privefl commented Apr 14, 2023

PEWilliamZhou commented Apr 14, 2023

PEWilliamZhou commented Apr 18, 2023

privefl commented Apr 19, 2023

PEWilliamZhou commented Apr 24, 2023

privefl commented Apr 24, 2023

PEWilliamZhou commented Apr 27, 2023

privefl commented Apr 28, 2023

PEWilliamZhou commented Apr 28, 2023

privefl commented Apr 28, 2023

PEWilliamZhou commented Apr 28, 2023

privefl commented Apr 28, 2023

privefl commented Apr 28, 2023

privefl commented Apr 28, 2023 • edited Loading

bvilhjal commented Apr 28, 2023

privefl commented Apr 28, 2023

bvilhjal commented Apr 28, 2023

privefl commented Jun 30, 2023

xuxwen commented Jun 30, 2023

PEWilliamZhou commented Jun 30, 2023

xuxwen commented Jun 30, 2023

privefl commented Jun 30, 2023

bvilhjal commented Jun 30, 2023 • edited Loading

PEWilliamZhou commented Apr 13, 2023 •

edited

Loading

privefl commented Apr 28, 2023 •

edited

Loading

bvilhjal commented Jun 30, 2023 •

edited

Loading