Sumstats: Join up chromosome and position #137

deepchocolate · 2023-02-21T15:56:18Z

Is your feature request related to a problem? Please describe.
I think this is an issue with sumstats produced by meta-analyses using METAL. The headers of such files look like:

Thus chromosome and position are lacking in this file. I think both PRSice and LDpred2 (bigsnpr::snp_match) use this information for quality control. To simplify the process for the user, it would be nice if this information could be appended, I guess using the HRC data.

Describe the solution you'd like
Provide a tool to append this information to sumstats.

@espenhgn @ofrei What do you think is the best way of doing this?


ofrei · 2023-03-08T11:09:35Z

Hi @deepchocolate, and sorry for the slow response!

I agree this is an important issue that should be addressed one way or another, i.e. we need to augment METAL output with CHR and BP columns. It's also a reasonable strategy to do this based on the HRC reference. There are some limitations with this approach (e.g. some fraction of SNPs may have rsID values that are inconsistent between the sumstats and HRC; some of them might also be missing in HRC), but none of these are major showstoppers.
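For illustration, the reference-based approach and its limitations can be sketched as a left join on rsID that keeps track of unmatched SNPs. This is a minimal sketch, not the script from PR #143: the column names `SNP`, `CHR` and `BP`, the helper names, and the toy rows are all assumptions made for the example.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def annotate_with_positions(sumstats_rows, reference_rows, id_col="SNP"):
    """Left-join CHR and BP onto sumstats rows by rsID.

    Rows without a match in the reference are kept with empty CHR and BP,
    so the caller can decide whether to drop them.
    Returns (annotated_rows, unmatched_ids).
    """
    # Hypothetical reference layout: one row per rsID with CHR and BP columns.
    positions = {ref[id_col]: (ref["CHR"], ref["BP"]) for ref in reference_rows}
    annotated, unmatched = [], []
    for row in sumstats_rows:
        chrom, bp = positions.get(row[id_col], ("", ""))
        if not chrom:
            unmatched.append(row[id_col])
        annotated.append({**row, "CHR": chrom, "BP": bp})
    return annotated, unmatched


# Toy data (made up): rs999 is absent from the reference.
sumstats = read_tsv(
    "SNP\tA1\tA2\tBETA\n"
    "rs1\tA\tG\t0.01\n"
    "rs2\tC\tT\t-0.02\n"
    "rs999\tA\tC\t0.03\n"
)
reference = read_tsv("SNP\tCHR\tBP\nrs1\t1\t1000\nrs2\t2\t2000\n")

annotated, unmatched = annotate_with_positions(sumstats, reference)
print(f"{len(unmatched)} of {len(sumstats)} SNPs not found in reference")
```

Reporting the unmatched count makes the "missing in HRC" limitation visible instead of silently dropping rows.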

A more robust approach would be to apply the https://github.com/BioPsyk/cleansumstats/ pipeline to each sumstat prior to running METAL, and then also apply this pipeline to the output of the METAL script. The point of applying it prior to METAL is to harmonize rs# across all summary statistics being meta-analyzed. This can improve the meta-analysis because METAL merges SNPs based on rs# (also looking at A1 / A2), but ignores chromosomes and positions.
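To illustrate why harmonizing rs# first matters, here is a toy sketch of two studies that report the same variant under different rsIDs. It is not how cleansumstats is implemented: the position-to-rsID lookup, the column names, and the rsIDs are assumptions, and allele checks are left out for brevity.

```python
def harmonize_ids(rows, ref_id_by_pos):
    """Return copies of rows whose SNP is replaced by the reference rsID at CHR:BP.

    Rows at positions absent from the reference keep their original SNP value.
    """
    harmonized = []
    for row in rows:
        key = (row["CHR"], row["BP"])
        harmonized.append({**row, "SNP": ref_id_by_pos.get(key, row["SNP"])})
    return harmonized


def shared_ids(rows_a, rows_b):
    """IDs present in both studies, i.e. what an rsID-keyed merge could pair up."""
    return {r["SNP"] for r in rows_a} & {r["SNP"] for r in rows_b}


# Toy data (made up): same position, different rsIDs in the two studies.
ref_id_by_pos = {("1", "1000"): "rs1"}
study_a = [{"SNP": "rs1", "CHR": "1", "BP": "1000", "A1": "A", "A2": "G"}]
study_b = [{"SNP": "rs900", "CHR": "1", "BP": "1000", "A1": "A", "A2": "G"}]

before = shared_ids(study_a, study_b)
after = shared_ids(
    harmonize_ids(study_a, ref_id_by_pos),
    harmonize_ids(study_b, ref_id_by_pos),
)
print(f"shared before: {len(before)}, shared after: {len(after)}")
```

Before harmonization an rsID-keyed merge finds nothing in common between the two studies; afterwards the variant is shared.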

My suggestion would be to

extend the gwas.py script with an optional flag that would produce scripts for running the cleansumstats pipeline (presuming that the pipeline is already deployed as described in https://github.com/BioPsyk/cleansumstats/blob/master/README.md )
produce scripts for METAL analysis that apply the cleansumstats pipeline both before and after running METAL.

deepchocolate · 2023-03-08T12:24:00Z

@ofrei Sounds great! Is the cleansumstats pipeline available in the containers or is it something to implement?

In addition to your suggestions, it's probably a good idea to also provide the cleansumstats pipeline in the prs.py script, as you can run into the same issue (missing CHR/COL) with publicly available sumstats?

For now, in the EOMDD project I have just shared the branch for PR #143 with the analysts. For the EOMDD sumstats, there are a lot of SNPs that are not matched to the HRC data by RSID. Though I guess it's enough for LDpred2, as about 1M HapMap3 SNPs are still retained.

ofrei · 2023-03-08T13:07:54Z

The cleansumstats pipeline is available here: https://github.com/BioPsyk/cleansumstats/

ofrei · 2023-03-08T15:07:11Z

For prs.py I would rather not add an option to automatically run the cleansumstats pipeline as a pre-processing step. This is because PRS rarely works out on the first attempt; normally multiple analyses are needed. So it makes sense for the user to run the "cleansumstats.py" pipeline just once as a pre-processing step, and then prs.py can require the CHR and BP columns to be in place. Overall we could expect that all sumstats are formatted according to the description here: https://github.com/comorment/containers/blob/main/gwas/sumstats_specification.md (this perhaps needs more clarity about which columns are required and which are optional).
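A check of this kind could look roughly like the sketch below. It is not taken from prs.py: the function name is hypothetical, and treating only CHR and BP as required is an assumption, since the specification does not yet say which columns are mandatory.

```python
import csv
import io

# Assumption: only CHR and BP are treated as required here.
REQUIRED_COLUMNS = ("CHR", "BP")


def missing_columns(handle, required=REQUIRED_COLUMNS, delimiter="\t"):
    """Return the required column names absent from the file's header line."""
    header = next(csv.reader(handle, delimiter=delimiter), [])
    return [col for col in required if col not in header]


# Toy inputs (made up): one header without positions, one with them.
without_positions = io.StringIO("SNP\tA1\tA2\tBETA\tSE\tPVAL\nrs1\tA\tG\t0.01\t0.005\t0.04\n")
with_positions = io.StringIO("SNP\tCHR\tBP\tA1\tA2\tBETA\nrs1\t1\t1000\tA\tG\t0.01\n")

for name, handle in [("without_positions", without_positions),
                     ("with_positions", with_positions)]:
    missing = missing_columns(handle)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{name}: {status}")
```

Failing early on a missing column points the user to the one-off pre-processing step rather than to an error deep inside the PRS tools.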

ofrei · 2023-03-08T15:09:48Z

> I have just shared the branch for PR #143 to the analysts

Sounds good. Also, using 1M HapMap3 SNPs sounds reasonable to me. We can start using https://github.com/comorment/ldpred2_ref as reference because this was produced by LDpred2; it's also good to use the same reference across sites for improved consistency, rather than having each site generate its own local reference.

deepchocolate · 2023-03-13T14:45:04Z

@ofrei By the way, shall I skip PR #143 or proceed with it?

ofrei · 2023-03-13T15:59:30Z

To me this is fine to merge; it's a good script to have.

deepchocolate · 2023-03-27T13:40:15Z

@ofrei @espenhgn What's the appropriate way of incorporating the cleansumstats pipeline? Fork the repository?

github-actions · 2023-06-26T02:08:18Z

This issue appears to be stale due to non-activity

deepchocolate mentioned this issue Mar 1, 2023

Sumstats: Add column and position #143

Merged


github-actions bot added the no-issue-activity label Jun 26, 2023
