Sumstats: Join up chromosome and position #137

Open
deepchocolate opened this issue Feb 21, 2023 · 9 comments

Comments

@deepchocolate
Contributor

Is your feature request related to a problem? Please describe.
I think this is an issue with sumstats produced by meta-analyses using METAL. The headers of such files look like:
[image: header of a METAL output file]
Thus chromosome and position are lacking in this file. I think both PRSice and LDpred2 (bigsnpr::snp_match) use this information for quality control. To simplify the process for the user, it would be nice if this information could be appended, I guess using the HRC data.

Describe the solution you'd like
Provide a tool to append this information to sumstats.

@espenhgn @ofrei What do you think is the best way of doing this?
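
For illustration, the requested tool could be sketched as a lookup of each rsID in a reference table. This is only a minimal sketch under stated assumptions: the sumstats columns follow METAL's default output naming (`MarkerName`, `Allele1`, ...), the reference layout (`SNP`, `CHR`, `BP`) is hypothetical rather than the actual HRC file format, and the function name `append_chr_bp` is made up for this example.

```python
import csv
import io


def append_chr_bp(sumstats_rows, reference_rows, snp_col="MarkerName"):
    """Split rows into (matched, unmatched); matched rows gain CHR and BP."""
    # Hypothetical reference layout: one row per variant with SNP, CHR, BP.
    lookup = {ref["SNP"]: (ref["CHR"], ref["BP"]) for ref in reference_rows}
    matched, unmatched = [], []
    for row in sumstats_rows:
        hit = lookup.get(row[snp_col])
        if hit is None:
            unmatched.append(row)  # rsID not found in the reference
        else:
            matched.append({**row, "CHR": hit[0], "BP": hit[1]})
    return matched, unmatched


# Toy data with made-up rsIDs and values, for illustration only.
SUMSTATS_TSV = (
    "MarkerName\tAllele1\tAllele2\tEffect\tStdErr\tP-value\n"
    "rs1\ta\tg\t0.01\t0.005\t0.04\n"
    "rs2\tc\tt\t-0.02\t0.010\t0.05\n"
    "rs3\ta\tc\t0.00\t0.020\t0.90\n"
)
REFERENCE_TSV = "SNP\tCHR\tBP\nrs1\t1\t1000\nrs2\t2\t2000\n"

sumstats = list(csv.DictReader(io.StringIO(SUMSTATS_TSV), delimiter="\t"))
reference = list(csv.DictReader(io.StringIO(REFERENCE_TSV), delimiter="\t"))
matched, unmatched = append_chr_bp(sumstats, reference)
print(f"matched {len(matched)} of {len(sumstats)} SNPs")
```

Returning the unmatched rows separately makes it easy to report how many SNPs could not be annotated, instead of dropping them silently.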

@ofrei
Contributor

ofrei commented Mar 8, 2023

Hi @deepchocolate, and sorry for the slow response!

I agree this is an important issue that should be addressed one way or another, i.e. we need to augment METAL output with CHR and BP columns. Doing this based on the HRC reference is also a reasonable strategy. There are some limitations with this approach (e.g. some fraction of SNPs may have rsID values that are inconsistent between the sumstats and HRC, and some may be missing from HRC altogether), but none of these are major show stoppers.

A more robust approach would be to apply the https://github.com/BioPsyk/cleansumstats/ pipeline to each sumstat prior to running METAL, and then also apply this pipeline to the output of the METAL script. The point of applying it prior to METAL is to harmonize rs# across all summary statistics being meta-analyzed. This can improve the meta-analysis because METAL merges SNPs based on rs# (also looking at A1 / A2), but ignores chromosomes and positions.
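
A toy illustration of why harmonizing rs# matters: when the same variant carries different identifiers in two studies, a merge keyed on the marker name misses it, while a key based on position still matches. This is not METAL's actual implementation; the function names, rsIDs and records are made up, and treating the allele pair as unordered is a simplification that ignores strand flips.

```python
def marker_keys(study):
    """Keys resembling an rs#-based merge: marker name plus unordered alleles."""
    return {(row["SNP"], frozenset((row["A1"], row["A2"]))) for row in study}


def position_keys(study):
    """Keys based on chromosome, position and unordered alleles."""
    return {
        (row["CHR"], row["BP"], frozenset((row["A1"], row["A2"])))
        for row in study
    }


# Two made-up studies: the second variant has a different ID in each study.
study_a = [
    {"SNP": "rs1", "CHR": "1", "BP": 1000, "A1": "A", "A2": "G"},
    {"SNP": "rs2_old", "CHR": "2", "BP": 2000, "A1": "C", "A2": "T"},
]
study_b = [
    {"SNP": "rs1", "CHR": "1", "BP": 1000, "A1": "G", "A2": "A"},
    {"SNP": "rs2_new", "CHR": "2", "BP": 2000, "A1": "C", "A2": "T"},
]

shared_by_marker = marker_keys(study_a) & marker_keys(study_b)
shared_by_position = position_keys(study_a) & position_keys(study_b)
print(len(shared_by_marker), "variant(s) shared by marker name")
print(len(shared_by_position), "variant(s) shared by position")
```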

My suggestion would be to

  • extend the gwas.py script with an optional flag that would produce scripts for running the cleansumstats pipeline (presuming that the pipeline is already deployed as described in https://github.com/BioPsyk/cleansumstats/blob/master/README.md )
  • produce scripts for METAL analysis that apply the cleansumstats pipeline both before and after running METAL.
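
The ordering in the second suggestion (clean each input, run METAL, clean the METAL output) could be sketched as below. This is a rough sketch only: `plan_meta_analysis` is a hypothetical helper, not part of gwas.py, and the command templates are placeholders because the real cleansumstats and METAL invocations are not given in this thread.

```python
def plan_meta_analysis(sumstats_files, clean_template, metal_template,
                       metal_out="metal_output.tsv"):
    """Return shell command lines: clean inputs, run METAL, clean its output."""
    steps, cleaned = [], []
    for path in sumstats_files:
        out = path + ".cleaned"
        steps.append(clean_template.format(infile=path, outfile=out))
        cleaned.append(out)
    # METAL runs on the harmonized inputs only.
    steps.append(metal_template.format(infiles=" ".join(cleaned), outfile=metal_out))
    # The meta-analysis output is cleaned as well, as suggested above.
    steps.append(clean_template.format(infile=metal_out, outfile=metal_out + ".cleaned"))
    return steps


# Placeholder templates: substitute the real commands for your deployment.
CLEAN_TEMPLATE = "<cleansumstats command> {infile} {outfile}"
METAL_TEMPLATE = "<metal command> {infiles} {outfile}"

steps = plan_meta_analysis(["a.tsv", "b.tsv"], CLEAN_TEMPLATE, METAL_TEMPLATE)
for line in steps:
    print(line)
```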

@deepchocolate
Contributor Author

@ofrei Sounds great! Is the cleansumstats pipeline available in the containers or is it something to implement?

In addition to your suggestions, it's probably a good idea to also provide the cleansumstats pipeline in the prs.py script, as you can run into the same issue (missing CHR/BP columns) with publicly available sumstats?

For now, in the EOMDD project I have just shared the branch for PR #143 with the analysts. For the EOMDD sumstats, there are a lot of SNPs that are not matched to the HRC data by rsID. Though I guess it's enough for LDpred2, as about 1M HapMap3 SNPs are still retained.

@ofrei
Contributor

ofrei commented Mar 8, 2023

The cleansumstats pipeline is available here: https://github.com/BioPsyk/cleansumstats/

@ofrei
Contributor

ofrei commented Mar 8, 2023

For prs.py, I would rather not add an option to automatically run the cleansumstats pipeline as a pre-processing step. This is because a PRS analysis is rarely completed in a single attempt; normally multiple analyses are needed. So it makes sense for the user to run the "cleansumstats.py" pipeline just once as a pre-processing step, and then prs.py can require the CHR and BP columns to be in place. Overall, we could expect all sumstats to be formatted according to the description here: https://github.com/comorment/containers/blob/main/gwas/sumstats_specification.md (this perhaps needs more clarity about which columns are required and which are optional).
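
A check of this kind could be sketched as follows. It is a minimal sketch, not prs.py code: `missing_columns` is a hypothetical helper, the tab separator is an assumption, and only CHR and BP are treated as required because those are the columns named above.

```python
def missing_columns(header_line, required=("CHR", "BP"), sep="\t"):
    """Return the required column names absent from a sumstats header line."""
    columns = header_line.rstrip("\r\n").split(sep)
    return [name for name in required if name not in columns]


# A METAL-style header lacks both columns; a cleaned header has them.
metal_header = "MarkerName\tAllele1\tAllele2\tEffect\tStdErr\tP-value\n"
clean_header = "SNP\tCHR\tBP\tA1\tA2\n"

for header in (metal_header, clean_header):
    missing = missing_columns(header)
    if missing:
        print("missing required columns:", ", ".join(missing))
    else:
        print("header OK")
```

Failing early with the list of missing columns tells the user to run the pre-processing step once, rather than discovering the problem partway through an analysis.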

@ofrei
Contributor

ofrei commented Mar 8, 2023

> I have just shared the branch for PR #143 with the analysts

Sounds good. Using the 1M HapMap3 SNPs also sounds reasonable to me. We can start using https://github.com/comorment/ldpred2_ref as the reference because it was produced by LDpred2; it's also good to use the same reference across sites for improved consistency, rather than having each site generate a local reference.

@deepchocolate
Contributor Author

@ofrei By the way, shall I skip PR #143 or proceed with it?

@ofrei
Contributor

ofrei commented Mar 13, 2023

To me this is fine to merge; it's a good script to have.

@deepchocolate
Contributor Author

@ofrei @espenhgn What's the appropriate way of incorporating the cleansumstats pipeline? Fork the repository?

@github-actions

This issue appears to be stale due to non-activity
