Replace HmVCF with code based on pgscatalog_utils/match_variants #20

Open
smlmbrt opened this issue Feb 3, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@smlmbrt
Member

smlmbrt commented Feb 3, 2023

The current HmVCF code is very slow. Using match_variants would be much faster and more consistent with what we're doing in pgsc_calc.

We would use a variant of this script: https://github.com/PGScatalog/pgscatalog_utils/blob/main/pgscatalog_utils/match/match_variants.py

It currently writes a CSV of all possible matches to the log file, along with which match was best. Example:

row_nr accession chr_name chr_position effect_allele other_allele effect_weight effect_type ID REF ALT matched_effect_allele match_type is_multiallelic ambiguous match_flipped best_match exclude duplicate_best_match duplicate_ID match_IDs match_status dataset
0 PGS000018_hmPOS_GRCh37 1 2245570 G C -0.0276009 additive 1:2245570:C:G C G G altref false true false true true false false false excluded 1000G-chr2
0 PGS000018_hmPOS_GRCh37 1 2245570 G C -0.0276009 additive 1:2245570:C:G C G C refalt_flip false true true false true false false false not_best 1000G-chr2
1 PGS000018_hmPOS_GRCh37 1 22132518 G A 0.023934 additive 1:22132518:G:A G A G refalt false false false true true false false false excluded 1000G-chr2
2 PGS000018_hmPOS_GRCh37 1 38386727 G A -0.0174935 additive 1:38386727:G:A G A G refalt false false false true true false false false excluded 1000G-chr2
3 PGS000018_hmPOS_GRCh37 1 55496039 T C 0.0293005 additive 1:55496039:T:C T C T refalt false false false true true false false false excluded 1000G-chr2
8 PGS000018_hmPOS_GRCh37 1 110298166 G C 0.0245969 additive 1:110298166:G:C G C G refalt false true false true true false false false excluded 1000G-chr2
8 PGS000018_hmPOS_GRCh37 1 110298166 G C 0.0245969 additive 1:110298166:G:C G C C altref_flip false true true false true false false false not_best 1000G-chr2
9 PGS000018_hmPOS_GRCh37 1 151762308 G C 0.0209215 additive 1:151762308:C:G C G G altref false true false true true false false false excluded 1000G-chr2
9 PGS000018_hmPOS_GRCh37 1 151762308 G C 0.0209215 additive 1:151762308:C:G C G C refalt_flip false true true false true false false false not_best 1000G-chr2
10 PGS000018_hmPOS_GRCh37 1 154395946 G A -0.0197906 additive 1:154395946:A:G A G G altref false false false true true false false false excluded 1000G-chr2
28 PGS000018_hmPOS_GRCh37 2 164945044 G C 0.0213456 additive 2:164945044:G:C G C G refalt false true false true true false false true excluded 1000G-chr2
28 PGS000018_hmPOS_GRCh37 2 164945044 G C 0.0213456 additive 2:164945044:G:C G C C altref_flip false true true false true false false true not_best 1000G-chr2
29 PGS000018_hmPOS_GRCh37 2 202799924 C T -0.0226885 additive 2:202799924:T:C T C C altref false false false true false false false true matched 1000G-chr2
30 PGS000018_hmPOS_GRCh37 2 203829225 A C -0.0526925 additive 2:203829225:A:C A C A refalt false false false true false false false true matched 1000G-chr2

The rows of this file could then be processed into the output of the current HmVCF, i.e. a single row per scoring file variant with information about how it was matched or excluded (the harmonisation code).
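
As a rough sketch of that processing step (not the actual pgs-harmonizer or pgscatalog_utils code), the candidate rows could be grouped by scoring file row and collapsed to the row flagged as `best_match`. The helper names `parse_log_excerpt` and `summarise_matches` are hypothetical, the whitespace parsing only fits the excerpt as pasted above (the real log is a CSV), and the mapping from `match_status` to a harmonisation code is deliberately left out because it is not defined in this issue.

```python
from collections import defaultdict

# Header and three rows copied from the log excerpt above.
SAMPLE_LOG = """
row_nr accession chr_name chr_position effect_allele other_allele effect_weight effect_type ID REF ALT matched_effect_allele match_type is_multiallelic ambiguous match_flipped best_match exclude duplicate_best_match duplicate_ID match_IDs match_status dataset
0 PGS000018_hmPOS_GRCh37 1 2245570 G C -0.0276009 additive 1:2245570:C:G C G G altref false true false true true false false false excluded 1000G-chr2
0 PGS000018_hmPOS_GRCh37 1 2245570 G C -0.0276009 additive 1:2245570:C:G C G C refalt_flip false true true false true false false false not_best 1000G-chr2
29 PGS000018_hmPOS_GRCh37 2 202799924 C T -0.0226885 additive 2:202799924:T:C T C C altref false false false true false false false true matched 1000G-chr2
"""


def parse_log_excerpt(text):
    """Parse the whitespace-separated excerpt into a list of dicts (hypothetical helper)."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def summarise_matches(rows):
    """Collapse candidate matches into one summary row per scoring file variant."""
    by_variant = defaultdict(list)
    for row in rows:
        by_variant[(row["accession"], int(row["row_nr"]))].append(row)

    summary = []
    for (accession, row_nr), candidates in sorted(by_variant.items()):
        # Assumption: the row flagged best_match describes the variant's outcome;
        # fall back to the first candidate if none is flagged.
        best = next((c for c in candidates if c["best_match"] == "true"), candidates[0])
        summary.append({
            "accession": accession,
            "row_nr": row_nr,
            "matched_id": best["ID"],
            "match_type": best["match_type"],
            "match_status": best["match_status"],
            "n_candidates": len(candidates),
        })
    return summary


if __name__ == "__main__":
    for record in summarise_matches(parse_log_excerpt(SAMPLE_LOG)):
        print(record)
```

Keeping `n_candidates` alongside the chosen match makes it easy to see which variants had competing matches (for example the ambiguous ones) before a harmonisation code is assigned.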

@smlmbrt smlmbrt added the enhancement New feature or request label Feb 3, 2023
@ens-lgil
Member

ens-lgil commented Feb 9, 2023

I am wondering whether we could separate the match_variants() call from the argparser, so that we could use the module directly and set the parameters independently, e.g.:

def match_variants():
    # CLI entry point: parse the command-line arguments, then delegate
    args = _parse_args()
    logging_level = args.verbose
    polars_threads = args.n_threads
    tmpdir = args.outdir
    OUTDIR = args.outdir
    scorefile = args.scorefile
    chrom = args.chrom
    run_match_variants(logging_level, polars_threads, tmpdir, OUTDIR, scorefile, chrom)


def run_match_variants(logging_level, polars_threads, tmpdir, OUTDIR, scorefile, chrom):
    # Importable entry point: same setup as before, driven by plain parameters
    config.set_logging_level(logging_level)
    config.setup_polars_threads(polars_threads)
    config.setup_tmpdir(tmpdir)
    config.setup_cleaning()
    config.OUTDIR = OUTDIR
    ... <rest_of_the_match_variants_method> ...

Then we could call the run_match_variants method directly from pgs-harmonizer.
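
To show how the two entry points would be used side by side, here is a self-contained toy version of the same split. It is only a sketch: the flag names are assumptions rather than the real pgscatalog_utils CLI, and the body of `run_match_variants` is a stand-in that simply returns its settings instead of doing any matching.

```python
import argparse


def run_match_variants(scorefile, outdir, chrom=None, n_threads=1, verbose=False):
    """Core function: takes plain Python values and knows nothing about argparse."""
    # Stand-in for the real matching logic, which is not reproduced here.
    return {
        "scorefile": scorefile,
        "outdir": outdir,
        "chrom": chrom,
        "n_threads": n_threads,
        "verbose": verbose,
    }


def _parse_args(argv=None):
    # Flag names are illustrative assumptions, not the real CLI.
    parser = argparse.ArgumentParser(description="Toy match_variants wrapper")
    parser.add_argument("--scorefile", required=True)
    parser.add_argument("--outdir", required=True)
    parser.add_argument("--chrom", default=None)
    parser.add_argument("--n_threads", type=int, default=1)
    parser.add_argument("-v", "--verbose", action="store_true")
    # argv=None makes argparse read sys.argv; a list lets callers and tests pass arguments in.
    return parser.parse_args(argv)


def match_variants(argv=None):
    """Thin CLI wrapper: parse arguments, then delegate to the core function."""
    args = _parse_args(argv)
    return run_match_variants(
        scorefile=args.scorefile,
        outdir=args.outdir,
        chrom=args.chrom,
        n_threads=args.n_threads,
        verbose=args.verbose,
    )


if __name__ == "__main__":
    # Programmatic use, as pgs-harmonizer might do it: no argument parsing involved.
    print(run_match_variants("scores.txt", "out", chrom="2"))
    # CLI-style use with an explicit argument list.
    print(match_variants(["--scorefile", "scores.txt", "--outdir", "out", "--chrom", "2"]))
```

Passing keyword arguments rather than six positional ones also avoids accidentally swapping parameters such as tmpdir and OUTDIR at the call site.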

@smlmbrt
Member Author

smlmbrt commented Feb 10, 2023

I think that would work. I guess you could draft it as a PR? Although I think the output (the log_and_write function) will have to be edited, so maybe you will have to use the actual methods directly?
