feature request: duplicate merging and selecting #2342

Open
davmlaw opened this issue Dec 18, 2024 · 0 comments

davmlaw commented Dec 18, 2024

My use case is lifting over gnomAD v4.1 from GRCh38 to T2T-CHM13v2.0. Sometimes multiple GRCh38 variants resolve to the same T2T coordinate, and I want to be able to process these duplicates (say, picking the one with the highest or lowest AF) rather than just taking the first in the file.

Control how selecting duplicates works

It would be really useful to be able to choose which one to keep. You could do this by defining how to sort the duplicates and then taking the first one. For instance, to take the one with the highest AF, then the highest AC: --rm-dup-sort=-AF,-AC or --rm-dup-sort=AF:desc,AC:desc
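To make the proposed selection rule concrete, here is a minimal sketch in plain Python. It is not bcftools code: the record layout (dicts with `chrom`, `pos`, `ref`, `alt`, `info`) and the function names `parse_sort_spec` and `select_duplicates` are hypothetical, and it assumes every sort tag is present and numeric on each record. It only handles the `-AF,-AC` style of spec.

```python
def parse_sort_spec(spec):
    """Turn "-AF,-AC" into [("AF", True), ("AC", True)]; True means descending."""
    keys = []
    for item in spec.split(","):
        item = item.strip()
        descending = item.startswith("-")
        keys.append((item.lstrip("-"), descending))
    return keys


def select_duplicates(records, spec):
    """Keep one record per (chrom, pos, ref, alt), chosen by the sort spec."""
    keys = parse_sort_spec(spec)

    def rank(rec):
        # Negate descending keys so that a smaller tuple always means "preferred".
        return tuple(
            -rec["info"][tag] if desc else rec["info"][tag] for tag, desc in keys
        )

    best = {}  # dicts keep insertion order, so the output follows first-seen order
    for rec in records:
        site = (rec["chrom"], rec["pos"], rec["ref"], rec["alt"])
        # Strict "<" means ties are resolved in favour of the earlier record.
        if site not in best or rank(rec) < rank(best[site]):
            best[site] = rec
    return list(best.values())


# Illustrative data only, not real gnomAD values.
records = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T", "info": {"AF": 0.01, "AC": 12}},
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T", "info": {"AF": 0.20, "AC": 300}},
    {"chrom": "chr1", "pos": 250, "ref": "G", "alt": "C", "info": {"AF": 0.05, "AC": 40}},
]
kept = select_duplicates(records, "-AF,-AC")
```

Here `kept` holds two records: the chr1:100 entry with the higher AF, and the chr1:250 entry, which had no duplicates.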

Merge functionality with duplicates

Merge has --info-rules, which works on the same variant across different files. It would be nice to be able to apply this to the same variant within a single file. For instance, norm --rm-dup --info-rules=BCFTOOLS_OLD_VARIANT:join would have allowed a workaround for this issue
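As a rough illustration of what a "join" rule applied to in-file duplicates could do, the sketch below collapses records at the same site and concatenates the chosen INFO tags. The function name `merge_duplicates`, the dict-based record layout, and the comma separator are assumptions made for this example; the tag values are placeholders rather than real BCFTOOLS_OLD_VARIANT strings.

```python
def merge_duplicates(records, join_tags, sep=","):
    """Collapse records sharing (chrom, pos, ref, alt), joining the given INFO tags."""
    merged = {}
    for rec in records:
        site = (rec["chrom"], rec["pos"], rec["ref"], rec["alt"])
        if site not in merged:
            # Copy the record and its INFO dict so the input is left untouched.
            merged[site] = {**rec, "info": dict(rec["info"])}
            continue
        info = merged[site]["info"]
        for tag in join_tags:
            if tag not in rec["info"]:
                continue
            if tag in info:
                info[tag] = f"{info[tag]}{sep}{rec['info'][tag]}"
            else:
                info[tag] = rec["info"][tag]
    return list(merged.values())


# Placeholder values; real tag contents would come from the lifted-over file.
dupes = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T",
     "info": {"BCFTOOLS_OLD_VARIANT": "old_record_1"}},
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T",
     "info": {"BCFTOOLS_OLD_VARIANT": "old_record_2"}},
]
merged = merge_duplicates(dupes, ["BCFTOOLS_OLD_VARIANT"])
```

All other INFO tags are taken from the first record at each site, which mirrors the current "take the first" behaviour for everything that is not explicitly joined.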

Mark duplicates

Another way to solve this would be to mark duplicates rather than remove them, for instance a DUPLICATE flag.

Then I could select them out into a separate file and:

  • Use existing merge to bring them back with --info-rules
  • Process this much smaller file in Python however I want, then merge the result back (much quicker than processing ~100G of compressed VCF in Python)
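
The mark-then-split workflow described above could look roughly like this. It is a sketch under stated assumptions: records are plain dicts held in a list in memory, `DUPLICATE` is the flag name suggested in this request, and `site_of`, `mark_duplicates` and `split_by_flag` are hypothetical helpers. A real implementation on a sorted ~100G file would more likely stream and compare neighbouring records.

```python
from collections import Counter


def site_of(rec):
    """Identity of a variant: same chrom, pos, ref and alt means duplicate."""
    return (rec["chrom"], rec["pos"], rec["ref"], rec["alt"])


def mark_duplicates(records):
    """Return copies of the records, adding DUPLICATE=True where a site repeats."""
    counts = Counter(site_of(rec) for rec in records)
    marked = []
    for rec in records:
        info = dict(rec["info"])
        if counts[site_of(rec)] > 1:
            info["DUPLICATE"] = True
        marked.append({**rec, "info": info})
    return marked


def split_by_flag(records):
    """Separate flagged records so only the small duplicate set needs custom handling."""
    unique = [r for r in records if not r["info"].get("DUPLICATE")]
    flagged = [r for r in records if r["info"].get("DUPLICATE")]
    return unique, flagged


# Illustrative data only.
records = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T", "info": {"AF": 0.01}},
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T", "info": {"AF": 0.20}},
    {"chrom": "chr1", "pos": 250, "ref": "G", "alt": "C", "info": {"AF": 0.05}},
]
unique, flagged = split_by_flag(mark_duplicates(records))
```

Every member of a duplicate group is flagged, including the first, so the flagged subset contains all the information needed to decide which record to keep.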