
how to subset .vcf.gz file to include only variants whose genomic coordinates are given in a list #2332

Open
annilk opened this issue Dec 9, 2024 · 3 comments

Comments


annilk commented Dec 9, 2024

Hi,

I would like to subset a large .vcf.gz file so that I can read it in R with read.vcfR from the vcfR package (reading the full .vcf.gz file causes memory issues). I only need certain variants, which are given in a list. This is what I have tried:

~/bcftools-1.12/bcftools view -T snplist.txt hbcs_sisu_b38.vcf.gz -o hbcs_sisu_b38_subset.vcf.gz

The file 'snplist.txt' is tab-delimited and contains the columns '#CHROM' and 'POS' (I am not sure whether they are required).

I have also tried the option '-R' instead of '-T' with the 'view' command, and the 'filter' command instead of 'view', with both '-T' and '-R'. But depending on which variants are included in snplist.txt, the subsetted file always contains either just one variant or no variants at all, even though

less -S hbcs_sisu_b38.vcf.gz | grep -f snplist.txt

prints lines for more variants.
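
One thing to keep in mind when comparing the two results: `grep -f` treats every line of the list as a regular-expression pattern that may match anywhere in a line, whereas `-T` compares the chromosome name and the position as whole fields. A hit from grep therefore does not show that bcftools would consider the same record a match. The sketch below illustrates this with made-up records; the temporary file names and the chr11 record are assumptions for illustration, not data from this thread.

```shell
# Illustrative data only: two fake VCF body lines and a one-entry list.
tmp=$(mktemp -d)
printf 'chr1\t19831748\trs4509550\tT\tC\n' >  "$tmp/body.tsv"
printf 'chr11\t19831748\trsFAKE\tA\tG\n'   >> "$tmp/body.tsv"
printf '1\t19831748\n' > "$tmp/list.txt"

# grep -f matches each pattern anywhere in the line, so "1<TAB>19831748"
# is found inside both "chr1<TAB>..." and "chr11<TAB>...".
matches=$(grep -c -f "$tmp/list.txt" "$tmp/body.tsv")
echo "grep -f matched $matches lines"

# An exact comparison of the first two columns finds nothing,
# because "1" is not the same contig name as "chr1" or "chr11".
exact=$(awk -F'\t' 'NR==FNR { want[$1 FS $2]; next } ($1 FS $2) in want' \
        "$tmp/list.txt" "$tmp/body.tsv" | wc -l)
echo "exact CHROM/POS matches: $((exact))"
```

With this toy input, grep reports two matching lines while the exact comparison reports none.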

I am not sure whether a .csi file is required here, but I have created hbcs_sisu_b38.vcf.gz.csi like this:

~/bcftools-1.12/bcftools index hbcs_sisu_b38.vcf.gz

Member

pd3 commented Dec 9, 2024

The command looks correct. This is very basic functionality, so it's strange that it wouldn't work. Can you try upgrading to the latest version of bcftools? We are at 1.21 now. If there is something wrong with the input data, the newer version might give a more informative error message.

The -T option does not require an index, so it's unlikely that it is the problem.

If upgrading does not help, can you provide a small test case for us to reproduce the problem?

Author

annilk commented Dec 9, 2024

Hi,

Thanks for the fast reply. I downloaded and installed version 1.21, but now I get an error message saying 'Could not parse 2-th line of file snplist.txt, using the columns 1,2[,3] Failed to read the targets: snplist.txt'

Here is the head of snplist.txt:
#CHROM POS
1 19831748
1 30185237
1 30187395

The head of hbcs_sisu_b38.vcf.gz would be quite massive, so I have pasted only the first seven columns of the output of less -S hbcs_sisu_b38.vcf.gz | grep -f snplist.txt:

#CHROM POS ID REF ALT QUAL FILTER
chr1 19831748 rs4509550 T C . PASS
chr1 30185237 rs7536179 T C . PASS
chr1 30187395 rs11371593 T TG . PASS
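
Comparing the two excerpts above, the list names the chromosome `1` while the VCF records use `chr1`. bcftools matches contig names literally, so entries written as `1` would not select `chr1` records. Separately, a targets file is expected to be strictly tab-separated, and stray spaces or Windows line endings can make it unparseable. Whether either of these explains the error reported here is an assumption, not something confirmed in this thread. A minimal sketch of normalising the list follows; the output name `snplist.chr.tsv` is hypothetical, and the sample list is deliberately written with spaces.

```shell
# Illustrative copy of the list from the thread (space-separated on purpose).
tmp=$(mktemp -d)
printf '#CHROM POS\n1 19831748\n1 30185237\n1 30187395\n' > "$tmp/snplist.txt"

# Rewrite the list: skip header/comment lines, strip any carriage return,
# add a "chr" prefix when it is missing, and emit exactly CHROM<TAB>POS.
awk 'BEGIN { OFS = "\t" }
     /^#/ { next }
     { sub(/\r$/, "") }
     NF >= 2 { chrom = $1; if (chrom !~ /^chr/) chrom = "chr" chrom; print chrom, $2 }' \
    "$tmp/snplist.txt" > "$tmp/snplist.chr.tsv"

cat "$tmp/snplist.chr.tsv"

# Only attempted when bcftools and the VCF from the thread are available.
# -Oz requests bgzip-compressed output to match the .vcf.gz file name.
if command -v bcftools >/dev/null 2>&1 && [ -f hbcs_sisu_b38.vcf.gz ]; then
    bcftools view -T "$tmp/snplist.chr.tsv" -Oz \
        -o hbcs_sisu_b38_subset.vcf.gz hbcs_sisu_b38.vcf.gz
fi
```

The awk step splits on any run of whitespace, so it accepts both space- and tab-separated input and always writes tab-separated output.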

Author

annilk commented Dec 14, 2024

Let me know if you need more information to be able to reproduce the problem!
