Number of variants in VCF and HTML summary do not match
First of all, SnpEff is probably giving you the right numbers. The mismatch is most likely not a bug, but an interpretation issue.
It is important to remember that the VCF format specification allows having multiple variants in a single line. Also, a single variant can have more than one annotation, due to:
- Multiple transcripts (isoforms) of a gene.
- Multiple (overlapping) genes in the genomic location of the variant.
- The variant spanning multiple genes (e.g. a translocation, large deletion, etc.).
When you count the number of variants, you must keep all of these in mind to count them properly. SnpEff takes all of this into account when counting the variants for the summary HTML.
Most reported mismatches between the number of variants in the summary (HTML) file and the number of variants in the VCF file come from counting mistakes that overlook one or more of the items above.
A typical scenario is, for example, that people are "counting missense variants" using something like this:
grep missense file.vcf | wc -l
This counts "lines in a VCF file that have at least one missense variant", as opposed to "missense annotations". As mentioned previously, the number of lines in a VCF file is not the same as the number of annotations or the number of variants.
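To make the difference concrete, here is a minimal Python sketch that reports three separate counts for the same file: lines with at least one missense annotation, distinct ALT alleles with a missense annotation, and individual missense annotations. It assumes annotations are stored in the `ANN` INFO field (comma-separated annotations, pipe-separated subfields, allele first and effect second, with multiple effects joined by `&`). The function name `count_missense` is hypothetical and not part of SnpEff. Which of these counts is comparable to a given table in the HTML summary depends on what that table reports.

```python
import sys


def count_missense(vcf_lines, effect="missense_variant"):
    """Return (lines, alt_alleles, annotations) that carry `effect`.

    Assumes SnpEff-style ANN annotations in the INFO column (column 8).
    """
    n_lines = n_alleles = n_annotations = 0
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        # Find the ANN entry inside the semicolon-separated INFO column.
        ann_value = ""
        for entry in fields[7].split(";"):
            if entry.startswith("ANN="):
                ann_value = entry[len("ANN="):]
                break
        alleles = set()
        hits = 0
        # One annotation per transcript/gene/allele combination.
        for ann in ann_value.split(","):
            sub = ann.split("|")
            if len(sub) < 2:
                continue
            # Subfield 0 is the allele, subfield 1 the effect(s).
            if effect in sub[1].split("&"):
                hits += 1
                alleles.add(sub[0])
        if hits:
            n_lines += 1
            n_alleles += len(alleles)
            n_annotations += hits
    return n_lines, n_alleles, n_annotations


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        lines, alleles, annotations = count_missense(handle)
    print(f"lines: {lines}  ALT alleles: {alleles}  annotations: {annotations}")
```

On a file where one line lists two ALT alleles and several transcripts, the three numbers will differ, and only the first corresponds to what the grep command above measures on the data lines.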