Corrupted ClinVar database VCF file
It looks like some ClinVar versions contain illegal VCF values, which cause downstream analysis tools, such as SnpSift, to report errors.
For example, if you look into the file:
$ curl -s ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar.vcf.gz | gunzip -c | grep "&base" | head -n 1
13 32890543 125955 G A . . ALLELEID=131493;CLNDISDB=MedGen:C2675520,OMIM:612555;CLNDN=Breast-ovarian_cancer,_familial_2;CLNHGVS=NC_000013.10:g.32890543G>A;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=Breast_Cancer_Information_Core__(BRCA2):190-16&base_change=G_to_A;GENEINFO=BRCA2:675;MC=SO:0001627|intron_variant;ORIGIN=1;RS=276174799
As you can see, the "CLNVI" is:
CLNVI=Breast_Cancer_Information_Core__(BRCA2):190-16&base_change=G_to_A
This means that the CLNVI value contains an illegal '=' character.
The VCF specification clearly states that the equal sign is not allowed:
Reference: https://samtools.github.io/hts-specs/VCFv4.3.pdf
Section 1.2: "Character encoding, non-printable characters and characters with special meaning"
Characters with special meaning (such as field delimiters ’;’ in INFO or ’:’ FORMAT fields) must be represented
using the capitalized percent encoding:
%3A : (colon)
%3B ; (semicolon)
%3D = (equal sign)
...
Furthermore, section 1.6.1.8 specifies:
INFO - additional information: (String, no semi-colons or equals-signs permitted; commas are permitted only
as delimiters for lists of values; characters with special meaning can be encoded using the percent encoding, see
Section 1.2; space characters are allowed)
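To make the encoding rule concrete, here is a minimal shell sketch that percent-encodes a single INFO value. The function name `encode_info_value` is hypothetical. It handles only ';' and '=' plus the percent sign itself (%25 in the specification's full list), and it is meant for one value at a time, not for a whole INFO column.

```shell
#!/bin/sh
# Percent-encode the characters that would break a single INFO value.
# '%' is encoded first so the escapes added afterwards are not re-encoded.
encode_info_value() {
    printf '%s\n' "$1" | sed -e 's/%/%25/g' -e 's/;/%3B/g' -e 's/=/%3D/g'
}

# Example with one of the offending CLNHGVS values:
encode_info_value 'NC_000001.10:g.25717365C='   # prints NC_000001.10:g.25717365C%3D
```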
An easy way to find many of the problems in the VCF file is to use the SnpSift vcfCheck command:
$ java -jar SnpSift.jar vcfCheck clinvar.vcf.gz 2>&1 | head
...WARNING: Malformed VCF entryfile '/home/pcingola/Downloads/clinvar.vcf.gz', line 3655:
Entry : 1 25717365 17708 C C . . ALLELEID=32747;CLNDISDB=.;CLNDN=RH_E/e_POLYMORPHISM;CLNHGVS=NC_000001.10:g.25717365C=;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=OMIM_Allelic_Variant:111700.0001;GENEINFO=RHCE:6006;MC=SO:0001627|intron_variant,SO:0001819|synonymous_variant;ORIGIN=1;RS=609320
Errors :
INFO field 'CLNHGVS' has an invalid value 'NC_000001.10:g.25717365C=' (no spaces, tabs, '=' or ';' are allowed)
WARNING: Malformed VCF entryfile '/home/pcingola/Downloads/clinvar.vcf.gz', line 3657:
Entry : 1 25735202 242743 G G . . ALLELEID=38411;CLNHGVS=NC_000001.10:g.25735202G=;CLNREVSTAT=no_interpretation_for_the_single_variant;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=OMIM_Allelic_Variant:111700.0002;GENEINFO=RHCE:6006;MC=SO:0001819|synonymous_variant;ORIGIN=1;RS=676785;SSR=1;CLNDISDBINCL=.;CLNDNINCL=RH_C/c_POLYMORPHISM;CLNSIGINCL=17709:Benign
Errors :
INFO field 'CLNHGVS' has an invalid value 'NC_000001.10:g.25735202G=' (no spaces, tabs, '=' or ';' are allowed)
OK, it looks like there are quite a few problems, let's count them:
$ java -jar SnpSift.jar vcfCheck ~/Downloads/clinvar.vcf.gz 2>&1 | grep WARN | wc -l
1793
Well, there seem to be 1793 VCF lines with some sort of problem. Let's see how to fix them.
So, you need to fix ClinVar by either:
- Removing the offending fields from the VCF file
- Fixing the character encoding of the offending values
Removing the offending fields is the easiest way to fix ClinVar's VCF file. First, let's find the corrupted fields:
$ java -jar SnpSift.jar vcfCheck clinvar.vcf.gz 2>&1 | grep "INFO field" | cut -f 2 -d "'" | sort | uniq -c
212 CLNHGVS
1583 CLNVI
OK, there are 212 lines with corrupted CLNHGVS fields and 1583 lines with corrupted CLNVI fields.
Let's create a new database without those fields:
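As an independent cross-check that does not rely on SnpSift, the per-field tally can be approximated with awk. This is a rough sketch under stated assumptions: the function name `count_bad_info_keys` is hypothetical, it reads an uncompressed VCF on standard input, and it only detects a stray '=' inside a value, so its counts may not match vcfCheck exactly.

```shell
#!/bin/sh
# Tally INFO keys whose value contains a stray "=" (uncompressed VCF on stdin).
count_bad_info_keys() {
    awk -F '\t' '
        /^#/ { next }                        # skip header lines
        {
            n = split($8, entries, ";")      # column 8 is INFO
            for (i = 1; i <= n; i++) {
                eq = index(entries[i], "=")
                if (eq == 0) continue        # flag-style entry without a value
                key = substr(entries[i], 1, eq - 1)
                value = substr(entries[i], eq + 1)
                if (index(value, "=") > 0) bad[key]++
            }
        }
        END { for (key in bad) print bad[key], key }
    ' | sort -k2
}

# Possible usage:
#   gunzip -c clinvar.vcf.gz | count_bad_info_keys
```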
$ java -jar SnpSift.jar rmInfo clinvar.vcf.gz CLNHGVS CLNVI > clinvar.fixed_1.vcf
00:00:00 Reading STDIN
00:00:03 Done
# Let's also compress and index the new file so we can use it as a database
$ bgzip clinvar.fixed_1.vcf
$ tabix clinvar.fixed_1.vcf.gz
Now we can re-check the new file to make sure it's OK.
$ java -jar SnpSift.jar vcfCheck clinvar.fixed_1.vcf.gz
....................................................................................................
100000 ....................................................................................................
200000 ..................................................................................................
Everything seems OK.
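If SnpSift is not at hand, the same removal can be approximated with awk. The following is a minimal sketch, not a description of what rmInfo does internally: the function name `strip_info_keys` is hypothetical, it reads an uncompressed VCF on standard input, and it leaves the `##INFO` header lines of the removed keys in place.

```shell
#!/bin/sh
# Drop the named INFO keys from every data line (uncompressed VCF on stdin).
# Usage: strip_info_keys KEY [KEY...]
strip_info_keys() {
    awk -F '\t' -v OFS='\t' -v keys="$*" '
        BEGIN { n = split(keys, k, " "); for (i = 1; i <= n; i++) drop[k[i]] = 1 }
        /^#/ { print; next }                  # header lines pass through unchanged
        {
            m = split($8, entries, ";")       # column 8 is INFO
            out = ""
            for (i = 1; i <= m; i++) {
                key = entries[i]
                sub(/[=].*/, "", key)         # keep the text before the first "="
                if (key in drop) continue
                out = (out == "" ? entries[i] : out ";" entries[i])
            }
            $8 = (out == "" ? "." : out)      # an empty INFO column is written as "."
            print
        }
    '
}

# Possible usage:
#   gunzip -c clinvar.vcf.gz | strip_info_keys CLNHGVS CLNVI > clinvar.fixed_1.vcf
```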
The other option is to fix the encoding of the offending values.
We already know that the problematic fields are CLNVI and CLNHGVS, so what exactly are the problems?
$ java -jar SnpSift.jar vcfCheck clinvar.vcf.gz 2>&1 | grep "INFO field" | grep CLNVI | head
INFO field 'CLNVI' has an invalid value 'Breast_Cancer_Information_Core__(BRCA2):190-16&base_change=G_to_A' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNVI' has an invalid value 'Breast_Cancer_Information_Core__(BRCA2):190-12&base_change=del_TCT' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNVI' has an invalid value 'Breast_Cancer_Information_Core__(BRCA2):190-5&base_change=del_T' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNVI' has an invalid value 'Breast_Cancer_Information_Core__(BRCA2):190-7&base_change=T_to_C' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNVI' has an invalid value 'Breast_Cancer_Information_Core__(BRCA2):195&base_change=T_to_C' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNVI' has an invalid value 'Breast_Cancer_Information_Core__(BRCA2):203&base_change=G_to_A' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNVI' has an invalid value 'Breast_Cancer_Information_Core__(BRCA2):203&base_change=G_to_C' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNVI' has an invalid value 'Breast_Cancer_Information_Core__(BRCA2):203&base_change=G_to_T' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNVI' has an invalid value 'Breast_Cancer_Information_Core__(BRCA2):214&base_change=A_to_C' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNVI' has an invalid value 'Breast_Cancer_Information_Core__(BRCA2):215&base_change=T_to_C' (no spaces, tabs, '=' or ';' are allowed)
OK, this one seems easy: all we need to do is change &base_change= to &base_change%3D.
This can be done with a simple sed command:
$ zcat clinvar.vcf.gz | sed 's/\&base_change=/\&base_change%3D/g'
How about the other field?
$ java -jar SnpSift.jar vcfCheck clinvar.vcf.gz 2>&1 | grep "INFO field" | grep CLNHGVS | head
INFO field 'CLNHGVS' has an invalid value 'NC_000001.10:g.25717365C=' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNHGVS' has an invalid value 'NC_000001.10:g.25735202G=' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNHGVS' has an invalid value 'NC_000001.10:g.25735306T=' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNHGVS' has an invalid value 'NC_000001.10:g.25735331G=' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNHGVS' has an invalid value 'NC_000001.10:g.94578548T=' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNHGVS' has an invalid value 'NC_000001.10:g.98348885G=' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNHGVS' has an invalid value 'NC_000001.10:g.100672060T=' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNHGVS' has an invalid value 'NC_000001.10:g.114377568A=' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNHGVS' has an invalid value 'NC_000001.10:g.161599571T=' (no spaces, tabs, '=' or ';' are allowed)
INFO field 'CLNHGVS' has an invalid value 'NC_000001.10:g.161599643T=' (no spaces, tabs, '=' or ';' are allowed)
This is essentially the same, but we need four sed commands (one for each base):
$ zcat clinvar.vcf.gz \
| sed 's/A=;/A%3D;/' \
| sed 's/C=;/C%3D;/' \
| sed 's/G=;/G%3D;/' \
| sed 's/T=;/T%3D;/'
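The four substitutions above can also be collapsed into one by using a bracket expression and a back-reference. This is only a sketch (the function name `fix_clnhgvs` is hypothetical), and it assumes, like the commands above, that the offending value is followed by a ';'.

```shell
#!/bin/sh
# Encode a trailing "=" that directly follows a base and precedes ";".
# \1 puts back whichever base (A, C, G or T) was matched.
fix_clnhgvs() {
    sed 's/\([ACGT]\)=;/\1%3D;/'
}

# Possible usage:
#   zcat clinvar.vcf.gz | fix_clnhgvs
```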
The fix:
Now, let's put the two previously explained fixes together:
$ zcat clinvar.vcf.gz \
| sed 's/\&base_change=/\&base_change%3D/g' \
| sed 's/A=;/A%3D;/' \
| sed 's/C=;/C%3D;/' \
| sed 's/G=;/G%3D;/' \
| sed 's/T=;/T%3D;/' \
> clinvar.fixed.vcf
# Let's also compress and index the new file so we can use it as a database
$ bgzip clinvar.fixed.vcf
$ tabix clinvar.fixed.vcf.gz
We re-check the new file:
$ java -jar SnpSift.jar vcfCheck clinvar.fixed.vcf.gz
....................................................................................................
100000 ....................................................................................................
200000 ..................................................................................................
OK, we are done.
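The sed pipeline targets the two patterns seen in this particular file. A more general variant, sketched here with awk, encodes every '=' after the first one in each INFO entry, which should also cover an offending value that happens to be the last entry on the line. The function name `encode_stray_equals` is hypothetical, and the sketch has not been checked against other ClinVar releases.

```shell
#!/bin/sh
# Percent-encode every "=" after the first one in each INFO entry
# (uncompressed VCF on stdin, fixed VCF on stdout).
encode_stray_equals() {
    awk -F '\t' -v OFS='\t' '
        /^#/ { print; next }                  # header lines pass through unchanged
        {
            n = split($8, entries, ";")       # column 8 is INFO
            info = ""
            for (i = 1; i <= n; i++) {
                e = entries[i]
                eq = index(e, "=")            # first "=" separates key and value
                if (eq > 0) {
                    value = substr(e, eq + 1)
                    gsub(/[=]/, "%3D", value) # any further "=" is illegal
                    e = substr(e, 1, eq) value
                }
                info = (i == 1 ? e : info ";" e)
            }
            $8 = info
            print
        }
    '
}

# Possible usage:
#   zcat clinvar.vcf.gz | encode_stray_equals > clinvar.fixed.vcf
```

Restricting the substitution to column 8 of the data lines keeps the header and the other columns untouched.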