alexcis95/VARI3

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Package VARI3

Approach to VARIant-VARIant interaction through VARIable thresholds and hypothesis testing.
VARI3 automates the selection and analysis of the most promising SNPs to identify epistasis.

This package contains:

  • The VARI3 function.
  • The TLTO function (Two Locus To Odd).

Requirements

The VARI3 function requires Plink and ANNOVAR to be installed. Before explaining how the function works, we give the basic steps for installing both tools.

Plink

Plink can be downloaded from: https://www.cog-genomics.org/plink2

Download the version for your operating system (in our case, Linux 64-bit) and decompress the plink file into the directory where you want to keep it.

ANNOVAR

ANNOVAR can be downloaded from: http://annovar.openbioinformatics.org/en/latest/user-guide/download/#-for-gene-based-annotation

First, you have to register; you then receive an email containing the following link: http://www.openbioinformatics.org/annovar/download/0wgxR2rIVP/annovar.latest.tar.gz
Once ANNOVAR is downloaded, extract the archive. You will get a folder called annovar containing several Perl scripts with the suffix .pl. (Note that if you have already added the ANNOVAR path to your system's executable path, typing annotate_variation.pl is valid; otherwise type perl annotate_variation.pl.) Next, download the required database files with annotate_variation.pl by running the following commands in the terminal:

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/

annotate_variation.pl -buildver hg19 -downdb cytoBand humandb/

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar exac03 humandb/

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsnp147 humandb/

annotate_variation.pl -buildver hg19 -downdb -webfrom annovar dbnsfp30a humandb/

How to use VARI3

VARI3 automates the selection and analysis of the most promising SNPs to identify epistasis. The final result of the execution is a table with the key data of the epistatic interactions. The function requires the following R packages to be installed: parallel, readr, data.table and dplyr.
VARI3 also requires ANNOVAR, plink and the genotype of the cohort in bfile format. The configurable variables of VARI3 and their use are detailed below:

  • bfile The bfile variable is required. It defines the path and name of the Plink bfile (bed, bim and fam) without the extension; the three files must share the same name. For example, for genotype.bed, genotype.bim and genotype.fam the variable would be bfile = "genotype".

  • out The out variable sets the output directory where all files generated by VARI3 are written.

  • plink The plink variable is "plink" by default. If plink has not been added to the system executable path, set it to the path of the previously downloaded plink file.

  • varpl The varpl variable is "annotate_variation.pl" by default. If the script has not been added to the system executable path, set it to the path of the .pl file.

  • build The build variable sets the version of the reference genome; the default is build = "hg19".

  • db The db variable defines where the ANNOVAR reference databases are located; in the downloaded archive they are normally under "annovar/humandb/".
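Because bfile is given without an extension, it can help to confirm that the three Plink files exist before launching a run. The following is only a sketch of such a check; check_bfile is a hypothetical helper and not part of VARI3, and the demo files are empty placeholders created in a temporary directory.

```r
# Hypothetical helper (not part of VARI3): report which of the three
# Plink files exist for a given bfile prefix.
check_bfile <- function(bfile) {
  paths <- paste0(bfile, c(".bed", ".bim", ".fam"))
  present <- file.exists(paths)
  names(present) <- basename(paths)
  present
}

# Demo with placeholder files: only .bed and .bim are created,
# so the .fam entry is reported as missing.
prefix <- tempfile("genotype")
invisible(file.create(paste0(prefix, c(".bed", ".bim"))))
status <- check_bfile(prefix)
print(status)
```

If any entry is FALSE, the prefix passed as bfile is probably wrong or the files are not named consistently.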

The following variables set the parameters of the analysis:

  • AsT The AsT variable is the maximum p-value for selecting SNPs in the association analysis; the default is AsT = 0.0001.

  • MAF The MAF variable is the minimum MAF from which SNPs are selected for analysis; the default is MAF = 0.01, to avoid very rare SNPs.

  • covar The covar variable is the file containing the covariates, for example covar = "covarfile.cov". By default it is null.

  • ncov The ncov variable selects which covariates are taken from the covariates file. They are selected by column, as in plink. For example, to select the first four covariates of a file: ncov = "1-4".

  • ANNOVAR The ANNOVAR variable enables annotation with ANNOVAR; the default is ANNOVAR = T. Note that running with ANNOVAR = F is not implemented at the moment.

  • Wu The Wu variable controls whether the proposed correction for epistasis tests is applied; the default is Wu = T.

  • clump The clump variable controls whether clumping is performed; the default is clump = T. We recommend clumping if the analysis includes more than 2000 individuals.

  • core The core variable is the number of cores used; the default is core = detectCores() - 1.
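As an illustration of the core default, the sketch below computes "all cores but one" with the parallel package that ships with R. The guards for an undetectable core count and for single-core machines are our own assumption about sensible behaviour, not something VARI3 is documented to do.

```r
library(parallel)

# detectCores() can return NA when the count cannot be determined.
available <- detectCores()

# Assumed guard: never drop below one core.
n_cores <- if (is.na(available)) 1L else max(1L, available - 1L)
print(n_cores)
```

The resulting value could then be passed explicitly, for example core = n_cores.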

If you have a hypothesis about which genes are of interest and want a targeted analysis, specify the following variables.

The genelist variable defines the genes in which epistasis will be tested, that is, the genes that provide the secondary SNPs. It is optional and points to a file with the list of genes; the file must not have a header. For example, genelist = "gene.txt", where the "gene.txt" file looks like this:
TP53
MDM2
BCL2

The primarylist variable defines the genes from which the primary variants are selected. It is optional and points to a file with the list of genes in which the primary SNPs are searched; the file must not have a header. For example, primarylist = "prima.txt", where the "prima.txt" file looks like this:
TP53
MDM2
BCL2
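
Such a gene list can also be written from R. The sketch below writes one gene symbol per line without a header and reads the file back as a check; the temporary location is an assumption made for the example.

```r
# One gene symbol per line, no header.
genes <- c("TP53", "MDM2", "BCL2")
gene_file <- file.path(tempdir(), "gene.txt")
writeLines(genes, gene_file)

# Read the file back to confirm there is no header or blank line.
read_back <- readLines(gene_file)
print(read_back)
```

The path in gene_file is what would be passed as genelist (or, for a list of primary genes, as primarylist).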

How VARI3 does the analysis

If you don't specify any list of genes

VARI3 selects the most promising SNPs from the whole genome and tests them for epistasis across the whole genome.

VARI3 Steps:

VARI3 first runs an association analysis of the SNPs against the phenotype. The analysis can include covariates, defined through the covar and ncov variables. If clump = T, VARI3 then clumps the results to discard SNPs in LD; if clump = F, this step is skipped. Next, it filters the promising SNPs, favouring the selection of SNPs with high MAFs. The usual p-value threshold for declaring a SNP associated with a phenotype in a GWAS is on the order of 10^-8, but to promote the search for epistasis we relax this limit with the AsT variable, to 10^-5 by default. This allows us to select SNPs with a higher MAF, in which relevant interactions can be found. Within each MAF range we select the SNPs with the lowest p-values, as follows: 5 SNPs with a MAF between 0.01-0.05, 5 SNPs with a MAF between 0.05-0.1, 20 SNPs with a MAF between 0.1-0.2, 20 SNPs with a MAF between 0.2-0.3, 20 SNPs with a MAF between 0.3-0.4 and 30 SNPs with a MAF>0.4. Finally, the selected SNPs are tested for epistasis against all other SNPs in the genome. The results are written in a readable format to the file epiinform.txt.
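
The MAF-stratified selection described above can be sketched in base R as follows. This is not the code used inside VARI3: select_promising is a hypothetical function, the column names SNP, P and MAF are assumed, the treatment of values that fall exactly on a bin boundary is our own choice, and the demo table is synthetic.

```r
# Hypothetical sketch: keep SNPs with P < AsT, bin them by MAF and take
# the lowest-p SNPs in each bin up to the quotas 5, 5, 20, 20, 20, 30.
select_promising <- function(assoc, AsT,
                             quotas = c(5, 5, 20, 20, 20, 30)) {
  breaks <- c(0.01, 0.05, 0.1, 0.2, 0.3, 0.4, Inf)
  ok <- !is.na(assoc$P) & !is.na(assoc$MAF) &
    assoc$P < AsT & assoc$MAF >= 0.01
  keep <- assoc[ok, ]
  # Bins: [0.01,0.05], (0.05,0.1], ..., (0.4,Inf]  (boundary rule assumed)
  keep$bin <- cut(keep$MAF, breaks = breaks, include.lowest = TRUE)
  keep <- keep[order(keep$P), ]
  picked <- lapply(seq_along(quotas), function(i) {
    head(keep[as.integer(keep$bin) == i, ], quotas[i])
  })
  do.call(rbind, picked)
}

# Synthetic demo: 100 candidate SNPs in each of the six MAF ranges.
assoc <- data.frame(
  SNP = paste0("s", 1:600),
  P   = seq(1e-9, 1e-6, length.out = 600),
  MAF = rep(c(0.02, 0.07, 0.15, 0.25, 0.35, 0.45), each = 100),
  stringsAsFactors = FALSE
)
selected <- select_promising(assoc, AsT = 1e-5)
print(table(selected$bin))
```

With enough candidates in every range, the quotas add up to 100 primary SNPs; ranges with fewer candidates simply contribute fewer SNPs.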

For example:

VARI3(bfile = "path/bfile", out = "path",
      plink = "path/plink", varpl = "path/annotate_variation.pl",
      db = "path/annovar/humandb/")

If you specify genelist variable

VARI3 selects the most promising SNPs from the whole genome and tests them for epistasis against the genes listed in the genelist file.

VARI3 Steps:

VARI3 first runs an association analysis of the SNPs against the phenotype. The analysis can include covariates, defined through the covar and ncov variables. If clump = T, VARI3 then clumps the results to discard SNPs in LD; if clump = F, this step is skipped. Next, it filters the promising SNPs, favouring the selection of SNPs with high MAFs. The usual p-value threshold for declaring a SNP associated with a phenotype in a GWAS is on the order of 10^-8, but to promote the search for epistasis we relax this limit with the AsT variable, to 10^-5 by default. This allows us to select SNPs with a higher MAF, in which relevant interactions can be found. Within each MAF range we select the SNPs with the lowest p-values, as follows: 5 SNPs with a MAF between 0.01-0.05, 5 SNPs with a MAF between 0.05-0.1, 20 SNPs with a MAF between 0.1-0.2, 20 SNPs with a MAF between 0.2-0.3, 20 SNPs with a MAF between 0.3-0.4 and 30 SNPs with a MAF>0.4. Finally, the selected SNPs are tested for epistasis against the SNPs of the genes listed in the file referenced by the genelist variable. The results are written in a readable format to the file epiinform.txt.

For example:

VARI3(bfile = "path/bfile", out = "path",
      plink = "path/plink", varpl = "path/annotate_variation.pl",
      db = "path/annovar/humandb/", genelist = "path/genefile.txt")

If you specify genelist variable and primarylist variable

VARI3 selects the most promising SNPs from the genes listed in the primarylist file and tests them for epistasis against the genes listed in the genelist file.

VARI3 Steps:

VARI3 first runs an association analysis against the phenotype, restricted to the SNPs of the genes listed in the file referenced by the primarylist variable. The analysis can include covariates, defined through the covar and ncov variables. If clump = T, VARI3 then clumps the results to discard SNPs in LD; if clump = F, this step is skipped. Next, it filters the promising SNPs, favouring the selection of SNPs with high MAFs. The usual p-value threshold for declaring a SNP associated with a phenotype in a GWAS is on the order of 10^-8, but to promote the search for epistasis we relax this limit with the AsT variable, to 10^-5 by default. This allows us to select SNPs with a higher MAF, in which relevant interactions can be found. Within each MAF range we select the SNPs with the lowest p-values, as follows: 5 SNPs with a MAF between 0.01-0.05, 5 SNPs with a MAF between 0.05-0.1, 20 SNPs with a MAF between 0.1-0.2, 20 SNPs with a MAF between 0.2-0.3, 20 SNPs with a MAF between 0.3-0.4 and 30 SNPs with a MAF>0.4. Finally, the selected SNPs are tested for epistasis against the SNPs of the genes listed in the file referenced by the genelist variable. The results are written in a readable format to the file epiinform.txt.

For example:

VARI3(bfile = "path/bfile", out = "path",
      plink = "path/plink", varpl = "path/annotate_variation.pl",
      db = "path/annovar/humandb/", genelist = "path/genefile.txt",
      primarylist = "path/prigenelist.txt")

How to see the VARI3 results

Results are shown in the file epiinform.txt. For example, we can see them in R:

epiinform = read_delim("epiinform.txt",
                       delim = " ", col_types = "cnnncccnnnccnnccc")
SNP CHISQ TBf Pepi SNP2 GEN1 GEN2 F_A1 ORl1 Pl1 LOC1 F_A2 Pl2 ORl2 LOC2
4:90666041 17.69 1.989e-06 2.607e-05 6:166956680 SNCA RPS6KA2 0.3926 1.2300 2.595e-09 intronic 0.38480 0.514300 1.0230 intronic

This example shows the 15 columns of the epiinform file and one variant-variant interaction. Each column is explained below:

  • SNP The SNP column is the genomic position of the primary SNP, which VARI3 selected to test for epistasis.
  • CHISQ The CHISQ column is the chi-squared statistic of the interaction.
  • TBf The TBf column is the Bonferroni p-value threshold below which the interaction is considered significant.
  • Pepi The Pepi column is the p-value of the epistatic interaction. If Pepi is lower than TBf, the interaction is significant.
  • SNP2 The SNP2 column is the genomic position of the second SNP, the one showing the strongest epistatic interaction with the first SNP (SNP column).
  • GEN1 The GEN1 column is the gene symbol of the SNP in the SNP column.
  • GEN2 The GEN2 column is the gene symbol of the SNP in the SNP2 column.
  • F_A1 The F_A1 column is the MAF of the SNP in the SNP column.
  • ORl1 The ORl1 column is the odds ratio of the SNP in the SNP column.
  • Pl1 The Pl1 column is the p-value of the association between the SNP in the SNP column and the phenotype.
  • LOC1 The LOC1 column is the location within the gene of the SNP in the SNP column.
  • F_A2 The F_A2 column is the MAF of the SNP in the SNP2 column.
  • Pl2 The Pl2 column is the p-value of the association between the SNP in the SNP2 column and the phenotype.
  • ORl2 The ORl2 column is the odds ratio of the SNP in the SNP2 column.
  • LOC2 The LOC2 column is the location within the gene of the SNP in the SNP2 column.
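
To show how Pepi and TBf are compared in practice, the sketch below parses the example row shown above and flags interactions with Pepi lower than TBf. It uses base R's read.table on inline text instead of readr purely to stay self-contained; the significant column is our own addition and is not part of the epiinform file.

```r
# The header and example row from the epiinform.txt excerpt above.
example_text <- "SNP CHISQ TBf Pepi SNP2 GEN1 GEN2 F_A1 ORl1 Pl1 LOC1 F_A2 Pl2 ORl2 LOC2
4:90666041 17.69 1.989e-06 2.607e-05 6:166956680 SNCA RPS6KA2 0.3926 1.2300 2.595e-09 intronic 0.38480 0.514300 1.0230 intronic"

epi <- read.table(text = example_text, header = TRUE,
                  stringsAsFactors = FALSE)

# An interaction is significant when Pepi is below the threshold TBf.
epi$significant <- epi$Pepi < epi$TBf
print(epi[, c("SNP", "SNP2", "GEN1", "GEN2", "Pepi", "TBf", "significant")])
```

For this row Pepi (2.607e-05) is larger than TBf (1.989e-06), so the SNCA and RPS6KA2 interaction does not pass the threshold.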

How to use TLTO

TLTO automates the conversion of the two-locus ratios from plink into a graph and a table of odds ratios, to make the epistasis easier to interpret. The function requires the following R packages to be installed: parallel, readr, data.table, dplyr, reshape2, tidyr and ggplot2.
TLTO also requires plink and the genotype of the cohort in bfile format. The configurable variables of TLTO and their use are detailed below:

  • l1 The l1 variable is required. It specifies the position of the first SNP. For example, l1 = "5:141025162".

  • l2 The l2 variable is required. It specifies the position of the second SNP. For example, l2 = "4:90637601".

  • bfile The bfile variable is required. It defines the path and name of the Plink bfile (bed, bim and fam) without the extension; the three files must share the same name. For example, for genotype.bed, genotype.bim and genotype.fam the variable would be bfile = "genotype".

  • out The out variable sets the output directory where all files generated by TLTO are written.

  • core The core variable is the number of cores used; the default is core = detectCores() - 1.

  • plink The plink variable is "plink" by default. If plink has not been added to the system executable path, set it to the path of the previously downloaded plink file.

For example:

TLTO(l1 = "17:43472321",
     l2 = "10:26659553",
     bfile = "UNRELATED.SPAIN4.HARDCALLS.Rsq0.8", # bfile without extension
     out = "Desktop/")
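
To illustrate what a two-locus odds-ratio table expresses, the sketch below computes, for each of the nine genotype combinations, the odds of being a case relative to a reference combination. It is not the TLTO implementation: the counts are toy numbers invented for the example, the genotype labels are placeholders, and taking the first cell as the reference is an assumption.

```r
genos_l1 <- c("AA", "Aa", "aa")
genos_l2 <- c("BB", "Bb", "bb")

# Toy case and control counts per genotype combination (invented values).
cases <- matrix(c(30, 20,  5,
                  25, 30, 10,
                   8, 15, 12),
                nrow = 3, byrow = TRUE,
                dimnames = list(l1 = genos_l1, l2 = genos_l2))
controls <- matrix(c(60, 35,  8,
                     45, 40, 10,
                     12, 15,  6),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(l1 = genos_l1, l2 = genos_l2))

# Odds of being a case in each cell, then relative to the AA/BB cell.
odds <- cases / controls
or_table <- odds / odds["AA", "BB"]
print(round(or_table, 2))
```

Values above 1 mark genotype combinations with higher odds of being a case than the reference. Cells with zero counts would need special handling, which this sketch omits.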

Disclaimer

This function was developed by Alejandro Cisterna García as part of his PhD, mentored by Juan Antonio Botía Blaya. This package is only available to IPDGC members; other users who want to use it should contact [email protected]
