Skip to content

vianeyBE/CassBio.GWAS_Pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 

Repository files navigation

Cassava Bioinformatics Platform: GWAS Pipeline

Pipeline for Genome-Wide Association Studies (GWAS) in Cassava (Manihot esculenta)

Description

This repository contains a modular pipeline for performing genome-wide association studies (GWAS) in cassava (Manihot esculenta). It covers the workflow from genotype quality control, linkage disequilibrium analysis, population structure analysis, and GWAS analysis, to functional annotation of significant markers.

Authors

For questions or feedback about this pipeline, please contact:

Requirements

Technologies and tools

  • R (>= 4.1.0)
  • GAPIT3
  • VCFtools
  • BCFtools
  • PopLDdecay
  • adegenet, vegan, SNPRelate, plotly, tidyverse, etc.

Installation

git clone https://github.com/vianeyBE/cassava-gwas-pipeline.git

⚙️ Workflow Overview

| 1 | Quality control | ✅ Completed |

| 2 | LD decay analysis | 🔧 In progress |

| 3 | LD pruning | 🔧 In progress |

| 4 | Population structure analysis | ✅ Completed |

| 5 | GWAS - GAPIT3 | ✅ Completed |

| 6 | GWAS - EMMAX | 🔧 In progress |

| 7 | Functional annotation | 🔧 In progress |

| 8 | Marker validation boxplots | ✅ Completed |

1. Quality control

Description

A simple Bash script for filtering Variant Call Format (VCF) files using VCFtools. The script applies customizable quality control filters to genetic variant data. The script currently includes three different filtering strategies (commented and active), allowing users to select the most appropriate filter set for their project.

Arguments

  • input_vcf: Path to VCF file to filter.
  • output: Prefix to the output VCF filtered file.

Usage

bash 01_QC.sh <input_vcf> <output>

Example

bash 01_QC.sh raw_variants.vcf.gz filtered_variants

Dependencies

2. LD decay

Description

This script performs Linkage Disequilibrium (LD) Decay analysis using PopLDdecay. It supports two modes:

  • Whole genome: Calculates LD decay across the entire genome.
  • Per chromosome: Splits the VCF file by chromosome and calculates LD decay for each chromosome separately.

Arguments

  • dirIn: Directory containing the input VCF file.
  • vcfFile: Name of the input VCF file.
  • dirOut: Directory where output files will be saved.
  • prefix: Prefix for output files.
  • dist: Maximum distance (in base pairs) for pairwise LD calculation.
  • mode: Analysis mode — either ''whole_genome'' or ''by_chr''.

Usage

bash 02_LD_Decay.sh <dirIn> <vcfFile> <dirOut> <prefix> <dist> <mode>

Example

bash 02_LD_Decay.sh /path/to/vcf gs.vcf /path/to/output gs_2023 10000 whole_genome

Dependencies

3. LD pruning

Description

XXX

Arguments

  • XXX:

Usage

bash 

Example

bash 

Dependencies

  • XXX:

4. Population structure and covariable selection

This script contains the customized functions to perform four different multivariate methods: (1) Non-Metric Multidimensional Scaling (NDMS), (2) Multidimensional Scaling (MDS), (3) Principal Component Analysis (PCA), and (4) Discriminant Analysis of Principal Components (DAPC). This series of functions retrieves the required pacakges and data to perform four multivariate methods, potentially used to study population structure and select covariable for next analyses. Especifically, this script includes the functions for Non-Metric Multidimensional Scaling (NDMS), Multidimensional Scaling (MDS), Principal Component Analysis (PCA), and Discriminant Analysis of Principal Components (DAPC). The first two functions are built only for phenotypic data while the PCA can handle both data type (phenotypic or genotypic), and the DAPC only uses genotypic data. For all functions interactive plots (made using plotly package) in html format. The DAPC part is not built as a function give the nature of the functions used in which prompts are necessary to continue. Please run this methods line by line.

Arguments

  1. For NDMS:
  • dir: Directory where data is located.
  • phenofile: A database of genotypes/individuals (rows) and their traits (columns). First column must be genotypes/individuals names.
  • dist: Dissimilarity index/measure to use (default = "bray").
  • groups: Boolean value indicating whether the data includes different treatments/groups (default = F). If TRUE is selected then the second column of the phenofile must be the groups/treatments.
  1. For MDS:
  • dir: Directory where data is located.
  • phenofile: A database of genotypes/individuals (rows) and their traits (columns). First column must be genotypes/individuals names.
  • dist: Dissimilarity index/measure to use (default = "gower").
  • groups: Boolean value indicating whether the data includes different treatments/groups (default = F). If TRUE is selected then the second column of the phenofile must be the groups/treatments.
  1. For PCA:
  • dir: Directory where data is located.
  • type: A character string indicating wheter data is genotypic ("geno") of phenotypic ("pheno").
  • groups: Boolean value indicating whether the data includes different treatments/groups (default = F). If TRUE is selected then the second column of the phenofile must be the groups/treatments.
  • phenofile: A database of genotypes/individuals (rows) and their traits (columns). First column must be genotypes/individuals names.
  • genofile: An object of class SNPGDSFileClass (GDS file read with the snpgdsOpen function from SNPRelate package).
  • labels: When provide a genofile and groups argument is TRUE, please provide a dataframe with genotypes/individuals in the first column and groups/treatment data in the second column.
  • gds: A Genomic Data Structures (GDS) file (a reformatted VCF file with the snpgdsVCF2GDS function from vcfR package).
  • vcf: A Variant Call Format (VCF) file containing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations.
  • PC.retain: Boolean value indicating whether to analyze how many PCs retain (default = F).
  1. For DAPC:
  • my_genind: An object of class genind (vcf file read with the vcfR2genind function from vcfR package).

Usage

# NDMS, MDS, and PCA
NDMS(dir, phenofile, dist = "bray", groups = F)
MDS(dir, phenofile, dist = "gower", groups = F)
PCA(dir, type, groups = F, phenofile = NULL, labels = NULL, genofile = NULL, vcf = NULL, gds = NULL, PC.retain = F)

# DAPC
library(adegenet)
grp <- find.clusters(my_genind)
dapc <- dapc(my_genind, grp$grp)

Example

# NDMS, MDS, and PCA
NDMS(dir, phenofile, dist = "bray", groups = F)
MDS(dir, phenofile, dist = "gower", groups = F)
PCA(dir, type, groups = F, phenofile = NULL, labels = NULL, genofile = NULL, vcf = NULL, gds = NULL, PC.retain = F)

# DAPC
library(adegenet)
grp <- find.clusters(my_genind)
dapc <- dapc(my_genind, grp$grp)

Dependencies

  • vegan
  • tidyverse
  • plotly
  • htmlwidgets
  • SNPRelate
  • gdsfmt
  • psych
  • adegenet
  • grDevices
  • vcfR

5. GWAS: GAPIT

This R script runs a Genome-Wide Association Study (GWAS) analysis using the GAPIT3 R package. It saves the results of each trait in an individual folder. This function loads the required packages and data, then performs a GWAS analysis using the GAPIT function from the GAPIT3 package. It loops through each trait in trait_list, creating a new folder for each trait in the working directory and saving the results of the GWAS analysis for that trait in that folder.

Arguments

  • phenofile: A database of genotypes/individuals (rows) and the trait's measurements or BLUPs (columns).
  • genofile: Genotype data in hapmap format.
  • wdir: A working directory where the function will create the folders for each trait.
  • trait_list: A vector with the trait's name. The default is NULL, which will use the full phenotype dataset.

Usage

GAPIT3(phenofile, genofile, wdir, trait_list = NULL)

Examples

GAPIT3(phenofile = "my_phenotypes.csv", genofile = "my_genotypes.hmp", wdir = "my_results_folder", trait_list = c("Trait1", "Trait2", "Trait3"))

Dependencies

  • devtools
  • GAPIT3

6. GWAS: EMMAX

XXX

Arguments

  • XXX:

Usage

conda 

Examples

conda

Dependencies

  • XXX

6. Annotation of results

This code annotates the gen containing the significat SNPs from the GAPIT3 results. Additional, it retrieves the closest genes downstream and upstream.

Arguments

  • pat: Location path of the files.
  • wdyw: Gene feature to annotate.
  • Mdir: Name of the directory that contains the GAPIT3 results.
  • mod: Names of the models to filter (options: )
  • gff3: gff3 file from the genome version used for alignment.
  • annotationFile: Annotation details of the genes. txt file from the genome version used for alignment.
  • version: (Options: 6.1 or 8.1. Default = 6.1).

Usage

GWAS_Annotation(Mdir, pat, mod, wdyw, annot, GFF)

Examples

GWAS_Annotation(Mdir, pat, mod, wdyw, annot, GFF)

Dependencies

  • XXX

7. Boxplot: Genotype vs Phenotype

This R function generates a boxplot for a given SNP. The function takes as inputs a CSV file with the phenotype values and a list of SNPs. The function can also receive an optional CSV file to add extra information about the samples to the plot (categories, family, ect.) The output of the function is a PDF file with the plot. The function loads all the necessary packages and data files. It then checks for the presence of an optional file with sample labels and prepares the data for the boxplot. The function generates a boxplot for each SNP specified in the input CSV file. It uses ggplot2 package to generate the plot, with the genotypes on the x-axis and the phenotype on the y-axis. The function adds a label to the x-axis to indicate the trait being plotted. If labelfile is provided, it adds extra information about the samples, coloring the data by the levels in the provided file.

Arguments

  • outputname: (required) character string with the base name for the output file.
  • dir: (required) character string with the directory where the output file will be saved.
  • phenofile: (required) character string with the name of the phenotype file in tabular format. The first column should contain the sample names, and the rest of the columns should contain the phenotypes.
  • genofile: (required) character string with the name of the genotype file in hapmap format.
  • snp_list_file: (required) character string with the name of the CSV file with three columns:
    • Column 01 - Name: SNPS. List of SNPS to plot. The name should be the same as in the genofile data.
    • Column 02 - Name: trait. Name of the trait as in the phenofile data.
    • Column 03 - Name: xlabel. Name of the trait to be included as a label.
  • labelfile: (optional) character string with the name of the CSV file with two columns:
    • Column 01 - Name: Taxa. Sample names.
    • Column 02 - Name: label. Label or category to add to the plot.

Usage

GWAS_Boxplot(outputname, dir, phenofile, genofile, snp_list_file, labelfile = NULL)

Example

# Generate boxplot without extra labels
GWAS_Boxplot("outputname", ".path/to/save/plots/", "phenotype.csv", "genotype.hmp", "snp_list.csv")

# Generate boxplot with extra labels
GWAS_Boxplot("outputname", ".path/to/save/plots/", "phenotype.csv", "genotype.hmp", "snp_list.csv", "labelfile.csv")

Dependencies

  • tidyverse
  • tibble
  • dplyr
  • janitor
  • ggplot2
  • Biostrings
  • hrbrthemes
  • forcats
  • ggsignif
  • RColorBrewer

References

  • Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li, Twelve years of SAMtools and BCFtools, GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008

  • Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert E. Handsaker, Gerton Lunter, Gabor T. Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin, 1000 Genomes Project Analysis Group, The variant call format and VCFtools, Bioinformatics, Volume 27, Issue 15, August 2011, Pages 2156–2158, https://doi.org/10.1093/bioinformatics/btr330

  • B L Browning, X Tian, Y Zhou, and S R Browning (2021) Fast two-stage phasing of large-scale sequence data. Am J Hum Genet 108(10):1880-1890. doi:10.1016/j.ajhg.2021.08.005

  • B L Browning, Y Zhou, and S R Browning (2018). A one-penny imputed genome from next generation reference panels. Am J Hum Genet 103(3):338-348. doi:10.1016/j.ajhg.2018.07.015

  • Chi Zhang, Shan-Shan Dong, Jun-Yang Xu, Wei-Ming He, Tie-Lin Yang, PopLDdecay: a fast and effective tool for linkage disequilibrium decay analysis based on variant call format files, Bioinformatics, Volume 35, Issue 10, May 2019, Pages 1786–1788, https://doi.org/10.1093/bioinformatics/bty875

  • Thibaut Jombart, adegenet: a R package for the multivariate analysis of genetic markers, Bioinformatics, Volume 24, Issue 11, June 2008, Pages 1403–1405, https://doi.org/10.1093/bioinformatics/btn129

  • Xiuwen Zheng, David Levine, Jess Shen, Stephanie M. Gogarten, Cathy Laurie, Bruce S. Weir, A high-performance computing toolset for relatedness and principal component analysis of SNP data, Bioinformatics, Volume 28, Issue 24, December 2012, Pages 3326–3328, https://doi.org/10.1093/bioinformatics/bts606

  • Wang J., Zhang Z., GAPIT Version 3: Boosting Power and Accuracy for Genomic Association and Prediction, Genomics, Proteomics & Bioinformatics (2021), doi: https://doi.org/10.1016/j.gpb.2021.08.005.

  • Huang M, Liu X, Zhou Y, Summers RM, Zhang Z. BLINK: A package for the next level of genome-wide association studies with both individuals and markers in the millions. Gigascience. https://doi.org/10.1093/gigascience/giy154.

  • Liu X., Huang M., Fan B., Buckler E. S., Zhang Z., 2016 Iterative Usage of Fixed and Random Effect Models for Powerful and Efficient Genome-Wide Association Studies. PLoS Genet. 12: e1005767. https://doi.org/10.1371/journal.pgen.1005767.

  • David M. Goodstein, Shengqiang Shu, Russell Howson, Rochak Neupane, Richard D. Hayes, Joni Fazo, Therese Mitros, William Dirks, Uffe Hellsten, Nicholas Putnam, and Daniel S. Rokhsar, Phytozome: a comparative platform for green plant genomics, Nucleic Acids Res. 2012 40 (D1): D1178-D1186

  • R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

About

Pipeline for Genome-wide association studies

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •