Cassava Bioinformatics Platform: GWAS Pipeline

Pipeline for Genome-Wide Association Studies (GWAS) in Cassava (Manihot esculenta)

Description

This repository contains a modular pipeline for performing genome-wide association studies (GWAS) in cassava (Manihot esculenta). It covers the workflow from genotype quality control, linkage disequilibrium analysis, population structure analysis, and GWAS analysis, to functional annotation of significant markers.

Authors

For questions or feedback about this pipeline, please contact:

Vianey Barrera-Enriquez - [CGIAR]: [email protected]
Camilo E. Sanchez - [CGIAR]: [email protected]

Requirements

Technologies and tools

R (>= 4.1.0)
GAPIT3
VCFtools
BCFtools
PopLDdecay
adegenet, vegan, SNPRelate, plotly, tidyverse, etc.

Installation

git clone https://github.com/vianeyBE/cassava-gwas-pipeline.git

⚙️ Workflow Overview

| 1 | Quality control | ✅ Completed |

| 2 | LD decay analysis | 🔧 In progress |

| 3 | LD pruning | 🔧 In progress |

| 4 | Population structure analysis | ✅ Completed |

| 5 | GWAS - GAPIT3 | ✅ Completed |

| 6 | GWAS - EMMAX | 🔧 In progress |

| 7 | Functional annotation | 🔧 In progress |

| 8 | Marker validation boxplots | ✅ Completed |

1. Quality control

Description

A simple Bash script for filtering Variant Call Format (VCF) files using VCFtools. The script applies customizable quality control filters to genetic variant data. The script currently includes three different filtering strategies (commented and active), allowing users to select the most appropriate filter set for their project.

Arguments

input_vcf: Path to VCF file to filter.
output: Prefix to the output VCF filtered file.

Usage

bash 01_QC.sh <input_vcf> <output>

Example

bash 01_QC.sh raw_variants.vcf.gz filtered_variants

Dependencies

VCFtools: https://vcftools.github.io/

2. LD decay

Description

This script performs Linkage Disequilibrium (LD) Decay analysis using PopLDdecay. It supports two modes:

Whole genome: Calculates LD decay across the entire genome.
Per chromosome: Splits the VCF file by chromosome and calculates LD decay for each chromosome separately.

Arguments

dirIn: Directory containing the input VCF file.
vcfFile: Name of the input VCF file.
dirOut: Directory where output files will be saved.
prefix: Prefix for output files.
dist: Maximum distance (in base pairs) for pairwise LD calculation.
mode: Analysis mode — either ''whole_genome'' or ''by_chr''.

Usage

bash 02_LD_Decay.sh <dirIn> <vcfFile> <dirOut> <prefix> <dist> <mode>

Example

bash 02_LD_Decay.sh /path/to/vcf gs.vcf /path/to/output gs_2023 10000 whole_genome

Dependencies

PopLDdecay: https://github.com/BGI-shenzhen/PopLDdecay
vcftools: https://vcftools.github.io/
perl: For the PopLDdecay plotting script

3. LD pruning

Description

XXX

Arguments

XXX:

Usage

bash

Example

bash

Dependencies

XXX:

4. Population structure and covariable selection

This script contains the customized functions to perform four different multivariate methods: (1) Non-Metric Multidimensional Scaling (NDMS), (2) Multidimensional Scaling (MDS), (3) Principal Component Analysis (PCA), and (4) Discriminant Analysis of Principal Components (DAPC). This series of functions retrieves the required pacakges and data to perform four multivariate methods, potentially used to study population structure and select covariable for next analyses. Especifically, this script includes the functions for Non-Metric Multidimensional Scaling (NDMS), Multidimensional Scaling (MDS), Principal Component Analysis (PCA), and Discriminant Analysis of Principal Components (DAPC). The first two functions are built only for phenotypic data while the PCA can handle both data type (phenotypic or genotypic), and the DAPC only uses genotypic data. For all functions interactive plots (made using plotly package) in html format. The DAPC part is not built as a function give the nature of the functions used in which prompts are necessary to continue. Please run this methods line by line.

Arguments

For NDMS:

dir: Directory where data is located.
phenofile: A database of genotypes/individuals (rows) and their traits (columns). First column must be genotypes/individuals names.
dist: Dissimilarity index/measure to use (default = "bray").
groups: Boolean value indicating whether the data includes different treatments/groups (default = F). If TRUE is selected then the second column of the phenofile must be the groups/treatments.

For MDS:

dir: Directory where data is located.
phenofile: A database of genotypes/individuals (rows) and their traits (columns). First column must be genotypes/individuals names.
dist: Dissimilarity index/measure to use (default = "gower").
groups: Boolean value indicating whether the data includes different treatments/groups (default = F). If TRUE is selected then the second column of the phenofile must be the groups/treatments.

For PCA:

dir: Directory where data is located.
type: A character string indicating wheter data is genotypic ("geno") of phenotypic ("pheno").
groups: Boolean value indicating whether the data includes different treatments/groups (default = F). If TRUE is selected then the second column of the phenofile must be the groups/treatments.
phenofile: A database of genotypes/individuals (rows) and their traits (columns). First column must be genotypes/individuals names.
genofile: An object of class SNPGDSFileClass (GDS file read with the snpgdsOpen function from SNPRelate package).
labels: When provide a genofile and groups argument is TRUE, please provide a dataframe with genotypes/individuals in the first column and groups/treatment data in the second column.
gds: A Genomic Data Structures (GDS) file (a reformatted VCF file with the snpgdsVCF2GDS function from vcfR package).
vcf: A Variant Call Format (VCF) file containing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations.
PC.retain: Boolean value indicating whether to analyze how many PCs retain (default = F).

For DAPC:

my_genind: An object of class genind (vcf file read with the vcfR2genind function from vcfR package).

Usage

# NDMS, MDS, and PCA
NDMS(dir, phenofile, dist = "bray", groups = F)
MDS(dir, phenofile, dist = "gower", groups = F)
PCA(dir, type, groups = F, phenofile = NULL, labels = NULL, genofile = NULL, vcf = NULL, gds = NULL, PC.retain = F)

# DAPC
library(adegenet)
grp <- find.clusters(my_genind)
dapc <- dapc(my_genind, grp$grp)

Example

# NDMS, MDS, and PCA
NDMS(dir, phenofile, dist = "bray", groups = F)
MDS(dir, phenofile, dist = "gower", groups = F)
PCA(dir, type, groups = F, phenofile = NULL, labels = NULL, genofile = NULL, vcf = NULL, gds = NULL, PC.retain = F)

# DAPC
library(adegenet)
grp <- find.clusters(my_genind)
dapc <- dapc(my_genind, grp$grp)

Dependencies

vegan
tidyverse
plotly
htmlwidgets
SNPRelate
gdsfmt
psych
adegenet
grDevices
vcfR

5. GWAS: GAPIT

This R script runs a Genome-Wide Association Study (GWAS) analysis using the GAPIT3 R package. It saves the results of each trait in an individual folder. This function loads the required packages and data, then performs a GWAS analysis using the GAPIT function from the GAPIT3 package. It loops through each trait in trait_list, creating a new folder for each trait in the working directory and saving the results of the GWAS analysis for that trait in that folder.

Arguments

phenofile: A database of genotypes/individuals (rows) and the trait's measurements or BLUPs (columns).
genofile: Genotype data in hapmap format.
wdir: A working directory where the function will create the folders for each trait.
trait_list: A vector with the trait's name. The default is NULL, which will use the full phenotype dataset.

Usage

GAPIT3(phenofile, genofile, wdir, trait_list = NULL)

Examples

GAPIT3(phenofile = "my_phenotypes.csv", genofile = "my_genotypes.hmp", wdir = "my_results_folder", trait_list = c("Trait1", "Trait2", "Trait3"))

Dependencies

devtools
GAPIT3

6. GWAS: EMMAX

XXX

Arguments

XXX:

Usage

conda

Examples

conda

Dependencies

XXX

6. Annotation of results

This code annotates the gen containing the significat SNPs from the GAPIT3 results. Additional, it retrieves the closest genes downstream and upstream.

Arguments

pat: Location path of the files.
wdyw: Gene feature to annotate.
Mdir: Name of the directory that contains the GAPIT3 results.
mod: Names of the models to filter (options: )
gff3: gff3 file from the genome version used for alignment.
annotationFile: Annotation details of the genes. txt file from the genome version used for alignment.
version: (Options: 6.1 or 8.1. Default = 6.1).

Usage

GWAS_Annotation(Mdir, pat, mod, wdyw, annot, GFF)

Examples

GWAS_Annotation(Mdir, pat, mod, wdyw, annot, GFF)

Dependencies

XXX

7. Boxplot: Genotype vs Phenotype

This R function generates a boxplot for a given SNP. The function takes as inputs a CSV file with the phenotype values and a list of SNPs. The function can also receive an optional CSV file to add extra information about the samples to the plot (categories, family, ect.) The output of the function is a PDF file with the plot. The function loads all the necessary packages and data files. It then checks for the presence of an optional file with sample labels and prepares the data for the boxplot. The function generates a boxplot for each SNP specified in the input CSV file. It uses ggplot2 package to generate the plot, with the genotypes on the x-axis and the phenotype on the y-axis. The function adds a label to the x-axis to indicate the trait being plotted. If labelfile is provided, it adds extra information about the samples, coloring the data by the levels in the provided file.

Arguments

outputname: (required) character string with the base name for the output file.
dir: (required) character string with the directory where the output file will be saved.
phenofile: (required) character string with the name of the phenotype file in tabular format. The first column should contain the sample names, and the rest of the columns should contain the phenotypes.
genofile: (required) character string with the name of the genotype file in hapmap format.
snp_list_file: (required) character string with the name of the CSV file with three columns:
- Column 01 - Name: SNPS. List of SNPS to plot. The name should be the same as in the genofile data.
- Column 02 - Name: trait. Name of the trait as in the phenofile data.
- Column 03 - Name: xlabel. Name of the trait to be included as a label.
labelfile: (optional) character string with the name of the CSV file with two columns:
- Column 01 - Name: Taxa. Sample names.
- Column 02 - Name: label. Label or category to add to the plot.

Usage

GWAS_Boxplot(outputname, dir, phenofile, genofile, snp_list_file, labelfile = NULL)

Example

# Generate boxplot without extra labels
GWAS_Boxplot("outputname", ".path/to/save/plots/", "phenotype.csv", "genotype.hmp", "snp_list.csv")

# Generate boxplot with extra labels
GWAS_Boxplot("outputname", ".path/to/save/plots/", "phenotype.csv", "genotype.hmp", "snp_list.csv", "labelfile.csv")

Dependencies

tidyverse
tibble
dplyr
janitor
ggplot2
Biostrings
hrbrthemes
forcats
ggsignif
RColorBrewer

References

Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li, Twelve years of SAMtools and BCFtools, GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008
Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert E. Handsaker, Gerton Lunter, Gabor T. Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin, 1000 Genomes Project Analysis Group, The variant call format and VCFtools, Bioinformatics, Volume 27, Issue 15, August 2011, Pages 2156–2158, https://doi.org/10.1093/bioinformatics/btr330
B L Browning, X Tian, Y Zhou, and S R Browning (2021) Fast two-stage phasing of large-scale sequence data. Am J Hum Genet 108(10):1880-1890. doi:10.1016/j.ajhg.2021.08.005
B L Browning, Y Zhou, and S R Browning (2018). A one-penny imputed genome from next generation reference panels. Am J Hum Genet 103(3):338-348. doi:10.1016/j.ajhg.2018.07.015
Chi Zhang, Shan-Shan Dong, Jun-Yang Xu, Wei-Ming He, Tie-Lin Yang, PopLDdecay: a fast and effective tool for linkage disequilibrium decay analysis based on variant call format files, Bioinformatics, Volume 35, Issue 10, May 2019, Pages 1786–1788, https://doi.org/10.1093/bioinformatics/bty875
Thibaut Jombart, adegenet: a R package for the multivariate analysis of genetic markers, Bioinformatics, Volume 24, Issue 11, June 2008, Pages 1403–1405, https://doi.org/10.1093/bioinformatics/btn129
Xiuwen Zheng, David Levine, Jess Shen, Stephanie M. Gogarten, Cathy Laurie, Bruce S. Weir, A high-performance computing toolset for relatedness and principal component analysis of SNP data, Bioinformatics, Volume 28, Issue 24, December 2012, Pages 3326–3328, https://doi.org/10.1093/bioinformatics/bts606
Wang J., Zhang Z., GAPIT Version 3: Boosting Power and Accuracy for Genomic Association and Prediction, Genomics, Proteomics & Bioinformatics (2021), doi: https://doi.org/10.1016/j.gpb.2021.08.005.
Huang M, Liu X, Zhou Y, Summers RM, Zhang Z. BLINK: A package for the next level of genome-wide association studies with both individuals and markers in the millions. Gigascience. https://doi.org/10.1093/gigascience/giy154.
Liu X., Huang M., Fan B., Buckler E. S., Zhang Z., 2016 Iterative Usage of Fixed and Random Effect Models for Powerful and Efficient Genome-Wide Association Studies. PLoS Genet. 12: e1005767. https://doi.org/10.1371/journal.pgen.1005767.
David M. Goodstein, Shengqiang Shu, Russell Howson, Rochak Neupane, Richard D. Hayes, Joni Fazo, Therese Mitros, William Dirks, Uffe Hellsten, Nicholas Putnam, and Daniel S. Rokhsar, Phytozome: a comparative platform for green plant genomics, Nucleic Acids Res. 2012 40 (D1): D1178-D1186
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Name		Name	Last commit message	Last commit date
Latest commit History 103 Commits
Scripts		Scripts
GWAS_Notebook.Rmd		GWAS_Notebook.Rmd
GWAS_Notebook.nb.html		GWAS_Notebook.nb.html
README.md		README.md

vianeyBE/CassBio.GWAS_Pipeline

Folders and files

Latest commit

History

Repository files navigation

Cassava Bioinformatics Platform: GWAS Pipeline

Description

Authors

Requirements

Technologies and tools

Installation

⚙️ Workflow Overview

1. Quality control

Description

Arguments

Usage

Example

Dependencies

2. LD decay

Description

Arguments

Usage

Example

Dependencies

3. LD pruning

Description

Arguments

Usage

Example

Dependencies

4. Population structure and covariable selection

Arguments

Usage

Example

Dependencies

5. GWAS: GAPIT

Arguments

Usage

Examples

Dependencies

6. GWAS: EMMAX

Arguments

Usage

Examples

Dependencies

6. Annotation of results

Arguments

Usage

Examples

Dependencies

7. Boxplot: Genotype vs Phenotype

Arguments

Usage

Example

Dependencies

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages