Update documentation
salvidm committed Dec 15, 2023
1 parent 1935820 commit 14b8f1e
Showing 6 changed files with 22 additions and 20 deletions.
2 changes: 1 addition & 1 deletion README.md
[![nf-gwas](https://github.com/genepi/nf-gwas/actions/workflows/ci-tests.yml/badge.svg)](https://github.com/genepi/nf-gwas/actions/workflows/ci-tests.yml)
[![nf-test](https://img.shields.io/badge/tested_with-nf--test-337ab7.svg)](https://github.com/askimed/nf-test)

**nf-gwas** is a Nextflow pipeline to run biobank-scale genome-wide association study (GWAS) analyses. The pipeline automatically performs numerous pre- and post-processing steps, integrates regression modeling from the REGENIE package and currently supports single-variant, gene-based and interaction testing. All modules are structured in sub-workflows, which allows the pipeline to be extended to other methods and tools in the future. nf-gwas includes extensive reporting functionality that allows you to inspect thousands of phenotypes and navigate interactive Manhattan plots directly in the web browser.

The pipeline is tested using the unit-style testing framework [nf-test](https://github.com/askimed/nf-test) and includes a [schema definition](nextflow_schema.json) to run with **Nextflow Tower**.

14 changes: 8 additions & 6 deletions docs/getting-started.md

## Getting Started

1. Install [Nextflow](https://www.nextflow.io/docs/latest/getstarted.html#installation) (>=21.04.0)

2. Install [Docker](https://docs.docker.com/get-docker/) or [Singularity](https://sylabs.io/).

**Note** for *Windows users*: check out this [step-by-step tutorial](https://www.nextflow.io/blog/2021/setup-nextflow-on-windows.html) to set up Nextflow on your local machine.

3. Run the pipeline on a test dataset to validate your installation. If you specify the option `-profile docker`, Nextflow will automatically pick up the corresponding profile from our repository.

```
nextflow run genepi/nf-gwas -r v<[latest tag](https://github.com/genepi/nf-gwas/tags)> -profile test,<docker,singularity>
```
**Note:** Following the [latest tag](https://github.com/genepi/nf-gwas/tags) link, you will be redirected to the list of pipeline releases. Specify the latest tag in the command above, e.g. `-r v1.0.0`.
### Run the pipeline on your data
2. Run the pipeline on your data with your configuration file
```
nextflow run genepi/nf-gwas -c project.config -r v<[latest tag](https://github.com/genepi/nf-gwas/tags)> -profile <docker,singularity>
```
**Note:** The slurm profiles require that (a) Singularity is installed on all nodes and (b) a shared file system path is used as the working directory.
6 changes: 3 additions & 3 deletions docs/gwas-regenie-101/introduction.md

## Introduction

Programs to perform genome-wide association studies (GWAS) are usually run via the command line. This can be intimidating for a biologist. Take me as an example: during my bachelor's and master's I studied molecular medicine, so my formal training focused on understanding pathophysiological processes in the human body and performing wet-lab experiments; I never had to use the command line. Nevertheless, I was able to run my first GWAS using the nf-gwas pipeline.

Here, I want to first introduce this pipeline through the lens of a biologist (see section [Pipeline Overview](https://genepi.github.io/nf-gwas/gwas-regenie-101/pipeline-overview.html)) and second share with you *my setup*.
Since I am working on a Windows computer, I need to access a remote Linux server to run the pipeline. So the section [Mastering the basic tasks](https://genepi.github.io/nf-gwas/gwas-regenie-101/basic-tasks.html) will be about the kind of tasks that are *so basic that bioinformaticians don't even talk about them*. I guess this is like describing how to pipet for a trained wet-lab biologist.

However, I hope it will show you that, by following these steps, you can run your first GWAS in no time without any prior knowledge of bioinformatics :).
12 changes: 6 additions & 6 deletions docs/gwas-regenie-101/pipeline-overview.md

## Pipeline Overview

The nf-gwas pipeline performs whole genome regression modeling using [regenie](https://github.com/rgcgithub/regenie). For a deep understanding of regenie, I suggest reading [the paper by Mbatchou et al.](https://doi.org/10.1038/s41588-021-00870-7). In brief, regenie can be used for quantitative and binary traits and first builds regression models according to the leave-one-chromosome-out (LOCO) scheme. These models are then used as covariates in the second step, which tests the association of each SNP with the phenotype. The advantage is that regenie is computationally efficient and fast, meaning it can also be used on very large datasets such as the UK Biobank.
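The LOCO idea itself is simple enough to sketch in a few lines. The toy function below only illustrates the partitioning scheme — it is not regenie's implementation:

```python
# Illustrative sketch of the leave-one-chromosome-out (LOCO) scheme:
# the step-1 prediction for a given chromosome is built from all *other*
# chromosomes, so step-2 tests on that chromosome are not biased by
# including the tested SNP in its own polygenic background.

def loco_partitions(chromosomes):
    """Map each held-out chromosome to the chromosomes used to build
    its step-1 polygenic prediction."""
    return {
        held_out: [c for c in chromosomes if c != held_out]
        for held_out in chromosomes
    }

chroms = [str(c) for c in range(1, 23)]
partitions = loco_partitions(chroms)

# Chromosome 1's prediction is built from chromosomes 2..22 only:
assert "1" not in partitions["1"]
assert len(partitions["1"]) == 21
```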

### Error-prone data preparation steps are performed by the pipeline

However, before you actually perform a GWAS, you need to properly prepare your data. This includes converting file formats, filtering data and preparing phenotype and covariate files. These steps are tedious and prone to error, and can also be very time-consuming if it's your first time working with command-line programs.
Luckily, the GWAS pipeline presented here does some of the work for you and summarizes these preparation steps in a report file:

1. It validates the phenotype and (optional) covariate files that you prepared.
2. For step 1, the regenie developers recommend using directly genotyped variants that have passed quality control (QC). The pipeline performs the QC for you, based on minor allele frequency and count, genotype missingness, Hardy-Weinberg equilibrium and sample missingness. In addition, the regenie developers do not recommend using more than 1M SNPs for step 1. Therefore, the pipeline can additionally perform pruning before regenie step 1 is run. By default, certain QC thresholds are set and pruning is disabled, but of course you can adapt the QC thresholds and pruning settings.
3. In step 2, all available genotypes should be used. For example, if you imputed your data with the Michigan Imputation Server, it is in VCF format, which is not supported by regenie. The pipeline can convert your imputed VCF data into the required file format. In addition, you can set thresholds for the imputation score and the minor allele count of the imputed variants that are included in step 2.
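To give an intuition for the variant-level QC in step 1, here is a toy sketch of MAF/MAC/missingness filtering. The threshold values are made up and the Hardy-Weinberg test is omitted for brevity — check the pipeline documentation for the actual defaults:

```python
# Toy sketch of variant-level QC before regenie step 1 (not the
# pipeline's code). Genotypes are coded as 0/1/2 alt-allele counts,
# with None for missing calls. Thresholds here are illustrative.

def variant_passes_qc(genotypes, min_maf=0.01, min_mac=5, max_missing=0.1):
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if missing_rate > max_missing:          # genotype missingness filter
        return False
    alt_count = sum(called)
    mac = min(alt_count, 2 * len(called) - alt_count)  # minor-allele count
    maf = mac / (2 * len(called))                      # minor-allele frequency
    return maf >= min_maf and mac >= min_mac

# A rare variant (1 alt allele in 10 samples) fails the MAC filter:
assert not variant_passes_qc([0] * 9 + [1], min_mac=5)
# A common, fully called variant passes:
assert variant_passes_qc([0, 1, 2, 1, 0, 1, 2, 0, 1, 1])
```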

### The pipeline automatically creates Manhattan and QQ plots and annotates your results

In addition to performing these data preparation steps, the pipeline also performs the following:
2. Regenie gives you the GWAS summary statistics as a large file ending in *.regenie.gz*. If your computer does not have much RAM, loading this file into R for further analyses can take quite a long time. The pipeline therefore additionally outputs a file ending in *.filtered.annotated.txt.gz*. This file is much smaller because it only contains the SNPs that pass a minimum ‑log<sub>10</sub>(P) filter (by default 5), with the nearest genes annotated.
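The filtering idea can be sketched as follows — the column name and row layout below are illustrative, not regenie's exact output header:

```python
# Sketch of the kind of filtering that produces the much smaller
# *.filtered.annotated.txt.gz file: keep only rows whose -log10(P)
# meets a threshold (default 5, i.e. P <= 1e-5). Column names are
# illustrative placeholders.

def filter_tophits(rows, min_log10p=5):
    return [r for r in rows if r["LOG10P"] >= min_log10p]

rows = [
    {"ID": "rs1", "LOG10P": 7.3},  # P ~ 5e-8, kept
    {"ID": "rs2", "LOG10P": 2.1},  # P ~ 8e-3, dropped
    {"ID": "rs3", "LOG10P": 5.0},  # exactly at the threshold, kept
]
kept = filter_tophits(rows)
assert [r["ID"] for r in kept] == ["rs1", "rs3"]
```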

### Run pipeline with Nextflow
Last but not least, it is also important to mention that this pipeline is built with the workflow manager [Nextflow](https://www.nextflow.io/). To use the pipeline, you don't need to know how it works, let alone build one on your own, but one main advantage is helpful to know: all the software tools used in the steps described above are stored in *containers* that are downloaded when the pipeline is initiated. So no matter whether you perform the data analysis on different computers or need to rerun it in two years: as long as you use the same input data and the same pipeline configuration, you always get the same results. This increases reproducibility and saves a lot of time if you suddenly need to work with a different computer or server.
2 changes: 1 addition & 1 deletion docs/gwas-regenie-101/run-pipeline.md

### Running the nf-gwas pipeline

To run the pipeline on your data, prepare the phenotype and (optional) covariate files as described [here](https://rgcgithub.github.io/regenie/options/#input). In addition, you need the genotyping data for step 1 in bim/bed/fam format and your imputed genotypes in VCF or BGEN format. Transfer all these files using FileZilla to the folder of your choice on the server.

Now we need to prepare a configuration file for the pipeline. You can use any text editor! For example, we use the IDE [Visual Studio Code](https://code.visualstudio.com/), which has some very convenient features, including highlighting different code elements. The required and optional parameters for the configuration file are all listed in the pipeline's [parameter documentation](../params/params). To make your own config file, it is easiest to copy one of the example [config files](https://github.com/genepi/nf-gwas/blob/main/conf/test.config). Adapt all the paths and parameters to fit your data and save the file (e.g. as first-gwas.config). If you've used additional parameters, just make sure that they are within the curly brackets.
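As a rough orientation, a Nextflow config file has the shape below. The parameter names in this sketch are hypothetical placeholders — copy a real example config from the repository and take the actual names from the parameter documentation:

```groovy
// Hypothetical project.config sketch -- the parameter names below are
// placeholders, not the pipeline's real ones; see the params docs.
params {
    project             = 'first-gwas'
    phenotypes_filename = '/data/phenotypes.txt'
    // ... further genotype, covariate and QC parameters ...
}
```

Note that every `name = value` pair sits inside the `params { ... }` curly brackets, which is what the advice above about additional parameters refers to.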

6 changes: 3 additions & 3 deletions docs/index.md
This cloud-ready GWAS pipeline allows you to run **single variant tests**, **gene-based tests** and **interaction testing** using [REGENIE](https://github.com/rgcgithub/regenie) in an automated and reproducible way.

For single variant tests, the pipeline works with BGEN (e.g. from UK Biobank) or VCF files (e.g. from [Michigan Imputation Server](https://imputationserver.sph.umich.edu/)). For gene-based tests, we currently support BED files as an input.
The output files of the pipeline include the results of the association tests (in tabix-indexed format, which works with e.g. LocusZoom out of the box), annotated top hits per locus and an interactive HTML report with summary statistics and plots.

The single-variant pipeline currently includes the following steps:


3. Prune micro-array data using [plink2](https://www.cog-genomics.org/plink/2.0/) (optional).

4. Filter micro-array data using plink2 based on MAF, MAC, HWE, genotype missingness and sample missingness.

5. Run [regenie](https://github.com/rgcgithub/regenie) and index (tabix) results to use with LocusZoom.

6. Parse regenie log and create summary statistics.

