
Commit

Updated documentation (#94)
* getting started and run pipeline

* Update documentation

* Fix link to tags

---------

Co-authored-by: Sebastian Schoenherr <[email protected]>
salvidm and seppinho authored Jan 30, 2024
1 parent 3960887 commit 9a73387
Showing 6 changed files with 24 additions and 21 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -3,7 +3,7 @@
[![nf-gwas](https://github.com/genepi/nf-gwas/actions/workflows/ci-tests.yml/badge.svg)](https://github.com/genepi/nf-gwas/actions/workflows/ci-tests.yml)
[![nf-test](https://img.shields.io/badge/tested_with-nf--test-337ab7.svg)](https://github.com/askimed/nf-test)

**nf-gwas** is a Nextflow pipeline to run biobank-scale genome-wide association studies (GWAS) analysis. The pipeline automatically performs numerous pre- and post-processing steps, integrates regression modeling from the REGENIE package and currently supports single-variant, gene-based and interaction testing. All modules are structured in sub-workflows which allows to extend te pipeline to other methods and tools in future. nf-gwas includes an extensive reporting functionality that allows to inspect thousands of phenotypes and navigate interactive Manhattan plots directly in the web browser.
**nf-gwas** is a Nextflow pipeline to run biobank-scale genome-wide association studies (GWAS) analysis. The pipeline automatically performs numerous pre- and post-processing steps, integrates regression modeling from the REGENIE package and currently supports single-variant, gene-based and interaction testing. All modules are structured in sub-workflows which allows to extend the pipeline to other methods and tools in future. nf-gwas includes an extensive reporting functionality that allows to inspect thousands of phenotypes and navigate interactive Manhattan plots directly in the web browser.

The pipeline is tested using the unit-style testing framework [nf-test](https://github.com/askimed/nf-test) and includes a [schema definition](nextflow_schema.json) to run with **Nextflow Tower**.

15 changes: 9 additions & 6 deletions docs/getting-started.md
@@ -6,15 +6,18 @@

## Getting Started

1. Install [Nextflow](https://www.nextflow.io/docs/latest/getstarted.html#installation) (>=21.04.0).
1. Install [Nextflow](https://www.nextflow.io/docs/latest/getstarted.html#installation) (>=22.10.4).

2. Install [Docker](https://docs.docker.com/get-docker/) or [Singularity](https://sylabs.io/).
2. Install [Docker](https://docs.docker.com/get-docker/) or [Singularity](https://sylabs.io/).

3. Run the pipeline on a test dataset using Docker to validate your installation.
**Note for Windows users**: This [step-by-step tutorial](https://www.nextflow.io/blog/2021/setup-nextflow-on-windows.html) helps you set up Nextflow on your local machine.

3. Run the pipeline on a test dataset to validate your installation.

```
nextflow run genepi/nf-gwas -r v1.0.0 -profile test,<docker,singularity>
nextflow run genepi/nf-gwas -r <latest-tag> -profile test,docker
```
**Note:** Replace `<latest-tag>` with the actual version you want to run (e.g. `-r v1.0.0`); all released versions are listed [here](https://github.com/genepi/nf-gwas/tags).
### Run the pipeline on your data
@@ -36,9 +36,9 @@ nav_order: 2
```
2. Run the pipeline with your configuration file
2. Run the pipeline on your data with your configuration file
```
nextflow run genepi/nf-gwas -c project.config -r v1.0.0 -profile <docker,singularity>
nextflow run genepi/nf-gwas -c project.config -r <latest-tag> -profile <docker,singularity>
```
**Note:** The slurm profiles require that (a) Singularity is installed on all nodes and (b) a shared file system path is used as the working directory.
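For orientation, a cluster profile of this kind can be sketched with standard Nextflow configuration options. The snippet below is a hypothetical sketch, not the pipeline's actual profile definition:

```
// Hypothetical sketch of a slurm profile (standard Nextflow options;
// the pipeline's actual profile definition may differ)
profiles {
    slurm {
        process.executor    = 'slurm'   // submit each task as a slurm job
        singularity.enabled = true      // run tasks inside Singularity containers
    }
}
```

Because each task may be scheduled on a different node, the Nextflow work directory must live on a file system shared by all nodes.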
6 changes: 3 additions & 3 deletions docs/gwas-regenie-101/introduction.md
@@ -7,9 +7,9 @@

## Introduction

Programs to perform genome-wide association studies (GWAS) are usually run via the command line. This can be intimidating for a biologist. Take me as an example: In my bachelor and master I've studied molecular medicine. So my formal training focused on understanding pathophysiological processes in the human body and how to perform wet-lab experiments, I never had to use the command line. Nevertheless, I recently ran my first GWAS using the nf-gwas pipeline.
Programs to perform genome-wide association studies (GWAS) are usually run via the command line. This can be intimidating for a biologist. Take me as an example: during my bachelor's and master's I studied molecular medicine. My formal training therefore focused on understanding pathophysiological processes in the human body and on performing wet-lab experiments; I never had to use the command line. Nevertheless, I was able to run my first GWAS using the nf-gwas pipeline.

Here, I want to first introduce this pipeline through the lens of a biologist and second share with you *my setup*.
Since I am working on a Windows computer, I need to access a remote Linux server to run the pipeline. So the first section will be about the kind of tasks that are *so basic that bioinformaticians don't even talk about them*. I guess this is like describing how to pipet for a trained wet-lab biologist.
Here, I want to first introduce this pipeline through the lens of a biologist (see section [Pipeline Overview](https://genepi.github.io/nf-gwas/gwas-regenie-101/pipeline-overview.html)) and second share with you *my setup*.
Since I am working on a Windows computer, I need to access a remote Linux server to run the pipeline. So the section [Mastering the basic tasks](https://genepi.github.io/nf-gwas/gwas-regenie-101/basic-tasks.html) will be about the kind of tasks that are *so basic that bioinformaticians don't even talk about them*. I guess this is like describing how to pipet for a trained wet-lab biologist.

However, I hope it will show you that, by following these steps, you can run your first GWAS in no time, without any prior knowledge of bioinformatics :).
12 changes: 6 additions & 6 deletions docs/gwas-regenie-101/pipeline-overview.md
@@ -7,16 +7,16 @@

## Pipeline Overview

The nf-gwas pipeline performs whole genome regression modeling using [regenie](https://github.com/rgcgithub/regenie). For profound details on regenie, I suggest to read [the paper by Mbatchou et al.](https://doi.org/10.1038/s41588-021-00870-7) but it can be used for quantitative and binary traits and first builds regression models according to the leave-one-chromosome-out (LOCO) scheme that are then used in the second step (which tests the association of each SNP with the phenotype) as covariates. The advantage is that it is computationally efficient and fast meaning that it can also be used on very large datasets such as UK Biobank.
The nf-gwas pipeline performs whole genome regression modeling using [regenie](https://github.com/rgcgithub/regenie). For a deep understanding of regenie, I suggest reading [the paper by Mbatchou et al.](https://doi.org/10.1038/s41588-021-00870-7). In brief, regenie can be used for quantitative and binary traits; it first builds regression models according to the leave-one-chromosome-out (LOCO) scheme. These models are then used as covariates in the second step, which tests the association of each SNP with the phenotype. The advantage is that it is computationally efficient and fast, meaning that it can also be used on very large datasets such as the UK Biobank.

### Error-prone data preparation steps are performed by the pipeline

However, before you actually perform a GWAS, you need to properly prepare your data including converting file formats, filtering data and correct preparation of phenotypes and covariates. These steps are tedious and prone to error - and can also be very time consuming if it's your first time working with command line programs.
Luckily, the GWAS pipeline presented here does some of the work for you and summarizes these preparation steps in the end in a report file:
However, before you actually perform a GWAS, you need to properly prepare your data. This includes converting file formats, filtering data and preparing phenotype and covariate files. These steps are tedious and prone to error - and can also be very time consuming if it's your first time working with command line programs.
Luckily, the GWAS pipeline presented here does some of the work for you and summarizes these preparation steps in a report file:

1. It validates the phenotype and (optional) covariate files that you prepared.
2. For step 1 regenie developers recommend to use directly genotyped variants that have passed quality control (QC). The pipeline performs the QC for you, based on minor allele frequency and count, genotype missingness, Hardy-Weinberg equilibrium and sample missingness. In addition, the regenie developers do not recommend to use >1M SNPs for step 1. Therefore, the pipeline can additionally perform pruning before step 1 of regenie is run. By default, certain QC thresholds are set and pruning is disabled but of course you can adapt the QC thresholds and pruning settings.
3. In step 2 all available genotypes should be used. If you have for example imputed your data with the Michigan Imputation Server, it is in the VCF format, that is not supported by regenie. The pipeline can convert your VCF imputed data into the correct file format. In addition, you can also set a threshold for the imputation score and the minor allele count for the imputed variants that are included in step 2.
2. For step 1, the regenie developers recommend using directly genotyped variants that have passed quality control (QC). The pipeline performs the QC for you, based on minor allele frequency and count, genotype missingness, Hardy-Weinberg equilibrium and sample missingness. In addition, the regenie developers do not recommend using >1M SNPs for step 1. Therefore, the pipeline can additionally perform pruning before regenie step 1 is run. By default, certain QC thresholds are set and pruning is disabled, but of course you can adapt the QC thresholds and pruning settings.
3. In step 2, all available genotypes should be used. For example, if you imputed your data with the Michigan Imputation Server, it is in the VCF format, which is not supported by regenie. The pipeline can convert your VCF imputed data into the required file format. In addition, you can also set a threshold for the imputation score and the minor allele count for the imputed variants that are included in step 2.
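The QC and pruning behaviour described above can be adjusted in the configuration file. The snippet below is purely illustrative; the parameter names and values shown are assumptions and should be verified against the pipeline's parameter documentation:

```
params {
    // Illustrative QC thresholds for regenie step 1 (names assumed, verify in the docs)
    qc_maf        = '0.01'    // minor allele frequency
    qc_mac        = '100'     // minor allele count
    qc_geno       = '0.1'     // genotype missingness
    qc_hwe        = '1e-15'   // Hardy-Weinberg equilibrium p-value
    qc_mind       = '0.1'     // sample missingness
    prune_enabled = true      // pruning is disabled by default
}
```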

### The pipeline automatically creates Manhattan and QQ plots and annotates your results

@@ -26,4 +26,4 @@ In addition to performing these data preparation steps, the pipeline also perfor
2. Regenie gives you the GWAS summary statistics as a large file ending in *.regenie.gz*. If your computer does not have much RAM, loading this file into R for further analyses can take quite long. The pipeline therefore additionally outputs a file ending in *.filtered.annotated.txt.gz*. This file is much smaller because it only contains the SNPs filtered for a minimum ‑log<sub>10</sub>(P) (by default 5), and in addition the nearest genes have been annotated to these SNPs.

### Run pipeline with Nextflow
And last but not least it is also important to mention that this pipeline is built with the workflow manager [Nextflow](https://www.nextflow.io/). To use the pipeline, you don't need to know how it works let alone build one on your own but I think one important advantage is helpful to know: The software that is needed for all the steps described above is downloaded when the pipeline is initiated and the software is stored in a *container*. So no matter if you perform the data analysis on different computers or if you need to rerun it in two years: As long as you use the same input data and the same configuration of the pipeline, you always get the same results. This further increases reproducibility and saves a lot of time if you suddenly need to work with a different computer or server.
Last but not least, it is also important to mention that this pipeline is built with the workflow manager [Nextflow](https://www.nextflow.io/). To use the pipeline, you don't need to know how it works, let alone build one on your own, but there is one main advantage that I think is helpful to know: all the software used in the steps described above is stored in *containers* and downloaded when the pipeline is initiated. So no matter if you perform the data analysis on different computers or if you need to rerun it in two years, as long as you use the same input data and the same configuration of the pipeline, you always get the same results. This further increases reproducibility and saves a lot of time if you suddenly need to work with a different computer or server.
4 changes: 2 additions & 2 deletions docs/gwas-regenie-101/run-pipeline.md
Expand Up @@ -9,9 +9,9 @@ nav_order: 3

To run the pipeline on your data, prepare the phenotype and (optional) covariate files as described [here](https://rgcgithub.github.io/regenie/options/#input). In addition, you need the genotyping data for step 1 in bim/bed/fam format and your imputed genotypes in VCF or BGEN format. Transfer all these files using FileZilla to the folder of your choice on the server.

Now, you have to prepare a configuration file for the pipeline. For this, you can use any text editor but for example the text editor [Atom](https://atom.io/) is very convenient since it can also highlight different kinds of codes etc. The required and optional parameters for the configuration file are all listed [here](../params/params) of the pipeline. To make your own config file, it is the easiest to copy one of the exemplary [config files](https://github.com/genepi/nf-gwas/tree/main/conf/tests). Adapt all the paths and parameters to fit your data and save the file (e.g. as: first-gwas.config). If you added additional parameters, just make sure, that they are within the curly brackets.
Now, we need to prepare a configuration file for the pipeline. You can use any text editor! For example, we use the IDE [Visual Studio Code](https://code.visualstudio.com/), which has some very convenient features, including highlighting different code elements. The required and optional parameters for the configuration file are all listed [here](../params/params). To make your own config file, it is easiest to copy one of the example [config files](https://github.com/genepi/nf-gwas/blob/main/conf/test.config). Adapt all the paths and parameters to fit your data and save the file (e.g. as `first-gwas.config`). If you've added additional parameters, just make sure that they are within the curly brackets.
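As a rough orientation, a minimal configuration file modeled on the example test config could look like the sketch below. All paths are placeholders, and the parameter names should be double-checked against the parameter documentation:

```
params {
    project                  = 'first-gwas'
    // placeholder paths - adapt to your own data
    genotypes_array          = '/home/myHome/GWAS/genotypes/example.{bim,bed,fam}'
    genotypes_imputed        = '/home/myHome/GWAS/imputed_data/example.vcf.gz'
    genotypes_imputed_format = 'vcf'
    genotypes_build          = 'hg19'
    phenotypes_filename      = '/home/myHome/GWAS/phenotype.txt'
    phenotypes_columns       = 'Y1,Y2'
    phenotypes_binary_trait  = false
    regenie_test             = 'additive'
}
```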

Just one possibly helpful fact on the side here: as indicated on the GitHub repository, the genotypes have to be a single merged file but the imputed genotypes can also be one file per chromosome. If we have them in single files per chromosome we can put the path for example as follows into the configuration file `/home/myHome/GWAS/imputed\_data/\*vcf.gz`. The asterisk (\*) is a wildcard. So it will take all the files from the imputed\_data folder that end with `vcf.gz`.
Useful tip: as indicated in the GitHub repository, the genotypes have to be a single merged file, but the imputed genotypes can also be one file per chromosome. If they come as one file per chromosome, we can set the path in the configuration file, for example, to `/home/myHome/GWAS/imputed_data/*vcf.gz`. The asterisk (`*`) is a wildcard, so all files in the `imputed_data` folder ending in `vcf.gz` will be used.

Now you can transfer the file via FileZilla to your folder of choice on the server (as an example let's say we put the `first-gwas.config` into the folder `/home/myHome/GWAS`).

6 changes: 3 additions & 3 deletions docs/index.md
@@ -22,7 +22,7 @@ A Nextflow pipeline to perform genome-wide association studies (GWAS).
This cloud-ready GWAS pipeline allows you to run **single variant tests**, **gene-based tests** and **interaction testing** using [REGENIE](https://github.com/rgcgithub/regenie) in an automated and reproducible way.

For single variant tests, the pipeline works with BGEN (e.g. from UK Biobank) or VCF files (e.g. from [Michigan Imputation Server](https://imputationserver.sph.umich.edu/)). For gene-based tests, we currently support BED files as an input.
The pipeline outputs association results (tabixed, works with e.g. LocusZoom out of the box), annotated loci tophits and an interactive HTML report provding statistics and plots.
The output files of the pipeline include the association test results (in tabix-indexed format, which works with e.g. LocusZoom out of the box), annotated top hits per locus and an interactive HTML report with summary statistics and plots.

The single-variant pipeline currently includes the following steps:

@@ -33,9 +33,9 @@ The single-variant pipeline currently includes the following steps:

3. Prune micro-array data using [plink2](https://www.cog-genomics.org/plink/2.0/) (optional).

4. Filter micro-array data using plink2 based on MAF, MAC, HWE, genotype missingess and sample missingness.
4. Filter micro-array data using plink2 based on MAF, MAC, HWE, genotype missingness and sample missingness.

5. Run [regenie](https://github.com/rgcgithub/regenie) and tabix results to use with LocusZoom.
5. Run [regenie](https://github.com/rgcgithub/regenie) and index (tabix) results to use with LocusZoom.

6. Parse regenie log and create summary statistics.

