aviti-fastq

This simple pipeline is designed to demultiplex reads generated by an in-house Element Biosciences AVITI sequencer and move the resulting fastq files into a structured data directory. This approach is especially helpful for large sequencing projects involving multiple sequencing runs with hundreds of samples. For example, for projects with many cell lines, each with a few distinct cell line modifications, it is helpful to organize output files into informative subdirectories for ease of exploration. To do this, the pipeline relies on a comma-separated metadata file, which is described in detail below.

Preparing the inputs

Sample metadata

The pipeline relies on a sample metadata file to determine which sequencing directories to process and where to move the resulting fastq files. In general, it is best to keep any and all metadata you may find handy in the sample sheet, though only some columns are used by this pipeline. Use underscores rather than spaces to delimit words in column names, and only include samples that have not yet been processed with this pipeline.

The sample metadata must be in .csv format and include a few required columns; an example file can be found in this repo, and a minimal sketch follows the list below. The following columns are strictly required to run this pipeline:

  • sample_basename: This name should be unique for each distinct biological sample/replicate (e.g. WT_1, WT_2, KO_1, KO_2). Avoid using R1 or R2 at the end of a sample_basename to reflect biological replicates, as fastq file names use R1 and R2 to reflect the separate read pairs for a given sample. Technical replicates (i.e. running one sample over multiple sequencing runs) must have the same sample_basename value and must instead have unique values for seq_replicate and seq_run_name.
  • seq_replicate: This should be a unique value for each technical sequencing replicate of a given sample. For example, if one sequences a pool of sample libraries over three sequencing runs, the sample_basename for each technical replicate will be the same, but the value for seq_replicate must differ (e.g. A, B, and C). Each replicate's seq_run_name will also differ, and the seq_replicate value will be appended to the resulting fastq file names. While these can be any value, I'd recommend successive capital letters to keep the resulting file names short.
  • seq_run_name: The exact name of the sequencing run directory which contains the raw data for the run. Of note, the directory which hosts the raw run data is supplied in config/config.yaml.
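
As a minimal sketch, a metadata file might look like the following. The cell_line and condition columns are hypothetical extras used only for organizing output (see the configuration section below), and all values shown are made up; the example file in this repo is the authoritative template.

sample_basename,seq_replicate,seq_run_name,cell_line,condition
WT_1,A,20250102_run1,HEK293T,untreated
WT_1,B,20250115_run2,HEK293T,untreated
KO_1,A,20250102_run1,HEK293T,treated
KO_1,B,20250115_run2,HEK293T,treated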

Configuration file

Within config.yaml, the user can edit key parameters detailed therein. Of note, this is where one can (optionally) specify additional column names that will be used to generate subdirectories within the data output directory, to facilitate organization. The names listed must exactly match the column names in the sample metadata file you provide. The order also matters: the first column name listed becomes the outermost subdirectory, the second is nested within it, and so on.
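
As a rough sketch only (the key names below are assumptions for illustration; the comments in the config.yaml shipped with this repo are authoritative), the relevant entries might look something like:

samples: config/meta.csv            # path to the sample metadata file
run_dir: /path/to/raw/runs          # directory hosting the raw sequencing run directories
data_dir: /path/to/config_data_dir  # structured output data directory
subdir_columns:                     # optional; order sets subdirectory nesting
  - cell_line
  - condition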

Configuration parameters can also be specified at the command line (e.g. --config samples=<path/to/meta.csv>), which will override the corresponding values in config.yaml for that run.
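
For example, to point a run at a different metadata file without editing config.yaml (new_meta.csv is a hypothetical path):

snakemake --workflow-profile workflow/profile/default --config samples=config/new_meta.csv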

Running the pipeline

Setting up environment

To run on an HPC using Slurm, one can run the pipeline within either an interactive session (ideally inside a tmux window) or via job submission. Snakemake itself is lightweight and submits all jobs to compute nodes, so running it within an interactive session is fine and makes it easy to check on progress. While the author of the Slurm plugin recommends running from the head node, not all clusters allow this. One will need a conda environment with four packages: snakemake, snakemake-executor-plugin-slurm, snakedeploy, and snakemake-wrapper-utils. This environment can be created with:

conda create -n snakemake -c bioconda snakemake snakemake-executor-plugin-slurm snakedeploy snakemake-wrapper-utils
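
Once created, activate the environment before deploying or running the pipeline:

conda activate snakemake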

If your HPC has a particularly outdated conda version installed for all users, one can add anaconda::conda to the list of packages to use the faster mamba solver. If an executor other than Slurm is needed, other plugins are available (be sure to update the executor specified in your workflow profile accordingly).

Deploying this pipeline

Deploy this pipeline into a subdirectory called snakemake inside the output data directory specified in config.yaml. For example, you would navigate to /path/to/config_data_dir/snakemake, then use snakedeploy to deploy the pipeline there, keeping it associated with the data it will process. Be sure to specify the version number being used, for example:

snakedeploy deploy-workflow --tag <version_number> https://github.com/jrzoe/aviti-fastq.git <dest-dir> 

Profiles

A profile for executing this pipeline on an HPC using Slurm is provided in the workflow/profile/default directory. Reasonable time and resource limits are set, but be sure to adjust partitions based on your cluster.
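
For orientation, a Snakemake v8 workflow profile for the Slurm executor plugin is a config.yaml along these lines (the values below are placeholders for illustration, not the settings shipped in workflow/profile/default):

executor: slurm
jobs: 50                    # maximum concurrent Slurm jobs
default-resources:
  slurm_partition: general  # adjust to a partition on your cluster
  runtime: 120              # minutes
  mem_mb: 4000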

Running

After activating the conda environment, navigate to the root directory of this pipeline. Note that the bases2fastq software is provided as a Docker image, which can easily be run on most HPCs using Singularity. Be sure to specify any directories that need to be bound to the container in the command below.

snakemake -n --workflow-profile workflow/profile/default --use-singularity --singularity-args "--bind </path/to/dir1>,</path/to/dir2>"

In theory, the bulky --singularity-args flag could be specified in the config file, but doing so throws an error about "read-only" file systems. This appears to be a bug (as of snakemake v8.27), so for now the flag must be specified at the command line.

If the dry run looks appropriate, remove the -n flag from the above command and the run will begin.
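
In other words, the full run is launched with:

snakemake --workflow-profile workflow/profile/default --use-singularity --singularity-args "--bind </path/to/dir1>,</path/to/dir2>"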
