This simple pipeline demultiplexes reads generated by an in-house Element Biosciences AVITI sequencer and moves the resulting fastq files into a structured data directory. This approach is especially helpful for large sequencing projects involving multiple sequencing runs with hundreds of samples. For example, in a project with many cell lines, each carrying a few distinct modifications, it is helpful to organize output files into informative subdirectories for ease of exploration. To do this, the pipeline relies on a comma-separated metadata file, which is described in detail below.
The pipeline relies on a sample metadata file to determine which sequencing directories to process and where to move the resulting fastq files. In general, it is best to keep any metadata you may find handy in the sample sheet, but only a subset of columns is used by this pipeline. Use underscores rather than spaces to delimit words in column names. In addition, only include samples that have not yet been processed with this pipeline.
The sample metadata must be in .csv format and has a few required columns; an example file is included in this repo. The following columns are strictly required to run this pipeline (see the sketch after this list):
- `sample_basename`: This name should be unique for each distinct biological sample/replicate (e.g. WT_1, WT_2, KO_1, KO_2). Avoid ending a `sample_basename` with R1 or R2 to denote biological replicates, as fastq file names use R1 and R2 to denote the separate read pairs for a given sample. Technical replicates (i.e. one sample run over multiple sequencing runs) must share the same `sample_basename` value and must instead have unique values for `seq_replicate` and `seq_run_name`.
- `seq_replicate`: A unique value for each technical sequencing replicate of a given sample. For example, if a pool of sample libraries is sequenced over three sequencing runs, the `sample_basename` for each technical replicate will be the same, but the value for `seq_replicate` must differ (A, B, and C). Likewise, their values for `seq_run_name` will differ. The `seq_replicate` value is appended to the resulting fastq file names, so while any value will work, successive capital letters keep the file names short.
- `seq_run_name`: The exact name of the sequencing run directory which contains the raw data for the run. Of note, the directory that hosts the raw run data is supplied in `config/config.yaml`.
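As an illustration, a minimal sample sheet covering only the required columns might look like the following (the sample names and run directory names here are hypothetical; the example file in this repo is authoritative):

```
sample_basename,seq_replicate,seq_run_name
WT_1,A,20240501_AVITI_RUN01
WT_1,B,20240615_AVITI_RUN02
KO_1,A,20240501_AVITI_RUN01
KO_1,B,20240615_AVITI_RUN02
```

Here, WT_1 and KO_1 each appear twice because they were sequenced across two runs; the shared `sample_basename` combined with distinct `seq_replicate` and `seq_run_name` values marks them as technical replicates.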
Within `config.yaml`, the user can edit the key parameters detailed therein. Of note, this is where one can (optionally) specify additional column names that will be used to generate subdirectories within the data output directory, to facilitate organization. The names listed must exactly match the column names in the sample metadata file you provide. The order also matters: the first column name listed becomes the first subdirectory level created, the second becomes the second level, and so on.
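As a rough sketch of what such a configuration might look like (all key names below other than `samples` are hypothetical; the `config/config.yaml` shipped with the pipeline documents the actual parameters):

```
# Hypothetical sketch of config/config.yaml; consult the file in the repo
# for the actual parameter names and their in-line documentation.
samples: config/sample_metadata.csv   # path to the sample metadata .csv
run_dir: /path/to/aviti/runs          # hypothetical: directory hosting the raw run data
data_dir: /path/to/config_data_dir    # hypothetical: output data directory
subdir_columns:                       # hypothetical: metadata columns used to nest output subdirectories
  - cell_line
  - modification
```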
Configuration parameters can also be specified at the command line (e.g. `--config samples=<path/to/meta.csv>`), which will override the corresponding values in config.yaml for that run.
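For example, a dry run pointed at a different sample sheet (the path here is illustrative) could be launched with:

```
# Override the samples entry from config.yaml for this invocation only
snakemake -n --config samples=config/metadata_batch2.csv
```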
To run on an HPC using Slurm, the pipeline can be launched either from an interactive session (ideally inside a tmux window) or via job submission. Snakemake itself is lightweight and submits all jobs to compute nodes, so running it within an interactive session is fine and makes it easy to check on progress. While the author of the Slurm plugin recommends running from the head node, not all clusters allow this. You will need a conda environment with four packages: snakemake, snakemake-executor-plugin-slurm, snakedeploy, and snakemake-wrapper-utils. This environment can be created with:
```
conda create -n snakemake -c bioconda snakemake snakemake-executor-plugin-slurm snakedeploy snakemake-wrapper-utils
```

If your HPC has a particularly outdated conda version installed for all users, add anaconda::conda to the list of packages to use the faster mamba solver. If an executor other than Slurm is needed, other plugins are available (be sure to update the executor specified in your workflow profile accordingly).
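Once created, the environment just needs to be activated before running any of the commands below; a quick version check confirms the installs:

```
conda activate snakemake
snakemake --version     # the notes below assume a Snakemake v8.x release
snakedeploy --version
```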
Deploy this pipeline into the output data directory specified in the config.yaml file, in a subdirectory called snakemake. For example, you would navigate to /path/to/config_data_dir/snakemake, then use snakedeploy to deploy the pipeline there and keep it associated with the data it will process. Be sure to specify the version number being used.
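Preparing and entering that directory (the data directory path is illustrative) might look like:

```
# Create and enter the deployment directory inside the configured data directory
mkdir -p /path/to/config_data_dir/snakemake
cd /path/to/config_data_dir/snakemake
```

Then, from within that directory, deploy the workflow, for example: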
```
snakedeploy deploy-workflow --tag <version_number> https://github.com/jrzoe/aviti-fastq.git <dest-dir>
```

A profile for executing this pipeline on an HPC using Slurm is provided in the workflow/profile/default directory. Reasonable time and resource limits are set, but be sure to adjust partitions based on your cluster.
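As a rough sketch of the kind of settings such a profile contains (the values below are hypothetical; edit the actual workflow/profile/default/config.yaml rather than copying this verbatim), the partition adjustment might look like:

```
# Hypothetical excerpt of a Slurm workflow profile (workflow/profile/default/config.yaml)
executor: slurm
jobs: 50                       # maximum number of concurrent Slurm jobs
default-resources:
  slurm_partition: "general"   # change to a partition that exists on your cluster
  runtime: 120                 # minutes
  mem_mb: 8000
```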
After activating the conda environment, navigate to the root directory of this pipeline. Note that the bases2fastq software is provided as a Docker image, which can easily be run on most HPCs using Singularity; be sure to specify any directories that need to be bound to the container in the command below.
```
snakemake -n --workflow-profile workflow/profile/default --use-singularity --singularity-args "--bind </path/to/dir1>,</path/to/dir2>"
```

In theory, the bulky --singularity-args flag could be specified in the config file, but doing so throws an error pertaining to "read-only" file systems. This appears to be a bug (as of snakemake v8.27), so for now the flag must be specified at the command line.
If the dry run looks appropriate, remove the -n flag from the above command and the run will begin.
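In other words, the full run (with the same illustrative bind paths as above) is launched with:

```
snakemake --workflow-profile workflow/profile/default --use-singularity --singularity-args "--bind </path/to/dir1>,</path/to/dir2>"
```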