
Curio Seeker

Curio Seeker is a best-practice analysis pipeline for the Curio Seeker Kit.

Tools used for major steps:

  • Read 1 handling: Curio Seeker custom code, filtering read pairs with the correct barcode structure in read 1
  • Barcode matching: Curio Seeker custom code, matching bead barcodes found in sequencing against the whitelist ${Tile_ID}_BeadBarcodes.txt, with Hamming distance <= 2
  • Aligner: STAR
  • Feature extraction: featureCounts
  • UMI deduplication: UMI-tools, directional, Hamming distance < 2
  • Analysis: Seurat
  • Analysis: Anndata

Installation

  1. Install the following to your local environment, if not already installed: Nextflow, and either Docker or Singularity (the pipeline can run with either container engine).

  2. In some special cases you may also need to install:
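
The exact list of additional packages depends on your platform. As a quick sanity check, confirm that the core prerequisites are available on your PATH before continuing (a minimal sketch, assuming a bash shell; run only the line for the container engine you plan to use):

    nextflow -version          # Nextflow itself
    java -version              # Java runtime required by Nextflow
    singularity --version      # if using Singularity
    docker --version           # if using Docker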

Setup

  1. Download the latest stable release (v2.5.0) of the Nextflow pipeline provided by Curio to your local environment:

    wget https://curioseekerbioinformatics.s3.us-west-1.amazonaws.com/CurioSeeker_v2.5.0/curioseeker-2.5.0.tar.gz -O - | \
    tar -xzf -
    cd curioseeker-*
  2. Pull an image of curio-seeker-pipeline into your local environment:

    • Public Singularity container:

      cd $HOME
      mkdir -p .singularity
      wget https://curioseekerbioinformatics.s3.us-west-1.amazonaws.com/CurioSeeker_v2.5.0/curio-seeker-singularity-2024.02.22.sif -P .singularity/
    • Or pull Public Docker container (sudo privileges may be required):

      docker pull curiobioinformatics/curio-seeker-pipeline:2024.02.22
  3. Edit the nextflow.config (in the curioseeker-2.5.0 script folder)

    • Edit parameter 'curio_seeker_singularity'

      If using singularity:

      curio_seeker_singularity = 'file:///$HOME/.singularity/curio-seeker-singularity-2024.02.22.sif'

      If using docker:

      No modifications are needed.

Platform specific nextflow configuration

  • If you are using a single-server Linux workstation:

    No additional configuration needed. Skip this entire section.

  • If you are using an HPC:

    A config file specific to your platform is needed when triggering the pipeline. Follow the instructions below to construct a config file specific to your HPC.

    1. Choose the example config that matches your scheduler: slurm.config for Slurm or sge.config for SGE.

    2. Edit the example slurm.config or sge.config:

      • For each process label (i.e. preceded with 'withLabel:', such as 'process_medium' shown in the example below), identify an available partition on your HPC whose memory and number of CPUs most closely match the memory and cpus values defined for that process.
      withLabel:process_medium {
          cpus    = 16
          memory  = ''
          queue   = ''
          clusterOptions = '--constraint m5a.2xlarge'
      }
      • Modify the value of cpus to the number of CPUs associated with the identified HPC partition. Quotation marks are NOT needed.
      • Leave the value of memory as an empty string.
      • Modify the value of queue to the name of the identified HPC partition. Quotation marks are required.
      • Modify the value of clusterOptions to reference your partition name. Quotation marks are required. An illustrative edited label is shown below.
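
      For illustration only, a process_medium label edited for a hypothetical Slurm partition named 'medium-mem' (the partition name and clusterOptions value below are placeholders, not values from your HPC) might look like:

      withLabel:process_medium {
          cpus    = 16
          memory  = ''
          queue   = 'medium-mem'
          clusterOptions = '--partition=medium-mem'
      }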

Reference Genome

All built references that are compatible with the pipeline have the same folder structure, as shown below.

Mus_musculus/
 └── Ensembl
     └── GRCm38
         ├── Annotation
         │   └── Genes
         │       ├── genes.gtf
         │       ├── mt_genes.txt
         │       └── rRNA_genes.txt
         └── Sequence
             ├── STARIndex
             │   ├── chrLength.txt
             │   ├── chrNameLength.txt
             │   ├── chrName.txt
             │   ├── chrStart.txt
             │   ├── exonGeTrInfo.tab
             │   ├── exonInfo.tab
             │   ├── geneInfo.tab
             │   ├── Genome
             │   ├── genomeParameters.txt
             │   ├── SA
             │   ├── SAindex
             │   ├── sjdbInfo.txt
             │   ├── sjdbList.fromGTF.out.tab
             │   ├── sjdbList.out.tab
             │   └── transcriptInfo.tab
             └── WholeGenomeFasta
                 └── genome.fa
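
To confirm that a downloaded or custom-built reference matches this layout, you can list the expected files (a sketch assuming the mouse GRCm38 reference shown above; substitute your own species and assembly folders):

    ls Mus_musculus/Ensembl/GRCm38/Annotation/Genes/             # genes.gtf, mt_genes.txt, rRNA_genes.txt
    ls Mus_musculus/Ensembl/GRCm38/Sequence/STARIndex/           # STAR index files (Genome, SA, SAindex, ...)
    ls Mus_musculus/Ensembl/GRCm38/Sequence/WholeGenomeFasta/    # genome.fa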

To generate a reference:

  • Option A: Download genome from Curio. Download a prebuilt reference (built with STAR 2.6.1d) in the structure shown above from the Curio Knowledgebase. Each pre-built reference is identified in the pipeline by its Reference ID (e.g. GRCm38), which is specified in the sample sheet. Download by clicking on the link below or by using the CLI:

    wget https://curioseekerbioinformatics.s3.us-west-1.amazonaws.com/references/Mus_musculus.tar.gz -O - | tar -xzf -

    Then edit the genome column in the sample sheet to match the selected reference ID (e.g. GRCm38)

  • Option B: Download genome from iGenomes

    • Download a prebuilt reference (built with STAR 2.6.1d) in the structure shown above from iGenomes.
    • mt_genes.txt and rRNA_genes.txt can be created from the GTF using the commands below.
    grep "^${Mitochondrial_chromosome_name}" /path/to/genes.gtf | grep -oP "gene_name \"\K[^\"]+" | sort | uniq > mt_genes.txt
    grep rRNA /path/to/genes.gtf | grep -oP "gene_name \"\K[^\"]+" | sort | uniq > rRNA_genes.txt
  • Option C: Build a custom reference

    • Follow the instructions on building a custom reference, then modify the sample sheet with the correct Reference ID (e.g. GRCm38).

    • For ALL reference genome options, check whether the Reference ID you named your new reference (e.g. GRCm38) already exists in /conf/igenomes.config.

      • If yes, simply change genome in the sample sheet to that assembly name.
      • If no, add an entry in /conf/igenomes.config with the assembly name you chose (see the sketch after this list) and change genome in the sample sheet to that name.
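
    As an illustration of such an entry, the sketch below follows the folder layout shown above for a hypothetical assembly named 'MyGenome1'. The exact key names must match those used by the existing entries in conf/igenomes.config, so copy an existing entry (e.g. GRCm38) and adjust only the assembly name and paths; the key names shown here are assumptions for illustration:

      'MyGenome1' {
          // key names below are illustrative; mirror the keys of the existing GRCm38 entry
          fasta = "${params.igenomes_base}/My_species/Ensembl/MyGenome1/Sequence/WholeGenomeFasta/genome.fa"
          gtf   = "${params.igenomes_base}/My_species/Ensembl/MyGenome1/Annotation/Genes/genes.gtf"
          star  = "${params.igenomes_base}/My_species/Ensembl/MyGenome1/Sequence/STARIndex/"
      }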

Prepare Input Files

Prepare the following inputs before triggering the pipeline. Before processing your own data, we recommend running the example inputs with the example reference to confirm that the pipeline is properly installed.

Place the following input files in a directory of your choice on your computational platform:

  • R1.fastq.gz
  • R2.fastq.gz
  • Whitelist bead barcode file ${Tile_ID}_BeadBarcodes.txt (downloadable from the Curio website)
  • samplesheet.csv

  1. Paired FASTQ files

    • R1.fastq.gz
    • R2.fastq.gz

    Note: check that your FASTQ input files are properly formatted:

    • There should be only a single R1 and a single R2 FASTQ file. If you have multiple files (for example, one pair per lane), concatenate them into a single R1 and a single R2; a sketch is given after this list.
    • All FASTQ files should be in compressed gz format.
    • All reads in the R1 FASTQ file should be the same length.
    • All reads in the R1 FASTQ file should be at least 50 bp.
    • R1 and R2 FASTQ files must contain the same (paired) number of reads.
  2. Whitelist bead barcode file

    • This contains the mapping of barcode to spatial coordinate
    • ${Tile_ID}_BeadBarcodes.txt (Downloadable from Curio Website)
    • Be sure to unzip the downloaded whitelist bead barcode file before use.
  3. The sample sheet file

    • This file should have two rows: a header row and a data row
    sample,experiment_date,barcode_file,fastq_1,fastq_2,genome
    sample1,yyyy-mm-dd,/path/to/${Tile_ID}_BeadBarcodes.txt,/path/to/R1.fastq.gz,/path/to/R2.fastq.gz,GRCm38
    

    Note:

    • sample: sample name
    • Avoid using periods '.' in the sample column.
    • experiment_date: experiment date in yyyy-mm-dd format
    • Opening samplesheet.csv in MS Excel may change the format of experiment_date and make it unreadable.
    • barcode_file: full path to the whitelist barcode file
    • fastq_1: Full path to FASTQ read 1
    • fastq_2: Full path to FASTQ read 2
    • genome: the Reference ID of the genome (e.g. GRCm38). It is also the key value in the conf/igenomes.config file (if you are using a custom reference genome).
    • If using a custom reference, modify your sample sheet following the custom reference instructions above.
    • If running multiple samples in parallel, add each sample per line to the sample sheet. If executing on a single server workstation, the required memory will increase with the number of samples running in parallel.
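
If your sequencing run produced multiple FASTQ files per read (for example, one pair per lane), they can be combined into the single R1/R2 pair required above. A minimal sketch, with placeholder lane file names:

    cat sample1_L001_R1.fastq.gz sample1_L002_R1.fastq.gz > R1.fastq.gz
    cat sample1_L001_R2.fastq.gz sample1_L002_R2.fastq.gz > R2.fastq.gz

Keep the lane order identical for R1 and R2 so the concatenated files remain properly paired.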

Platform specific nextflow configuration

  • On a single-server Linux workstation: no additional config file is needed.

  • On a Slurm HPC: edit slurm.config to suit your HPC's configuration. For each label, set 'queue' to the name of the HPC partition closest in 'cpus' and 'memory' to the values listed in the example slurm.config, then adjust the 'cpus' and 'memory' values to those of that partition.

  • On an SGE HPC: edit sge.config to suit your HPC's configuration. For each label, set 'queue' to the name of the HPC queue closest in 'cpus' and 'memory' to the values listed in the example sge.config, then adjust the 'cpus' and 'memory' values to those of that queue.

Start the pipeline

  • On a single-server Linux workstation, using Singularity:

    cd curioseeker-2.5.0
    nextflow run main.nf \
         --input /path/to/samplesheet.csv \
         --outdir ${root_output_dir}/results/ \
         -work-dir ${root_output_dir}/work/ \
         --igenomes_base /path/to/references/ \
         -profile singularity
  • On a single-server Linux workstation, using Docker:

    cd curioseeker-2.5.0
    nextflow run main.nf \
         --input /path/to/samplesheet.csv \
         --outdir ${root_output_dir}/results/ \
         -work-dir ${root_output_dir}/work/ \
         --igenomes_base /path/to/references/ \
         -profile docker
  • On a Slurm HPC, using Singularity:

    cd curioseeker-2.5.0
    nextflow run main.nf \
        --input /path/to/samplesheet.csv \
        --outdir ${root_output_dir}/results/ \
        -work-dir ${root_output_dir}/work/ \
        --igenomes_base /path/to/references/ \
        -profile slurm -config /path/to/slurm.config
  • On an SGE HPC:

    cd curioseeker-2.5.0
    nextflow run main.nf \
        --input /path/to/samplesheet.csv \
        --outdir ${root_output_dir}/results/ \
        -work-dir ${root_output_dir}/work/ \
        --igenomes_base /path/to/references/ \
        -profile sge -config /path/to/sge.config

Note:

  • /path/to/references/ is the path to the base directory that contains the reference folder (e.g. Mus_musculus) for the genomic reference.
  • -resume may be added to resume a previously interrupted execution.
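
For example, to resume the single-workstation Singularity run shown above after an interruption, re-issue the same command with -resume appended:

    nextflow run main.nf \
         --input /path/to/samplesheet.csv \
         --outdir ${root_output_dir}/results/ \
         -work-dir ${root_output_dir}/work/ \
         --igenomes_base /path/to/references/ \
         -profile singularity -resume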

Advanced settings

If using Singularity, Singularity images for public tools are pulled into the work folder during each pipeline run. To save storage space, these images can be stored once in a dedicated folder and reused in future runs. This is achieved by setting the environment variable NXF_SINGULARITY_CACHEDIR to that folder, as shown below. Replace ${path_to} with the full path to the folder .singularity on your platform.

$ export NXF_SINGULARITY_CACHEDIR=${path_to}/.singularity

All public singularity images pulled during a pipeline run will be saved in this folder. As long as this env var is defined before a run, the images can be reused for future pipeline runs.
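
To make this setting persist across sessions, the export can be added to your shell startup file (a minimal sketch, assuming a bash shell; replace /path/to with the real location):

$ mkdir -p /path/to/.singularity
$ echo 'export NXF_SINGULARITY_CACHEDIR=/path/to/.singularity' >> ~/.bashrc
$ source ~/.bashrc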

Output

After a successful run, output can be found in ${root_output_dir}/results/:

${root_output_dir}/results
 └── OUTPUT
     └── sample1
         ├── sample1_Report.html                            - Run report
         ├── sample1_MoleculesPerMatchedBead.mtx            - Expression table Sparse Matrix (column=BeadBarcodes, row=Genes)
         ├── sample1_barcodes.tsv                           - Expression table BeadBarcode
         ├── sample1_genes.tsv                              - Expression table Genes
         ├── sample1_MatchedBeadLocation.csv                - Spatial coordinates of BeadBarcodes
          ├── sample1_seurat.rds                             - Seurat object with expression table, spatial coordinates, clustering results
          ├── sample1_anndata.h5ad                           - Anndata object with expression table, spatial coordinates, clustering results
         ├── sample1_cluster_assignment.txt                 - BeadBarcode cluster assignment
         ├── sample1_variable_features_clusters.txt         - Top cluster defining genes
         ├── sample1_variable_features_spatial_moransi.txt  - Top spatially variable genes
         ├── sample1_Metrics.csv                            - Run metrics
          └── sample1_Report_files                           - Individual plots in sample1_Report.html
     └── sample2
     └── sample3
     ...

Credits

Curio Seeker was originally written by Curio Bioscience.