A comprehensive pipeline for single-cell Perturb-Seq analysis that enables robust processing and analysis of CRISPR screening data at single-cell resolution.
Nextflow and Singularity must be installed before running the pipeline:
Nextflow: workflow manager for executing the pipeline. Install it with:
conda install bioconda::nextflow
Singularity: container platform that must be available in your execution environment.
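Before launching, it can help to confirm that both prerequisites are reachable from your shell. The following is a minimal sketch, not part of the pipeline: the helper name `missing_tools` is hypothetical, and it assumes the executables are called `nextflow` and `singularity` (some installations may expose the container runtime under a different name).

```python
import shutil


def missing_tools(tools=("nextflow", "singularity"), which=shutil.which):
    """Return the tool names that cannot be found on PATH.

    `tools` holds the assumed executable names; adjust them if your
    environment uses different ones.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    absent = missing_tools()
    if absent:
        print("Not found on PATH:", ", ".join(absent))
    else:
        print("All prerequisites found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.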
Nextflow Tower (optional) is a pipeline execution monitoring system that offers a web-based interface for workflow management.
To enable Nextflow Tower, you need a TOWER_ACCESS_TOKEN.
To obtain your token:
- Create/login to your account at cloud.tower.nf
- Navigate to Settings > Your tokens
- Click "Add token" and generate a new token
- Set as environment variable:
export TOWER_ACCESS_TOKEN=your_token_here
If you do not want this feature, disable it in the tower block at the bottom of input.config:
tower {
    enabled = false
    accessToken = "${TOWER_ACCESS_TOKEN ?: ''}"
}
To install the pipeline:
git clone https://github.com/pinellolab/CRISPR_Pipeline.git
- {sample}_R1.fastq.gz: Contains cell barcode and UMI sequences
- {sample}_R2.fastq.gz: Contains transcript sequences
- rna_seqspec.yml: Defines RNA sequencing structure and parameters
- guide_seqspec.yml: Specifies guide RNA detection parameters
- hash_seqspec.yml: Defines cell hashing structure (required if using cell hashing)
- whitelist.txt: List of valid cell barcodes
- guide_metadata.tsv: Contains guide RNA information and annotations
- hash_metadata.tsv: Cell hashing sample information (required if using cell hashing)
- pairs_to_test.csv: Defines perturbation pairs for comparison analysis (required if testing predefined pairs)
This pipeline requires a specific data structure to function properly. Below is an overview of the required directory organization:
example_data/
├── fastq_files/
│   ├── {sample}_R1.fastq.gz
│   ├── {sample}_R2.fastq.gz
│   └── ...
├── yaml_files/
│   ├── rna_seqspec.yml
│   ├── guide_seqspec.yml
│   ├── hash_seqspec.yml (required if using cell hashing)
│   └── whitelist.txt
├── guide_metadata.tsv
├── hash_metadata.tsv (required if using cell hashing)
└── pairs_to_test.csv (required if testing predefined pairs)
For detailed specifications, see our documentation.
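The directory layout above can be checked before a run. The sketch below is illustrative only and is not shipped with the pipeline: the function name `missing_inputs` and its flags `use_hashing` and `test_pairs` are hypothetical, and it assumes FASTQ files follow the `{sample}_R1.fastq.gz` / `{sample}_R2.fastq.gz` naming shown in the tree.

```python
from pathlib import Path

# Relative paths taken from the directory layout described above.
REQUIRED = [
    "yaml_files/rna_seqspec.yml",
    "yaml_files/guide_seqspec.yml",
    "yaml_files/whitelist.txt",
    "guide_metadata.tsv",
]
HASHING = ["yaml_files/hash_seqspec.yml", "hash_metadata.tsv"]
PAIRS = ["pairs_to_test.csv"]

R1_SUFFIX = "_R1.fastq.gz"
R2_SUFFIX = "_R2.fastq.gz"


def missing_inputs(root, use_hashing=False, test_pairs=False):
    """Return relative paths expected under `root` that are absent."""
    root = Path(root)
    expected = list(REQUIRED)
    if use_hashing:
        expected += HASHING
    if test_pairs:
        expected += PAIRS
    missing = [rel for rel in expected if not (root / rel).is_file()]

    # Expect at least one R1 file, and an R2 mate for every R1 file.
    fastq_dir = root / "fastq_files"
    r1_files = sorted(fastq_dir.glob("*" + R1_SUFFIX)) if fastq_dir.is_dir() else []
    if not r1_files:
        missing.append("fastq_files/*" + R1_SUFFIX)
    for r1 in r1_files:
        mate_name = r1.name[: -len(R1_SUFFIX)] + R2_SUFFIX
        if not (fastq_dir / mate_name).is_file():
            missing.append("fastq_files/" + mate_name)
    return missing


if __name__ == "__main__":
    for rel in missing_inputs("example_data"):
        print("missing:", rel)
```

An empty result means every file named in the layout was found; it does not validate file contents or formats.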
Make sure to specify your data paths and analysis parameters in configs/pipeline.config.
Configure input.config to match your computing environment. For example:
withName:process_name {
    cpus = 4        // Number of CPU cores per mapping process (default: 4)
    memory = 64.GB  // RAM allocation per mapping process (default: 64GB)
}
Note: Start with these default values and adjust based on your dataset size and system capabilities.
First, make the scripts executable:
chmod +x bin/*
Export your Nextflow Tower token:
export TOWER_ACCESS_TOKEN=your_token_here
Launch the pipeline:
nextflow run main.nf -c input.config
- Watch the terminal output for progress updates
- Check the Nextflow log file for detailed execution logs
- Memory errors: Increase the memory parameter in input.config
- Missing files: Double-check the configured paths against the actual files in example_data
The output files will be generated in the pipeline_outputs and pipeline_dashboard directories.
Within the pipeline_outputs directory, you will find:
- inference_mudata.h5mu - MuData format output
- per_element_output.tsv - Per-element analysis
- per_guide_output.tsv - Per-guide analysis
pipeline_outputs/
├── inference_mudata.h5mu
├── per_element_output.tsv
└── per_guide_output.tsv
For details, see our documentation.
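For a quick look at the tabular outputs, a small script can report their columns and row counts. The column layout of these files is not described here, so this sketch makes no assumption about column names and only assumes the files are tab-separated with a header row. The helper name `summarize_tsv` is hypothetical.

```python
import csv
from pathlib import Path


def summarize_tsv(path):
    """Return (column names, number of data rows) for a tab-separated file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])  # empty list if the file has no lines
        n_rows = sum(1 for _ in reader)
    return header, n_rows


if __name__ == "__main__":
    for name in ("per_element_output.tsv", "per_guide_output.tsv"):
        path = Path("pipeline_outputs") / name
        if path.is_file():
            columns, n_rows = summarize_tsv(path)
            print(f"{name}: {n_rows} rows, columns: {columns}")
        else:
            print(f"{name}: not found")
```

The inference_mudata.h5mu file is not covered by this sketch because reading it requires a MuData-aware library rather than the standard library.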
The pipeline produces several figures. Within the pipeline_dashboard directory, you will find:
Evaluation Output:
- network_plot.png: Gene interaction networks visualization.
- volcano_plot.png: gRNA-gene pairs analysis.
- IGV files (igv.bedgraph, igv.bedpe): Genome browser visualization files.
Analysis Figures:
- knee_plot_scRNA.png: Knee plot of UMI counts vs. barcode index.
- scatterplot_scrna.png: Scatterplot of total counts vs. genes detected, colored by mitochondrial content.
- violin_plot.png: Distribution of gene counts, total counts, and mitochondrial content.
- scRNA_barcodes_UMI_thresholds.png: Number of scRNA barcodes at different total UMI thresholds.
- guides_per_cell_histogram.png: Histogram of guides per cell.
- cells_per_guide_histogram.png: Histogram of cells per guide.
- guides_UMI_thresholds.png: Simulated final number of cells with assigned guides at different minimum thresholds (at least one guide > threshold value). Use it to check whether the number of cells with assigned guides matches your expected number of cells.
- guides_UMI_thresholds.png: Histogram of the number of sgRNAs represented per cell.
- cells_per_htp_barplot.png: Number of cells across different HTOs.
- umap_hto.png: UMAP clustering of cells based on HTOs (the dimensions represent the distribution of HTOs in each cell).
- umap_hto_singlets.png: UMAP clustering of cells based on HTOs (multiplets removed).
seqSpec Plots:
- seqSpec_check_plots.png: Frequency of each nucleotide along Read 1 and Read 2. Use it to check that each part of the read shows its expected signature.
pipeline_dashboard/
├── dashboard.html
├── evaluation_output/
│   ├── network_plot.png
│   ├── volcano_plot.png
│   ├── igv.bedgraph
│   └── igv.bedpe
├── figures/
│   ├── knee_plot_scRNA.png
│   ├── scatterplot_scrna.png
│   ├── violin_plot.png
│   ├── scRNA_barcodes_UMI_thresholds.png
│   ├── guides_per_cell_histogram.png
│   ├── cells_per_guide_histogram.png
│   ├── guides_UMI_thresholds.png
│   ├── cells_per_htp_barplot.png
│   ├── umap_hto.png
│   └── umap_hto_singlets.png
├── guide_seqSpec_plots/
│   └── seqSpec_check_plots.png
└── hashing_seqSpec_plots/
    └── seqSpec_check_plots.png
To ensure proper pipeline functionality, we provide two extensively validated datasets for testing purposes.
The TF_Perturb_Seq_Pilot dataset was generated by the Gary Hon Lab and is available through the IGVF Data Portal under Analysis Set ID: IGVFDS4389OUWU. To access the FASTQ files:
First, register for an account on the IGVF Data Portal to obtain your access credentials.
Once you have your credentials, you can use our provided Python script to download all necessary FASTQ files:
cd example_data
python download_fastq.py \
    --sample per-sample_file.tsv \
    --access-key YOUR_ACCESS_KEY \
    --secret-key YOUR_SECRET_KEY
Note: You'll need to replace YOUR_ACCESS_KEY and YOUR_SECRET_KEY with the credentials from your IGVF portal account. These credentials can be found in your IGVF portal profile settings.
All other required input files for running the pipeline with this dataset are already included in the repository under the example_data directory.
This dataset comes from a large-scale CRISPR screen study published in Cell (Gasperini et al., 2019: "A Genome-wide Framework for Mapping Gene Regulation via Cellular Genetic Screens") and provides an excellent resource for testing the pipeline. The full dataset, including raw sequencing data and processed files, is publicly available through GEO under accession number GSE120861.
Environment Setup
# Clone and enter the repository
git clone https://github.com/pinellolab/CRISPR_Pipeline.git
cd CRISPR_Pipeline
Choose Your Dataset and Follow the Corresponding Instructions:
# Run with default configuration
nextflow run main.nf -c input.config
Set up the configuration files:
# Copy configuration files and example data
cp -r example_gasperini/configs/* configs/
cp -r example_gasperini/example_data/* example_data/
Obtain sequencing data:
- Download a subset of the Gasperini dataset to your own server.
- Place files in example_data/fastq_files
NTHREADS=16
wget https://github.com/10XGenomics/bamtofastq/releases/download/v1.4.1/bamtofastq_linux; chmod +x bamtofastq_linux
wget https://sra-pub-src-1.s3.amazonaws.com/SRR7967488/pilot_highmoi_screen.1_CGTTACCG.grna.bam.1;mv pilot_highmoi_screen.1_CGTTACCG.grna.bam.1 pilot_highmoi_screen.1_CGTTACCG.grna.bam
./bamtofastq_linux --nthreads="$NTHREADS" pilot_highmoi_screen.1_CGTTACCG.grna.bam bam_pilot_guide_1
wget https://sra-pub-src-1.s3.amazonaws.com/SRR7967482/pilot_highmoi_screen.1_SI_GA_G1.bam.1;mv pilot_highmoi_screen.1_SI_GA_G1.bam.1 pilot_highmoi_screen.1_SI_GA_G1.bam
./bamtofastq_linux --nthreads="$NTHREADS" pilot_highmoi_screen.1_SI_GA_G1.bam bam_pilot_scrna_1
Now you should see the bam_pilot_guide_1 and bam_pilot_scrna_1 directories inside the example_data/fastq_files directory. Inside bam_pilot_guide_1, there are multiple sets of FASTQ files.
Prepare the whitelist:
# Extract the compressed whitelist file
unzip example_data/yaml_files/3M-february-2018.txt.zip
Now you should see the extracted whitelist file.
Launch the pipeline:
# Run with Gasperini configuration
nextflow run main.nf -c input.config
The pipeline generates two directories upon completion:
- pipeline_outputs: Contains all analysis results
- pipeline_dashboard: Houses interactive visualization reports
If you encounter any issues during testing:
- Review log files and intermediate results in the Nextflow work directory
- Verify that all input files meet the required format specifications
For additional support or questions, please open an issue on our GitHub repository.