Add local and SLURM deployment script and populate README (#13)
Added local and slurm files and updated README
alejoe91 authored Jun 13, 2024
1 parent f28ba10 commit a80dd67
Showing 9 changed files with 1,386 additions and 10 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,4 +1,6 @@
/data/
/*data*/
/results/
/*results*/
/pipeline/work/
**/.nextflow
**/.nextflow*
312 changes: 303 additions & 9 deletions README.md
@@ -3,12 +3,306 @@

Electrophysiology analysis pipeline using [Kilosort2.5](https://github.com/MouseLand/Kilosort/tree/v2.5) via [SpikeInterface](https://github.com/SpikeInterface/spikeinterface).

The pipeline includes:

- preprocessing: phase_shift, highpass filter, denoising (bad channel removal + common median reference ("cmr") or highpass spatial filter - "destripe"), and motion estimation (optionally correction)
- spike sorting: with Kilosort2.5
- postprocessing: remove duplicate units, compute amplitudes, spike/unit locations, PCA, correlograms, template similarity, template metrics, and quality metrics
- curation: based on ISI violation ratio, presence ratio, and amplitude cutoff
- unit classification based on pre-trained classifier (noise, MUA, SUA)
- visualization: timeseries, drift maps, and sorting output in sortingview
- export session, subject, and units data to NWB
The pipeline is based on [Nextflow](https://www.nextflow.io/) and it includes the following steps:

- [job-dispatch](https://github.com/AllenNeuralDynamics/aind-ephys-job-dispatch/): generates a list of JSON files to be processed in parallel. Parallelization is performed over multiple probes and multiple shanks (e.g., for NP2-4shank probes). The steps from `preprocessing` to `visualization` are run in parallel.
- [preprocessing](https://github.com/AllenNeuralDynamics/aind-ephys-preprocessing/): phase_shift, highpass filter, denoising (bad channel removal + common median reference ("cmr") or highpass spatial filter - "destripe"), and motion estimation (optionally correction)
- [spike sorting](https://github.com/AllenNeuralDynamics/aind-ephys-spikesort-kilosort25/): with Kilosort2.5
- [postprocessing](https://github.com/AllenNeuralDynamics/aind-ephys-postprocessing/): remove duplicate units, compute amplitudes, spike/unit locations, PCA, correlograms, template similarity, template metrics, and quality metrics
- [curation](https://github.com/AllenNeuralDynamics/aind-ephys-curation/): based on ISI violation ratio, presence ratio, and amplitude cutoff
- [unit classification](https://github.com/AllenNeuralDynamics/aind-ephys-unit-classification/): based on pre-trained classifier (noise, MUA, SUA)
- [visualization](https://github.com/AllenNeuralDynamics/aind-ephys-visualization/): timeseries, drift maps, and sorting output in [figurl](https://github.com/flatironinstitute/figurl/blob/main/README.md)
- [result collection](https://github.com/AllenNeuralDynamics/aind-ephys-result-collector/): this step collects the output of all parallel jobs and copies the output folders to the results folder
- export to NWB: creates NWB output files. Each file can contain multiple streams (e.g., probes), but only a continuous chunk of data (such as an Open Ephys experiment+recording or an NWB `ElectricalSeries`). This step includes additional sub-steps:
- [session and subject](https://github.com/AllenNeuralDynamics/NWB_Packaging_Subject_Capsule)
- [units](https://github.com/AllenNeuralDynamics/NWB_Packaging_Units)

Each step is run in a container and can be deployed on several platforms.
See the [Local deployment](#local-deployment) and [SLURM deployment](#slurm-deployment) sections for more details.

# Input

Currently, the pipeline supports the following input data types:

- `aind`: data ingestion used at AIND. The input folder must contain an `ecephys` subfolder which in turn includes an `ecephys_clipped` (clipped Open Ephys folder) and an `ecephys_compressed` (compressed traces with Zarr). In addition, JSON files following the [aind-data-schema](https://aind-data-schema.readthedocs.io/en/latest/) are parsed to create processing and NWB metadata.
- `spikeglx`: the input folder should be a SpikeGLX folder. It is recommended to add a `subject.json` and a `data_description.json` following the [aind-data-schema](https://aind-data-schema.readthedocs.io/en/latest/) specification, since these metadata are propagated to the NWB files.
- (WIP) `nwb`: the input folder should contain a single NWB file (both HDF5 and Zarr backend are supported).

For more information on how to select the input mode and set additional parameters,
see the [Local deployment - Additional parameters](#additional-parameters) section.

# Output

The output of the pipeline is saved to the `RESULTS_PATH`.
Since the output is produced using SpikeInterface, it is recommended to go through
[its documentation](https://spikeinterface.readthedocs.io/en/0.100.7/) to understand how to easily
load and interact with the data.

The output includes the following files and folders:

**`preprocessed`**

This folder contains the output of preprocessing, including preprocessed JSON files associated to each stream and
motion folders containing the estimated motion.
The preprocessed JSON files can be used to re-instantiate the recordings, provided that the raw data folder is
mapped to the same location as the input of the pipeline.

In this case, the preprocessed recording can be loaded as a `spikeinterface.BaseRecording` with:
```python
import spikeinterface as si

recording_preprocessed = si.load_extractor("path-to-preprocessed.json", base_folder="path-to-raw-data-parent")
```

The motion folders can be loaded as:
```python
import spikeinterface.preprocessing as spre

# returns a dict-like object holding the motion estimate and its bins
motion_info = spre.load_motion_info("path-to-motion-folder")
```
They include the `motion`, `temporal_bins`, and `spatial_bins` fields, which can be used to visualize the
estimated motion.
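
As a rough illustration of how these fields relate, the sketch below summarizes the drift at each depth. It assumes `motion` is a 2D array-like with one row per temporal bin and one column per spatial bin; the helper name `motion_range_per_depth` and the toy values are hypothetical and not part of the pipeline.

```python
def motion_range_per_depth(motion, spatial_bins):
    """Return {depth: peak-to-peak displacement} for each spatial bin.

    Assumes motion[t][j] is the displacement at temporal bin t and
    spatial bin j (this layout is an assumption of the sketch).
    """
    ranges = {}
    for j, depth in enumerate(spatial_bins):
        column = [float(row[j]) for row in motion]
        ranges[float(depth)] = max(column) - min(column)
    return ranges


# Toy values for illustration only: 3 temporal bins x 2 spatial bins
motion = [[0.0, 1.0], [2.0, 1.5], [-1.0, 3.0]]
spatial_bins = [100.0, 500.0]

drift = motion_range_per_depth(motion, spatial_bins)
print(drift)
```

With real data, the `motion` and `spatial_bins` fields of the loaded motion folder would take the place of the toy values.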

**`spikesorted`**

This folder contains the raw spike sorting outputs from Kilosort2.5 for each stream.

It can be loaded as a `spikeinterface.BaseSorting` with:
```python
import spikeinterface as si

sorting_raw = si.load_extractor("path-to-spikesorted-folder")
```

**`postprocessed`**

This folder contains the output of the post-processing for each stream. It can be loaded as a
`spikeinterface.WaveformExtractor` with:
```python
import spikeinterface as si

waveform_extractor = si.load_waveforms("path-to-postprocessed-folder", with_recording=False)
```

The `waveform_extractor` includes many computed extensions. This example shows how to load some of them:
```python
unit_locations = waveform_extractor.load_extension("unit_locations").get_data()
# unit_locations is a np.array with the estimated locations

qm = waveform_extractor.load_extension("quality_metrics").get_data()
# qm is a pandas.DataFrame with the computed quality metrics
```

**`curated`**

This folder contains the curated spike sorting outputs, after unit deduplication, quality-metric curation
and automatic unit classification.

It can be loaded as a `spikeinterface.BaseSorting` with:
```python
import spikeinterface as si

sorting_curated = si.load_extractor("path-to-curated-folder")
```

The `sorting_curated` object contains the following curation properties (which can be retrieved with
`sorting_curated.get_property(property_name)`):

- `default_qc`: `True` if the unit passes the quality-metric-based curation, `False` otherwise
- `decoder_label`: either `noise`, `MUA` or `SUA`
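
These two properties can be combined to pick the units to keep for analysis. The following is a minimal sketch on plain Python sequences; the helper name `select_units` and the toy values are hypothetical, and the choice to keep only `SUA` units is just an example.

```python
def select_units(unit_ids, default_qc, decoder_label, keep_labels=("SUA",)):
    """Return the unit ids that pass QC and whose label is in keep_labels."""
    return [
        unit_id
        for unit_id, passed, label in zip(unit_ids, default_qc, decoder_label)
        if bool(passed) and str(label) in keep_labels
    ]


# Toy values for illustration only
unit_ids = [0, 1, 2, 3]
default_qc = [True, False, True, True]
decoder_label = ["SUA", "SUA", "noise", "MUA"]

good_units = select_units(unit_ids, default_qc, decoder_label)
print(good_units)
```

With real data, the inputs would come from `sorting_curated.unit_ids` and `sorting_curated.get_property(...)`, and the resulting ids could be passed to `sorting_curated.select_units(good_units)`.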

**`nwb`**

This folder contains the generated NWB files.

**`visualization_output.json`**

This JSON file contains the generated Figurl links for each stream, including a `timeseries` and a `sorting_summary`
view.
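
The links can be listed with a few lines of Python. This sketch assumes the file is a mapping of stream name to a mapping of view name to URL; that layout is an assumption, not something documented here, so inspect the actual file first. The stream name and URLs in the demo are placeholders.

```python
import json
import tempfile
from pathlib import Path


def list_figurl_links(json_path):
    """Return (stream_name, view_name, url) tuples from the JSON file.

    Assumes a {stream_name: {view_name: url}} layout (hypothetical).
    """
    with open(json_path) as f:
        links = json.load(f)
    rows = []
    for stream_name, views in links.items():
        for view_name, url in views.items():
            rows.append((stream_name, view_name, url))
    return rows


# Demo on a placeholder file written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "visualization_output.json"
    demo_content = {"stream-A": {"timeseries": "url-1", "sorting_summary": "url-2"}}
    demo_path.write_text(json.dumps(demo_content))
    rows = list_figurl_links(demo_path)

for stream_name, view_name, url in rows:
    print(f"{stream_name} / {view_name}: {url}")
```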

**`processing.json`**

This JSON file logs all the processing steps, parameters, and execution times.

**`nextflow`**

All files generated by Nextflow are saved here.


# Parameters

Some steps of the pipeline accept additional parameters, which can be passed as follows:

```bash
--{step_name}_args "{args}"
```
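
When the command is assembled by a script, the quoting of `{args}` matters because each step receives its arguments as a single string. The sketch below builds the tokens for two steps; the helper name `build_step_args` is hypothetical.

```python
import shlex


def build_step_args(step_args):
    """Turn {"preprocessing": "--debug"} into ["--preprocessing_args", "--debug"]."""
    tokens = []
    for step_name, args in step_args.items():
        # each step gets one flag followed by one (possibly multi-word) value
        tokens.extend([f"--{step_name}_args", args])
    return tokens


tokens = build_step_args(
    {
        "job_dispatch": "--input spikeglx",
        "preprocessing": "--debug --debug-duration 120",
    }
)

# shlex.join quotes multi-word values so they survive a shell
command_line = shlex.join(tokens)
print(command_line)
```

The token list can be appended as-is to a `subprocess` argument list, while `command_line` is the form to paste into a shell command.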

The steps that accept additional arguments are:

### `job_dispatch_args`:

```bash
--concatenate Whether to concatenate recordings (segments) or not. Default: False
--input {aind,spikeglx,nwb}
Which 'loader' to use. Default 'aind'
```

- `aind`: data ingestion used at AIND. The `DATA_PATH` must contain an `ecephys` subfolder which in turn includes an `ecephys_clipped` (clipped Open Ephys folder) and an `ecephys_compressed` (compressed traces with Zarr). In addition, JSON files following the [aind-data-schema](https://aind-data-schema.readthedocs.io/en/latest/) are parsed to create processing and NWB metadata.
- `spikeglx`: the `DATA_PATH` should contain a SpikeGLX saved folder. It is recommended to add a `subject.json` and a `data_description.json` following the [aind-data-schema](https://aind-data-schema.readthedocs.io/en/latest/) specification, since these metadata are propagated to the NWB files.
- (WIP) `nwb`: the `DATA_PATH` should contain an NWB file (both HDF5 and Zarr backend are supported).

### `preprocessing_args`:

```bash
--debug Whether to run in DEBUG mode
--denoising {cmr,destripe}
Which denoising strategy to use. Can be 'cmr' or 'destripe'. Default 'cmr'
--no-remove-out-channels
Whether to remove out channels
--no-remove-bad-channels
Whether to remove bad channels
--max-bad-channel-fraction MAX_BAD_CHANNEL_FRACTION
Maximum fraction of bad channels to remove. If more than this fraction, processing is skipped
--motion {skip,compute,apply}
How to deal with motion correction. Can be 'skip', 'compute', or 'apply'. Default 'compute'
--motion-preset {nonrigid_accurate,kilosort_like,nonrigid_fast_and_accurate}
What motion preset to use. Can be 'nonrigid_accurate', 'kilosort_like', or 'nonrigid_fast_and_accurate'. Default "nonrigid_fast_and_accurate"
--debug-duration DEBUG_DURATION
Duration of clipped recording in debug mode. Default is 30 seconds. Only used if debug is enabled
```
### `nwb_subject_args`:
```bash
--backend {hdf5,zarr}
NWB backend. It can be either 'hdf5' or 'zarr'. Default 'zarr'

```
In Nextflow, the `-resume` argument enables the caching mechanism.
# Local deployment
## Requirements
To deploy locally, you need to install:
- `nextflow`
- `docker`
- `figurl` (optional, for cloud visualization)
Please check out the [Nextflow](https://www.nextflow.io/docs/latest/install.html) and [Docker](https://docs.docker.com/engine/install/) installation instructions.
To install and configure `figurl`, you need to follow these instructions to set up [`kachery-cloud`]():
1. On your local machine, run `pip install kachery-cloud`
2. Run `kachery-cloud-init`, open the printed URL link and login with your GitHub account
3. Go to `https://kachery-gateway.figurl.org/?zone=default` and create a new Client:
- Click on the `Client` tab on the left
- Add a new client (you can choose any label)
4. Set kachery-cloud credentials on your local machine:
- Click on the newly created client
- Set the `KACHERY_CLOUD_CLIENT_ID` environment variable to the `Client ID` content
- Set the `KACHERY_CLOUD_PRIVATE_KEY` environment variable to the `Private Key` content
- (optional) If using a custom Kachery zone, set `KACHERY_ZONE` environment variable to your zone
By default, `kachery-cloud` will use the `default` zone, which is hosted by the Flatiron Institute.
If you plan to use this service extensively, it is recommended to
[create your own kachery zone](https://github.com/flatironinstitute/kachery-cloud/blob/main/doc/create_kachery_zone.md).
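Before launching a run, it can help to check that the credentials are visible to the process. This is a minimal sketch; the helper name `missing_kachery_vars` and the placeholder value are hypothetical, and `KACHERY_ZONE` is left out because it is optional.

```python
import os

REQUIRED_VARS = ("KACHERY_CLOUD_CLIENT_ID", "KACHERY_CLOUD_PRIVATE_KEY")


def missing_kachery_vars(env=None):
    """Return the required kachery-cloud variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Demo on a placeholder environment with only the client ID set
missing = missing_kachery_vars({"KACHERY_CLOUD_CLIENT_ID": "placeholder-id"})
print(missing)
```

Calling `missing_kachery_vars()` without arguments checks the current environment instead.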
## Run
Clone this repo (`git clone https://github.com/AllenNeuralDynamics/aind-ephys-pipeline-kilosort25.git`) and go to the
`pipeline` folder. You will find a `main_local.nf`. This Nextflow script is accompanied by the
`nextflow_local.config` and can run on local workstations/machines.
To invoke the pipeline you can run the following command:
```bash
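# export first, so that $RESULTS_PATH is already set when the shell expands the -log argument
export DATA_PATH=$PWD/../data
export RESULTS_PATH=$PWD/../results
NXF_VER=22.10.8 nextflow -C nextflow_local.config \
-log $RESULTS_PATH/nextflow/nextflow.log \
run main_local.nf \
--n_jobs 8 -resume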
```
The `DATA_PATH` specifies the folder where the input files are located.
The `RESULTS_PATH` points to the output folder, where the data will be saved.
The `--n_jobs` argument specifies the number of parallel jobs to run.
Additional parameters can be passed as described in the [Parameters](#parameters) section.
## Example run command
As an example, here is how to run the pipeline on a SpikeGLX dataset in debug mode
on a 120-second snippet of the recording with 16 jobs:
```bash
NXF_VER=22.10.8 DATA_PATH=path/to/data_spikeglx RESULTS_PATH=path/to/results_spikeglx \
nextflow -C nextflow_local.config run main_local.nf --n_jobs 16 \
--job_dispatch_args "--input spikeglx" --preprocessing_args "--debug --debug-duration 120"
```
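The same example can be driven from Python, which makes it explicit that `NXF_VER`, `DATA_PATH`, and `RESULTS_PATH` are passed through the process environment. This is a sketch, not part of the repository: the helper names `build_nextflow_call` and `launch` are hypothetical, and `launch` is defined but not called here.

```python
import os
import subprocess


def build_nextflow_call(data_path, results_path, n_jobs, extra_tokens=()):
    """Return (cmd, env) for the local pipeline run shown above."""
    env = dict(
        os.environ,
        NXF_VER="22.10.8",
        DATA_PATH=str(data_path),
        RESULTS_PATH=str(results_path),
    )
    cmd = [
        "nextflow", "-C", "nextflow_local.config",
        "run", "main_local.nf",
        "--n_jobs", str(n_jobs),
        *extra_tokens,
    ]
    return cmd, env


def launch(cmd, env, pipeline_folder="pipeline"):
    """Run the command from the pipeline folder (requires nextflow and docker)."""
    return subprocess.run(cmd, env=env, cwd=pipeline_folder, check=True)


cmd, env = build_nextflow_call(
    "path/to/data_spikeglx",
    "path/to/results_spikeglx",
    16,
    extra_tokens=[
        "--job_dispatch_args", "--input spikeglx",
        "--preprocessing_args", "--debug --debug-duration 120",
    ],
)
print(" ".join(cmd))
```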
# SLURM deployment
To deploy on a SLURM cluster, you need to have access to a SLURM cluster and have the
[Nextflow](https://www.nextflow.io/docs/latest/install.html) and Singularity/Apptainer installed.
To use Figurl cloud visualizations, follow the same steps described in the
[Local deployment - Requirements](#requirements) section and set the KACHERY environment variables.
Then, you can submit the pipeline to the cluster similarly to the Local deployment,
but wrapping the command into a script that can be launched with `sbatch`.
You can use the `slurm_submit.sh` script as a template to submit the pipeline to your cluster.
```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=4GB
#SBATCH --time=2:00:00
### change {your-partition} to the partition/queue on your cluster
#SBATCH --partition={your-partition}
# modify this section to make the nextflow command available to your environment
# e.g., using a conda environment with nextflow installed
conda activate env_nf
PIPELINE_PATH="path-to-your-cloned-repo"
DATA_PATH="path-to-data-folder"
RESULTS_PATH="path-to-results-folder"
WORKDIR="path-to-large-workdir"
# additional parameters (e.g., --preprocessing_args) go before -resume
NXF_VER=22.10.8 DATA_PATH=$DATA_PATH RESULTS_PATH=$RESULTS_PATH nextflow \
-C $PIPELINE_PATH/pipeline/nextflow_slurm.config \
-log $RESULTS_PATH/nextflow/nextflow.log \
run $PIPELINE_PATH/pipeline/main_slurm.nf \
-work-dir $WORKDIR \
--preprocessing_args "--debug --debug-duration 120" \
-resume
```
You should change the `--partition` parameter to match the partition you want to use on your cluster and point to the correct paths and parameters.
Then, you can submit the script to the cluster with:
```bash
sbatch slurm_submit.sh
```
# Create a custom layer for data ingestion
The default job-dispatch step only supports loading data
from AIND folders, SpikeGLX folders, and NWB files.
To ingest other types of data, you can create a similar repo and modify the way that the job list is created
(see the [job dispatch README](https://github.com/AllenNeuralDynamics/aind-ephys-job-dispatch/blob/main/README.md) for more details).
Then you can modify the `job_dispatch` process in `main_local.nf` or `main_slurm.nf` to point to your custom job dispatch repo.
41 changes: 41 additions & 0 deletions environment/Dockerfile_base
@@ -0,0 +1,41 @@
FROM continuumio/miniconda3:23.9.0-0

ARG DEBIAN_FRONTEND=noninteractive


RUN apt-get update \
&& apt-get install -y --no-install-recommends \
build-essential \
git \
fonts-freefont-ttf=20120503-10 \
&& rm -rf /var/lib/apt/lists/*

# correct mapping to make libvips work
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libffi.so.7

# install libvips
RUN apt-get update \
&& apt-get install -y libvips libvips-dev libvips-tools libtiff5-dev

# install default fonts
RUN apt-get install -y fonts-freefont-ttf

# needed for motion estimation
RUN pip install --no-cache-dir torch==2.2.0

RUN pip install --no-cache-dir \
aind-data-schema==0.38.0 \
pyvips==2.2.1 \
wavpack-numcodecs==0.1.5 \
pynwb==2.8.0 \
hdmf-zarr==0.8.0 \
spikeinterface[full,widgets]==0.100.7

RUN pip install --no-cache-dir --no-deps aind-metadata-upgrader==0.0.8

# NEO installation from source with a SpikeGLX fix
RUN pip uninstall -y neo && \
git clone https://github.com/alejoe91/python-neo.git && \
cd python-neo && git checkout cf543aa7b85124193a8d1cf726a92de54b360026 && \
pip install --no-cache-dir . && \
cd ..
22 changes: 22 additions & 0 deletions environment/Dockerfile_nwb
@@ -0,0 +1,22 @@
FROM continuumio/miniconda3:23.9.0-0

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update \
&& apt-get install -y --no-install-recommends \
build-essential \
&& rm -rf /var/lib/apt/lists/*

RUN pip install -U --no-cache-dir \
hdmf-zarr==0.8.0 \
pynwb==2.8.0 \
neuroconv==0.4.10 \
zarr==2.17.2 \
wavpack-numcodecs==0.1.5 \
spikeinterface[full]==0.100.7

RUN pip uninstall -y neo && \
git clone https://github.com/alejoe91/python-neo.git && \
cd python-neo && git checkout cf543aa7b85124193a8d1cf726a92de54b360026 && \
pip install --no-cache-dir . && \
cd ..