scran, in Python

Overview

The scranpy package provides Python bindings to the single-cell analysis methods in the libscran C++ libraries. It performs the standard steps in a typical single-cell analysis including quality control, normalization, feature selection, dimensionality reduction, clustering and marker detection. This package is effectively a mirror of its counterparts in Javascript (scran.js) and R (scrapper), which are based on the same underlying C++ libraries and concepts.

Quick start

Let's fetch a dataset from the scrnaseq package:

import scrnaseq 
sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14", realize_assays=True)
print(sce)
## class: SingleCellExperiment
## dimensions: (20006, 3005)
## assays(1): ['counts']
## row_data columns(1): ['featureType']
## row_names(20006): ['Tspan12', 'Tshz1', 'Fnbp1l', ..., 'mt-Rnr2', 'mt-Rnr1', 'mt-Nd4l']
## column_data columns(9): ['tissue', 'group #', 'total mRNA mol', 'well', 'sex', 'age', 'diameter', 'level1class', 'level2class']
## column_names(3005): ['1772071015_C02', '1772071017_G12', '1772071017_A05', ..., '1772063068_D01', '1772066098_A12', '1772058148_F03']
## main_experiment_name: gene
## reduced_dims(0): []
## alternative_experiments(2): ['repeat', 'ERCC']
## row_pairs(0): []
## column_pairs(0): []
## metadata(0):

Then we call scranpy's analyze() functions, with some additional information about the mitochondrial subset for quality control purposes.

import scranpy
results = scranpy.analyze(
    sce,
    rna_subsets = {
        "mito": [name.startswith("mt-") for name in sce.get_row_names()]
    }
)

This will perform all of the usual steps for a routine single-cell analysis, as described in Bioconductor's Orchestrating single cell analysis book. It returns an object containing clusters, t-SNEs, UMAPs, marker genes, and so on:

print(results.clusters)
## [ 0  0  0 ...  6  6 13]

print(results.tsne)
## [[  6.24189264   6.12559936   5.41776875 ...  20.07822751  18.25022123
##    14.78338538]
##  [-28.82249121 -28.18510674 -28.92849565 ...   7.73694327   3.70750309
##     7.13103324]]

print(results.umap)
## [[ 9.84396648  9.73148155  9.83376408 ... -6.64551735 -5.74155378
##   -4.41887522]
##  [-1.26350224 -1.16540933 -1.13979638 ... -5.63315582 -4.83151293
##   -6.02484226]]

first_markers = results.rna_markers.to_biocframes(summaries=["median"])[0]
first_markers.set_row_names(results.rna_row_names, in_place=True)
print(first_markers)
## BiocFrame with 20006 rows and 6 columns
##                        mean            detected     cohens_d_median          auc_median   delta_mean_median delta_detected_median
##          <ndarray[float64]>  <ndarray[float64]>  <ndarray[float64]>  <ndarray[float64]>  <ndarray[float64]>    <ndarray[float64]>
## Tspan12 0.35759151503119846  0.3157894736842105 0.31138667468315545  0.5989624247185128 0.31138667468315545   0.31138667468315545
##   Tshz1  0.5997779968274406 0.41776315789473684 0.36865087228075244  0.6031352215283973 0.36865087228075244   0.36865087228075244
##  Fnbp1l  1.1660581606996154              0.6875  0.7644031115934953  0.6905056759545924  0.7644031115934953    0.7644031115934953
##                         ...                 ...                 ...                 ...                 ...                   ...
## mt-Rnr2   6.966227511583628  0.9967105263157895 -0.7666238430581961  0.2928277982073087 -0.7666238430581961   -0.7666238430581961
## mt-Rnr1   4.914541788016454  0.9769736842105263 -0.4847704371628273  0.3834708208344696 -0.4847704371628273   -0.4847704371628273
## mt-Nd4l  3.2901199968427246  0.9342105263157894 -0.5903983282435646 0.30724666142969365 -0.5903983282435646   -0.5903983282435646

Users can also convert the results into a SingleCellExperiment for easier manipulation:

print(results.to_singlecellexperiment())
## class: SingleCellExperiment
## dimensions: (20006, 2874)
## assays(2): ['filtered', 'normalized']
## row_data columns(5): ['mean', 'variance', 'fitted', 'residual', 'is_highly_variable']
## row_names(20006): ['Tspan12', 'Tshz1', 'Fnbp1l', ..., 'mt-Rnr2', 'mt-Rnr1', 'mt-Nd4l']
## column_data columns(5): ['sum', 'detected', 'subset_proportion_mito', 'size_factors', 'clusters']
## column_names(2874): ['1772071015_C02', '1772071017_G12', '1772071017_A05', ..., '1772066097_D04', '1772063068_D01', '1772066098_A12']
## main_experiment_name:
## reduced_dims(3): ['pca', 'tsne', 'umap']
## alternative_experiments(0): []
## row_pairs(0): []
## column_pairs(0): []
## metadata(0):

Check out the reference documentation for more details.

Multiple batches

To demonstrate, let's grab two pancreas datasets from the scrnaseq package. Each dataset represents a separate batch of cells generated in different studies.

import scrnaseq 
gsce = scrnaseq.fetch_dataset("grun-pancreas-2016", "2023-12-14", realize_assays=True)
msce = scrnaseq.fetch_dataset("muraro-pancreas-2016", "2023-12-19", realize_assays=True)

They don't have the same features, so we'll just take the intersection of their row names before combining them into a single SingleCellExperiment object:

import biocutils
common = biocutils.intersect(gsce.get_row_names(), msce.get_row_names())
combined = biocutils.relaxed_combine_columns(
    gsce[biocutils.match(common, gsce.get_row_names()), :],
    msce[biocutils.match(common, msce.get_row_names()), :]
)
print(combined)
## class: SingleCellExperiment
## dimensions: (18499, 4800)
## assays(1): ['counts']
## row_data columns(2): ['symbol', 'chr']
## row_names(18499): ['A1BG-AS1__chr19', 'A1BG__chr19', 'A1CF__chr10', ..., 'ZYX__chr7', 'ZZEF1__chr17', 'ZZZ3__chr1']
## column_data columns(4): ['donor', 'sample', 'label', 'plate']
## column_names(4800): ['D2ex_1', 'D2ex_2', 'D2ex_3', ..., 'D30-8_94', 'D30-8_95', 'D30-8_96']
## main_experiment_name: endogenous
## reduced_dims(0): []
## alternative_experiments(0): []
## row_pairs(0): []
## column_pairs(0): []
## metadata(0):

We can now perform a batch-aware analysis, where the blocking factor is also used in relevant functions to avoid problems with batch effects.

import scranpy
block = ["grun"] * gsce.shape[1] + ["muraro"] * msce.shape[1]
results = scranpy.analyze(combined, block=block) # no mitochondrial genes in this case...

This yields mostly the same set of results as before, but with an extra MNN-corrected embedding for clustering, visualization, etc.

results.mnn_corrected.corrected
## array([[-1.87690275e+01, -2.20133721e+01, -2.01364711e+01, ...,
##          1.60988874e+01, -2.10494187e+01, -9.41325421e+00],
##        [ 9.95069366e+00,  1.12168142e+01,  1.40745981e+01, ...,
##         -5.63689417e+00, -1.46003926e+01, -4.02325382e+00],
##        [ 1.17917046e+01,  8.40756681e+00,  1.24557851e+01, ...,
##          3.65281722e+00, -1.13280613e+01, -1.12939448e+01],
##        ...,
##        [-4.20177077e+00,  3.64443391e-01,  1.13834851e+00, ...,
##          1.43898885e-02, -2.24228270e+00, -5.89749453e-01],
##        [-2.49456306e+00,  6.82624452e-01,  2.30363317e+00, ...,
##          1.09145269e+00,  3.17776365e+00,  8.27058276e-01],
##        [-2.03562222e+00,  2.04701389e+00,  5.64774034e-01, ...,
##          4.31078606e-01, -4.02375136e-01,  8.52493315e-01]],
##       shape=(25, 3984))

Multiple modalities

Let's grab a 10X Genomics immune profiling dataset (see here), which contains count data for the entire transcriptome and targeted proteins:

import singlecellexperiment
sce = singlecellexperiment.read_tenx_h5("immune_3.0.0-tenx.h5", realize_assays=True)
sce.set_row_names(sce.get_row_data().get_column("id"), in_place=True)
## class: SingleCellExperiment
## dimensions: (33555, 8258)
## assays(1): ['counts']
## row_data columns(7): ['feature_type', 'genome', 'id', 'name', 'pattern', 'read', 'sequence']
## row_names(33555): ['ENSG00000243485', 'ENSG00000237613', 'ENSG00000186092', ..., 'IgG2b', 'CD127', 'CD15']
## column_data columns(1): ['barcodes']
## column_names(0):
## main_experiment_name:
## reduced_dims(0): []
## alternative_experiments(0): []
## row_pairs(0): []
## column_pairs(0): []
## metadata(0):

We split it to genes and ADTs:

feattypes = sce.get_row_data().get_column("feature_type")
gene_data = sce[[x == "Gene Expression" for x in feattypes],:]
adt_data = sce[[x == "Antibody Capture" for x in feattypes],:]

And now we can run the analysis:

import scranpy
results = scranpy.analyze(
    gene_data,
    adt_x = adt_data, 
    rna_subsets = { 
        "mito": [n.startswith("MT-") for n in gene_data.get_row_data().get_column("name")]
    },
    adt_subsets = {
        "igg": [n.startswith("IgG") for n in adt_data.get_row_data().get_column("name")]
    }
)

This returns ADT-specific results in the relevant fields, as well as a set of combined PCs for use in clustering, visualization, etc.

print(results.adt_size_factors)
## [0.79359408 0.79410332 0.89536413 ... 0.79207839 0.66492723 0.76847637]

print(results.combined_pca.combined)
## [[-9.97603155e+00 -1.04045057e+01 -1.26408576e+01 ... -1.29361354e+01
##   -1.09887392e+01 -1.08070608e+01]
##  [ 7.47726554e+00  6.77629373e+00  1.78091509e+00 ...  2.22256539e+00
##    5.96667219e+00  7.10437993e+00]
##  [-2.63898263e+00 -1.24485522e+00  5.51002546e+00 ...  5.21037673e+00
##   -5.54233035e+00 -3.38828724e+00]
##  ...
##  [-2.04699441e-01 -4.38991650e-01 -2.87170731e+00 ...  2.36527395e+00
##    7.05969255e-01 -2.46180209e-01]
##  [ 4.75688909e-01 -1.54557081e-01 -1.30053159e+00 ...  2.81492567e+00
##    1.21607502e+00 -3.12194853e-01]
##  [ 8.56575012e-02  8.74924626e-03 -7.17362957e-04 ...  1.65769854e-01
##    1.73927253e-01  5.04057044e-02]]

second_markers = results.adt_markers.to_biocframes(summaries=["min_rank"])[1]
second_markers.set_row_names(results.adt_row_names, in_place=True)
print(second_markers)
## BiocFrame with 17 rows and 6 columns
##                      mean           detected cohens_d_min_rank      auc_min_rank delta_mean_min_rank delta_detected_min_rank
##        <ndarray[float64]> <ndarray[float64]> <ndarray[uint32]> <ndarray[uint32]>   <ndarray[uint32]>       <ndarray[uint32]>
##    CD3  11.04397358391642                1.0                 1                 1                   1                       1
##   CD19  4.072383130863625                1.0                 4                 4                   4                       4
## CD45RA 10.481785289114054                1.0                 1                 1                   1                       1
##                       ...                ...               ...               ...                 ...                     ...
##  IgG2b 2.8690172565558263 0.9911190053285968                 6                 5                   6                       6
##  CD127  6.258223461814724                1.0                 2                 2                   2                       2
##   CD15  5.366264191077669                1.0                 4                 4                   4                       4

Customizing the analysis

Most parameters can be changed by modifying the relevant arguments in analyze(). For example:

import scrnaseq 
sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14", realize_assays=True)
is_mito = [name.startswith("mt-") for name in sce.get_row_names()]

import scranpy
results = scranpy.analyze(
    sce,
    rna_subsets = {
        "mito": is_mito
    },
    build_snn_graph_options = {
        "num_neighbors": 10
    },
    cluster_graph_options = {
        "multilevel_resolution": 2
    },
    run_pca_options = {
        "number": 15
    },
    run_tsne_options = {
        "perplexity": 25
    },
    run_umap_options = {
        "min_dist": 0.05
    }
)

For finer control, users can call each step individually via lower-level functions. A typical RNA analysis might be implemented as:

counts = sce.assay(0)
qcmetrics = scranpy.compute_rna_qc_metrics(counts, subsets=is_mito)
thresholds = scranpy.suggest_rna_qc_thresholds(qcmetrics)
filter = scranpy.filter_rna_qc_metrics(thresholds, metrics)

import delayedarray # avoid an actual copy of the matrix.
filtered = delayedarray.DelayedArray(rna_x)[:,filter]

sf = scranpy.center_size_factors(qcmetrics.sum[filter])
normalized = scranpy.normalize_counts(filtered, sf)

vardf = scranpy.model_gene_variances(normalized)
hvgs = scranpy.choose_highly_variable_genes(vardf.residual)
pca = scranpy.run_pca(normalized[hvgs,:])

nn_out = scranpy.run_all_neighbor_steps(pca.components)
clusters = nn_out.cluster_graph.membership
markers = scranpy.score_markers(normalized, groups=clusters)

Check out analyze.py for more details.

Name		Name	Last commit message	Last commit date
Latest commit History 121 Commits
.github/workflows		.github/workflows
docs		docs
lib		lib
src/scranpy		src/scranpy
tests		tests
.coveragerc		.coveragerc
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
.readthedocs.yml		.readthedocs.yml
AUTHORS.md		AUTHORS.md
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE.txt		LICENSE.txt
README.md		README.md
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg
setup.py		setup.py
tox.ini		tox.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

scran, in Python

Overview

Quick start

Multiple batches

Multiple modalities

Customizing the analysis

About

Releases

Packages

Contributors 3

Languages

License

libscran/scranpy

Folders and files

Latest commit

History

Repository files navigation

scran, in Python

Overview

Quick start

Multiple batches

Multiple modalities

Customizing the analysis

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages