Skip to content

libscran/scranpy

Repository files navigation

Project generated with PyScaffold PyPI-Server Downloads Unit tests

scran, in Python

Overview

The scranpy package provides Python bindings to the single-cell analysis methods in the libscran C++ libraries. It performs the standard steps in a typical single-cell analysis including quality control, normalization, feature selection, dimensionality reduction, clustering and marker detection. This package is effectively a mirror of its counterparts in Javascript (scran.js) and R (scrapper), which are based on the same underlying C++ libraries and concepts.

Quick start

Let's fetch a dataset from the scrnaseq package:

import scrnaseq 
sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14", realize_assays=True)
print(sce)
## class: SingleCellExperiment
## dimensions: (20006, 3005)
## assays(1): ['counts']
## row_data columns(1): ['featureType']
## row_names(20006): ['Tspan12', 'Tshz1', 'Fnbp1l', ..., 'mt-Rnr2', 'mt-Rnr1', 'mt-Nd4l']
## column_data columns(9): ['tissue', 'group #', 'total mRNA mol', 'well', 'sex', 'age', 'diameter', 'level1class', 'level2class']
## column_names(3005): ['1772071015_C02', '1772071017_G12', '1772071017_A05', ..., '1772063068_D01', '1772066098_A12', '1772058148_F03']
## main_experiment_name: gene
## reduced_dims(0): []
## alternative_experiments(2): ['repeat', 'ERCC']
## row_pairs(0): []
## column_pairs(0): []
## metadata(0):

Then we call scranpy's analyze() functions, with some additional information about the mitochondrial subset for quality control purposes.

import scranpy
results = scranpy.analyze(
    sce,
    rna_subsets = {
        "mito": [name.startswith("mt-") for name in sce.get_row_names()]
    }
)

This will perform all of the usual steps for a routine single-cell analysis, as described in Bioconductor's Orchestrating single cell analysis book. It returns an object containing clusters, t-SNEs, UMAPs, marker genes, and so on:

print(results.clusters)
## [ 0  0  0 ...  6  6 13]

print(results.tsne)
## [[  6.24189264   6.12559936   5.41776875 ...  20.07822751  18.25022123
##    14.78338538]
##  [-28.82249121 -28.18510674 -28.92849565 ...   7.73694327   3.70750309
##     7.13103324]]

print(results.umap)
## [[ 9.84396648  9.73148155  9.83376408 ... -6.64551735 -5.74155378
##   -4.41887522]
##  [-1.26350224 -1.16540933 -1.13979638 ... -5.63315582 -4.83151293
##   -6.02484226]]

first_markers = results.rna_markers.to_biocframes(summaries=["median"])[0]
first_markers.set_row_names(results.rna_row_names, in_place=True)
print(first_markers)
## BiocFrame with 20006 rows and 6 columns
##                        mean            detected     cohens_d_median          auc_median   delta_mean_median delta_detected_median
##          <ndarray[float64]>  <ndarray[float64]>  <ndarray[float64]>  <ndarray[float64]>  <ndarray[float64]>    <ndarray[float64]>
## Tspan12 0.35759151503119846  0.3157894736842105 0.31138667468315545  0.5989624247185128 0.31138667468315545   0.31138667468315545
##   Tshz1  0.5997779968274406 0.41776315789473684 0.36865087228075244  0.6031352215283973 0.36865087228075244   0.36865087228075244
##  Fnbp1l  1.1660581606996154              0.6875  0.7644031115934953  0.6905056759545924  0.7644031115934953    0.7644031115934953
##                         ...                 ...                 ...                 ...                 ...                   ...
## mt-Rnr2   6.966227511583628  0.9967105263157895 -0.7666238430581961  0.2928277982073087 -0.7666238430581961   -0.7666238430581961
## mt-Rnr1   4.914541788016454  0.9769736842105263 -0.4847704371628273  0.3834708208344696 -0.4847704371628273   -0.4847704371628273
## mt-Nd4l  3.2901199968427246  0.9342105263157894 -0.5903983282435646 0.30724666142969365 -0.5903983282435646   -0.5903983282435646

Users can also convert the results into a SingleCellExperiment for easier manipulation:

print(results.to_singlecellexperiment())
## class: SingleCellExperiment
## dimensions: (20006, 2874)
## assays(2): ['filtered', 'normalized']
## row_data columns(5): ['mean', 'variance', 'fitted', 'residual', 'is_highly_variable']
## row_names(20006): ['Tspan12', 'Tshz1', 'Fnbp1l', ..., 'mt-Rnr2', 'mt-Rnr1', 'mt-Nd4l']
## column_data columns(5): ['sum', 'detected', 'subset_proportion_mito', 'size_factors', 'clusters']
## column_names(2874): ['1772071015_C02', '1772071017_G12', '1772071017_A05', ..., '1772066097_D04', '1772063068_D01', '1772066098_A12']
## main_experiment_name:
## reduced_dims(3): ['pca', 'tsne', 'umap']
## alternative_experiments(0): []
## row_pairs(0): []
## column_pairs(0): []
## metadata(0):

Check out the reference documentation for more details.

Multiple batches

To demonstrate, let's grab two pancreas datasets from the scrnaseq package. Each dataset represents a separate batch of cells generated in different studies.

import scrnaseq 
gsce = scrnaseq.fetch_dataset("grun-pancreas-2016", "2023-12-14", realize_assays=True)
msce = scrnaseq.fetch_dataset("muraro-pancreas-2016", "2023-12-19", realize_assays=True)

They don't have the same features, so we'll just take the intersection of their row names before combining them into a single SingleCellExperiment object:

import biocutils
common = biocutils.intersect(gsce.get_row_names(), msce.get_row_names())
combined = biocutils.relaxed_combine_columns(
    gsce[biocutils.match(common, gsce.get_row_names()), :],
    msce[biocutils.match(common, msce.get_row_names()), :]
)
print(combined)
## class: SingleCellExperiment
## dimensions: (18499, 4800)
## assays(1): ['counts']
## row_data columns(2): ['symbol', 'chr']
## row_names(18499): ['A1BG-AS1__chr19', 'A1BG__chr19', 'A1CF__chr10', ..., 'ZYX__chr7', 'ZZEF1__chr17', 'ZZZ3__chr1']
## column_data columns(4): ['donor', 'sample', 'label', 'plate']
## column_names(4800): ['D2ex_1', 'D2ex_2', 'D2ex_3', ..., 'D30-8_94', 'D30-8_95', 'D30-8_96']
## main_experiment_name: endogenous
## reduced_dims(0): []
## alternative_experiments(0): []
## row_pairs(0): []
## column_pairs(0): []
## metadata(0):

We can now perform a batch-aware analysis, where the blocking factor is also used in relevant functions to avoid problems with batch effects.

import scranpy
block = ["grun"] * gsce.shape[1] + ["muraro"] * msce.shape[1]
results = scranpy.analyze(combined, block=block) # no mitochondrial genes in this case...

This yields mostly the same set of results as before, but with an extra MNN-corrected embedding for clustering, visualization, etc.

results.mnn_corrected.corrected
## array([[-1.87690275e+01, -2.20133721e+01, -2.01364711e+01, ...,
##          1.60988874e+01, -2.10494187e+01, -9.41325421e+00],
##        [ 9.95069366e+00,  1.12168142e+01,  1.40745981e+01, ...,
##         -5.63689417e+00, -1.46003926e+01, -4.02325382e+00],
##        [ 1.17917046e+01,  8.40756681e+00,  1.24557851e+01, ...,
##          3.65281722e+00, -1.13280613e+01, -1.12939448e+01],
##        ...,
##        [-4.20177077e+00,  3.64443391e-01,  1.13834851e+00, ...,
##          1.43898885e-02, -2.24228270e+00, -5.89749453e-01],
##        [-2.49456306e+00,  6.82624452e-01,  2.30363317e+00, ...,
##          1.09145269e+00,  3.17776365e+00,  8.27058276e-01],
##        [-2.03562222e+00,  2.04701389e+00,  5.64774034e-01, ...,
##          4.31078606e-01, -4.02375136e-01,  8.52493315e-01]],
##       shape=(25, 3984))

Multiple modalities

Let's grab a 10X Genomics immune profiling dataset (see here), which contains count data for the entire transcriptome and targeted proteins:

import singlecellexperiment
sce = singlecellexperiment.read_tenx_h5("immune_3.0.0-tenx.h5", realize_assays=True)
sce.set_row_names(sce.get_row_data().get_column("id"), in_place=True)
## class: SingleCellExperiment
## dimensions: (33555, 8258)
## assays(1): ['counts']
## row_data columns(7): ['feature_type', 'genome', 'id', 'name', 'pattern', 'read', 'sequence']
## row_names(33555): ['ENSG00000243485', 'ENSG00000237613', 'ENSG00000186092', ..., 'IgG2b', 'CD127', 'CD15']
## column_data columns(1): ['barcodes']
## column_names(0):
## main_experiment_name:
## reduced_dims(0): []
## alternative_experiments(0): []
## row_pairs(0): []
## column_pairs(0): []
## metadata(0):

We split it to genes and ADTs:

feattypes = sce.get_row_data().get_column("feature_type")
gene_data = sce[[x == "Gene Expression" for x in feattypes],:]
adt_data = sce[[x == "Antibody Capture" for x in feattypes],:]

And now we can run the analysis:

import scranpy
results = scranpy.analyze(
    gene_data,
    adt_x = adt_data, 
    rna_subsets = { 
        "mito": [n.startswith("MT-") for n in gene_data.get_row_data().get_column("name")]
    },
    adt_subsets = {
        "igg": [n.startswith("IgG") for n in adt_data.get_row_data().get_column("name")]
    }
)

This returns ADT-specific results in the relevant fields, as well as a set of combined PCs for use in clustering, visualization, etc.

print(results.adt_size_factors)
## [0.79359408 0.79410332 0.89536413 ... 0.79207839 0.66492723 0.76847637]

print(results.combined_pca.combined)
## [[-9.97603155e+00 -1.04045057e+01 -1.26408576e+01 ... -1.29361354e+01
##   -1.09887392e+01 -1.08070608e+01]
##  [ 7.47726554e+00  6.77629373e+00  1.78091509e+00 ...  2.22256539e+00
##    5.96667219e+00  7.10437993e+00]
##  [-2.63898263e+00 -1.24485522e+00  5.51002546e+00 ...  5.21037673e+00
##   -5.54233035e+00 -3.38828724e+00]
##  ...
##  [-2.04699441e-01 -4.38991650e-01 -2.87170731e+00 ...  2.36527395e+00
##    7.05969255e-01 -2.46180209e-01]
##  [ 4.75688909e-01 -1.54557081e-01 -1.30053159e+00 ...  2.81492567e+00
##    1.21607502e+00 -3.12194853e-01]
##  [ 8.56575012e-02  8.74924626e-03 -7.17362957e-04 ...  1.65769854e-01
##    1.73927253e-01  5.04057044e-02]]

second_markers = results.adt_markers.to_biocframes(summaries=["min_rank"])[1]
second_markers.set_row_names(results.adt_row_names, in_place=True)
print(second_markers)
## BiocFrame with 17 rows and 6 columns
##                      mean           detected cohens_d_min_rank      auc_min_rank delta_mean_min_rank delta_detected_min_rank
##        <ndarray[float64]> <ndarray[float64]> <ndarray[uint32]> <ndarray[uint32]>   <ndarray[uint32]>       <ndarray[uint32]>
##    CD3  11.04397358391642                1.0                 1                 1                   1                       1
##   CD19  4.072383130863625                1.0                 4                 4                   4                       4
## CD45RA 10.481785289114054                1.0                 1                 1                   1                       1
##                       ...                ...               ...               ...                 ...                     ...
##  IgG2b 2.8690172565558263 0.9911190053285968                 6                 5                   6                       6
##  CD127  6.258223461814724                1.0                 2                 2                   2                       2
##   CD15  5.366264191077669                1.0                 4                 4                   4                       4

Customizing the analysis

Most parameters can be changed by modifying the relevant arguments in analyze(). For example:

import scrnaseq 
sce = scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14", realize_assays=True)
is_mito = [name.startswith("mt-") for name in sce.get_row_names()]

import scranpy
results = scranpy.analyze(
    sce,
    rna_subsets = {
        "mito": is_mito
    },
    build_snn_graph_options = {
        "num_neighbors": 10
    },
    cluster_graph_options = {
        "multilevel_resolution": 2
    },
    run_pca_options = {
        "number": 15
    },
    run_tsne_options = {
        "perplexity": 25
    },
    run_umap_options = {
        "min_dist": 0.05
    }
)

For finer control, users can call each step individually via lower-level functions. A typical RNA analysis might be implemented as:

counts = sce.assay(0)
qcmetrics = scranpy.compute_rna_qc_metrics(counts, subsets=is_mito)
thresholds = scranpy.suggest_rna_qc_thresholds(qcmetrics)
filter = scranpy.filter_rna_qc_metrics(thresholds, metrics)

import delayedarray # avoid an actual copy of the matrix.
filtered = delayedarray.DelayedArray(rna_x)[:,filter]

sf = scranpy.center_size_factors(qcmetrics.sum[filter])
normalized = scranpy.normalize_counts(filtered, sf)

vardf = scranpy.model_gene_variances(normalized)
hvgs = scranpy.choose_highly_variable_genes(vardf.residual)
pca = scranpy.run_pca(normalized[hvgs,:])

nn_out = scranpy.run_all_neighbor_steps(pca.components)
clusters = nn_out.cluster_graph.membership
markers = scranpy.score_markers(normalized, groups=clusters)

Check out analyze.py for more details.