Speed issue with SingleR v2.8.0 in BioC 3.20 #292

ycl6 · 2025-01-03T12:41:44Z

Hi,

I am testing my single-cell workflow using the new R-4.4 (v4.4.2) these couple of days and noticed SingleR (v2.8.0) is running a lot slower compared with my other environment that has R-4.3 and SingleR v2.4.1.

From the commit history, I don't see what might have caused the drop in speed between the two versions.
I am wondering if anyone has any insight into this and how I can run the new version of SingleR as fast as before (is there any parameter I can tweak)?

Below is how I run SingleR and my sce has 9,188 genes and 16,601 cells.

library(SingleR)

bpen <- celldex::BlueprintEncodeData()

sce <- readRDS("sce.rds")

system.time({
x <- SingleR(test = sce, assay.type.test = "logcounts", de.method = "classic", 
             ref = list(BP = bpen), labels = list(bpen$label.main), num.threads = 2)
})

table(x$labels)

With R-4.3 and SingleR v2.4.1, I have

   user  system elapsed 
192.798  29.163 199.104 

     B-cells CD4+ T-cells CD8+ T-cells  Eosinophils Erythrocytes          HSC    Monocytes 
        1503         7341         3475            6            1           10         3104 
    NK cells 
        1161

and with R-4.4 and SingleR v2.8.0, I have

   user  system elapsed 
384.427   8.220 283.556 

     B-cells CD4+ T-cells CD8+ T-cells Erythrocytes          HSC    Monocytes     NK cells 
        1506         7336         3500            1           10         3107         1141

When I change num.threads to 4 (no speed gain in both, new version still runs slower):

With R-4.3 and SingleR v2.4.1, I have

   user  system elapsed 
187.586  22.508 187.168

and with R-4.4 and SingleR v2.8.0, I have

   user  system elapsed 
380.693   8.094 280.565

p.s. The predicted cell type labels are a little different, but I am guessing its due to updates with the celldex package. Although, I did notice the assays structure of the reference I am using (BlueprintEncodeData) is a little different.

This is celldex_1.12.0 (loaded and attached):

> str(bpen)
Formal class 'SummarizedExperiment' [package "SummarizedExperiment"] with 5 slots
  ..@ colData        :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : chr [1:259] "mature.neutrophil" "CD14.positive..CD16.negative.classical.monocyte" "mature.neutrophil.1" "megakaryocyte.erythroid.progenitor.cell" ...
  .. .. ..@ nrows          : int 259
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  .. .. ..@ listData       :List of 3
  .. .. .. ..$ label.main: chr [1:259] "Neutrophils" "Monocytes" "Neutrophils" "HSC" ...
  .. .. .. ..$ label.fine: chr [1:259] "Neutrophils" "Monocytes" "Neutrophils" "MEP" ...
  .. .. .. ..$ label.ont : chr [1:259] "CL:0000775" "CL:0000576" "CL:0000775" "CL:0000050" ...
  ..@ assays         :Formal class 'SimpleAssays' [package "SummarizedExperiment"] with 1 slot
  .. .. ..@ data:Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
  .. .. .. .. ..@ listData       :List of 1
  .. .. .. .. .. ..$ logcounts: num [1:19859, 1:259] 0 0 4.24 2.59 0 ...
  .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. .. .. ..$ : chr [1:19859] "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
  .. .. .. .. .. .. .. ..$ : chr [1:259] "mature.neutrophil" "CD14.positive..CD16.negative.classical.monocyte" "mature.neutrophil.1" "megakaryocyte.erythroid.progenitor.cell" ...
  .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  ..@ NAMES          : chr [1:19859] "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
  ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : NULL
  .. .. ..@ nrows          : int 19859
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  .. .. ..@ listData       : Named list()
  ..@ metadata       : list()

And this is celldex_1.16.0 (loaded, not attached):

> str(bpen)
Formal class 'SummarizedExperiment' [package "SummarizedExperiment"] with 5 slots
  ..@ colData        :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : chr [1:259] "mature.neutrophil" "CD14.positive..CD16.negative.classical.monocyte" "mature.neutrophil.1" "megakaryocyte.erythroid.progenitor.cell" ...
  .. .. ..@ nrows          : int 259
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  .. .. ..@ listData       :List of 3
  .. .. .. ..$ label.main: chr [1:259] "Neutrophils" "Monocytes" "Neutrophils" "HSC" ...
  .. .. .. ..$ label.fine: chr [1:259] "Neutrophils" "Monocytes" "Neutrophils" "MEP" ...
  .. .. .. ..$ label.ont : chr [1:259] "CL:0000775" "CL:0000576" "CL:0000775" "CL:0000050" ...
  ..@ assays         :Formal class 'SimpleAssays' [package "SummarizedExperiment"] with 1 slot
  .. .. ..@ data:Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
  .. .. .. .. ..@ listData       :List of 1
  .. .. .. .. .. ..$ logcounts: num [1:19859, 1:259] 0 0 4.24 2.59 0 ...
  .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  ..@ NAMES          : chr [1:19859] "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
  ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : NULL
  .. .. ..@ nrows          : int 19859
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  .. .. ..@ listData       : Named list()
  ..@ metadata       : list()

R v4.3.3 and v4.4.2 session info

> sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /home/ihsuan/miniconda3/envs/r4.3/lib/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8       
 [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] HDF5Array_1.30.1            rhdf5_2.46.1                DelayedArray_0.28.0        
 [4] SparseArray_1.2.4           S4Arrays_1.2.1              abind_1.4-8                
 [7] Matrix_1.6-5                celldex_1.12.0              SingleR_2.4.1              
[10] SummarizedExperiment_1.32.0 Biobase_2.62.0              GenomicRanges_1.54.1       
[13] GenomeInfoDb_1.38.8         IRanges_2.36.0              S4Vectors_0.40.2           
[16] BiocGenerics_0.48.1         MatrixGenerics_1.14.0       matrixStats_1.4.1          

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1              dplyr_1.1.4                   blob_1.2.4                   
 [4] filelock_1.0.3                Biostrings_2.70.3             bitops_1.0-9                 
 [7] SingleCellExperiment_1.24.0   fastmap_1.2.0                 RCurl_1.98-1.16              
[10] BiocFileCache_2.10.2          promises_1.3.2                rsvd_1.0.5                   
[13] digest_0.6.37                 mime_0.12                     lifecycle_1.0.4              
[16] KEGGREST_1.42.0               interactiveDisplayBase_1.40.0 RSQLite_2.3.7                
[19] magrittr_2.0.3                compiler_4.3.3                rlang_1.1.4                  
[22] tools_4.3.3                   utf8_1.2.4                    yaml_2.3.10                  
[25] bit_4.5.0.1                   curl_6.0.1                    BiocParallel_1.36.0          
[28] withr_3.0.2                   purrr_1.0.2                   grid_4.3.3                   
[31] fansi_1.0.6                   ExperimentHub_2.10.0          beachmat_2.18.1              
[34] xtable_1.8-4                  Rhdf5lib_1.24.2               cli_3.6.3                    
[37] crayon_1.5.3                  generics_0.1.3                httr_1.4.7                   
[40] DelayedMatrixStats_1.24.0     scuttle_1.12.0                DBI_1.2.3                    
[43] cachem_1.1.0                  zlibbioc_1.48.2               parallel_4.3.3               
[46] AnnotationDbi_1.64.1          BiocManager_1.30.25           XVector_0.42.0               
[49] vctrs_0.6.5                   BiocSingular_1.18.0           bit64_4.5.2                  
[52] irlba_2.3.5.1                 glue_1.8.0                    codetools_0.2-20             
[55] BiocVersion_3.18.1            later_1.3.2                   ScaledMatrix_1.10.0          
[58] tibble_3.2.1                  pillar_1.9.0                  rhdf5filters_1.14.1          
[61] rappdirs_0.3.3                htmltools_0.5.8.1             GenomeInfoDbData_1.2.11      
[64] R6_2.5.1                      dbplyr_2.5.0                  sparseMatrixStats_1.14.0     
[67] shiny_1.10.0                  lattice_0.22-6                AnnotationHub_3.10.1         
[70] png_0.1-8                     memoise_2.0.1                 httpuv_1.6.15                
[73] Rcpp_1.0.13                   pkgconfig_2.0.3              

> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /home/ihsuan/miniconda3/envs/r4.4/lib/libopenblasp-r0.3.28.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8       
 [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] SingleR_2.8.0               SummarizedExperiment_1.36.0 Biobase_2.66.0             
 [4] GenomicRanges_1.58.0        GenomeInfoDb_1.42.1         IRanges_2.40.1             
 [7] S4Vectors_0.44.0            BiocGenerics_0.52.0         MatrixGenerics_1.18.0      
[10] matrixStats_1.4.1          

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1            alabaster.se_1.6.0          dplyr_1.1.4                
 [4] blob_1.2.4                  filelock_1.0.3              Biostrings_2.74.1          
 [7] SingleCellExperiment_1.28.1 fastmap_1.2.0               BiocFileCache_2.14.0       
[10] rsvd_1.0.5                  lifecycle_1.0.4             alabaster.matrix_1.6.1     
[13] KEGGREST_1.46.0             alabaster.base_1.6.1        RSQLite_2.3.9              
[16] magrittr_2.0.3              compiler_4.4.2              rlang_1.1.4                
[19] tools_4.4.2                 yaml_2.3.10                 S4Arrays_1.6.0             
[22] bit_4.5.0.1                 curl_6.0.1                  DelayedArray_0.32.0        
[25] abind_1.4-8                 BiocParallel_1.40.0         HDF5Array_1.34.0           
[28] gypsum_1.2.0                grid_4.4.2                  ExperimentHub_2.14.0       
[31] beachmat_2.22.0             Rhdf5lib_1.28.0             cli_3.6.3                  
[34] crayon_1.5.3                generics_0.1.3              httr_1.4.7                 
[37] DelayedMatrixStats_1.28.0   scuttle_1.16.0              DBI_1.2.3                  
[40] cachem_1.1.0                rhdf5_2.50.1                zlibbioc_1.52.0            
[43] parallel_4.4.2              AnnotationDbi_1.68.0        BiocManager_1.30.25        
[46] XVector_0.46.0              alabaster.schemas_1.6.0     vctrs_0.6.5                
[49] Matrix_1.7-1                jsonlite_1.8.9              BiocSingular_1.22.0        
[52] BiocNeighbors_2.0.1         bit64_4.5.2                 irlba_2.3.5.1              
[55] alabaster.ranges_1.6.0      glue_1.8.0                  codetools_0.2-20           
[58] BiocVersion_3.20.0          UCSC.utils_1.2.0            ScaledMatrix_1.14.0        
[61] tibble_3.2.1                pillar_1.10.0               rappdirs_0.3.3             
[64] rhdf5filters_1.18.0         GenomeInfoDbData_1.2.13     R6_2.5.1                   
[67] dbplyr_2.5.0                httr2_1.0.7                 sparseMatrixStats_1.18.0   
[70] lattice_0.22-6              AnnotationHub_3.14.0        png_0.1-8                  
[73] celldex_1.16.0              memoise_2.0.1               Rcpp_1.0.13-1              
[76] SparseArray_1.6.0           pkgconfig_2.0.3

The text was updated successfully, but these errors were encountered:

ycl6 · 2025-01-03T19:34:52Z

OK, so I decided to divide the SingleR function into sections and time each to find out which step might have contributed to the slower speed.

I am using the same sce as before. There is a change in how the test and ref are prepped in the two versions and I have adapted accordingly.

As you can see below, the new v2.8.0 is faster in this step. The trainSingleR() step takes less than a second to complete in both versions. However, classifySingleR() is where the new v2.8.0 loses out in terms of speed.

I'll dig a little into it and see what I can find.

library(BiocParallel)
library(DelayedArray)
library(SingleR)

bpen <- celldex::BlueprintEncodeData()
sce <- readRDS("sce.rds")

For R 4.3, v2.4.1

prep.input <- function (test, ref,
    assay.type.test = "logcounts", assay.type.ref = "logcounts",
    check.missing = TRUE, num.threads = bpnworkers(BPPARAM),
    BNPARAM = NULL, BPPARAM = SerialParam()) {
    if (!bpisup(BPPARAM) && !is(BPPARAM, "MulticoreParam")) {
        bpstart(BPPARAM)
        on.exit(bpstop(BPPARAM))
    }
    test <- SingleR:::.to_clean_matrix(test, assay.type.test, check.missing,
        msg = "test", BPPARAM = BPPARAM)
    if (single.ref <- !SingleR:::.is_list(ref)) {
        ref <- list(ref)
    }
    ref <- lapply(ref, FUN = SingleR:::.to_clean_matrix, assay.type = assay.type.ref,
        check.missing = check.missing, msg = "ref", BPPARAM = BPPARAM)
    refnames <- Reduce(intersect, lapply(ref, rownames))
    keep <- intersect(rownames(test), refnames)
    if (length(keep) == 0) {
        stop("no common genes between 'test' and 'ref'")
    }
    if (!identical(keep, rownames(test))) {
        test <- test[keep, ]
    }
    for (i in seq_along(ref)) {
        if (!identical(keep, rownames(ref[[i]]))) {
            ref[[i]] <- ref[[i]][keep, , drop = FALSE]
        }
    }
    if (single.ref) {
        ref <- ref[[1]]
    }
    return(list(ref, test))
}

system.time({
    x <- prep.input(test = sce, ref = list(BP = bpen), assay.type.test = "logcounts", num.threads = 2)
})

   user  system elapsed 
116.086  26.208 142.294

system.time({
    trained <- trainSingleR(x[[1]], labels = list(bpen$label.main), genes = "de", sd.thresh = 1,
        de.method = "classic", de.n = NULL,, de.args = list(),
        aggr.ref = FALSE, aggr.args = list(), recompute = TRUE,
        restrict = NULL, check.missing = TRUE,
        BNPARAM = NULL, num.threads = 2, BPPARAM = SerialParam())
})

   user  system elapsed 
  0.492   0.000   0.399

system.time({
    class <- classifySingleR(x[[2]], trained, quantile = 0.8, fine.tune = TRUE,
        tune.thresh = 0.05, prune = TRUE, check.missing = FALSE,
        num.threads = 2, BPPARAM = SerialParam())
})

   user  system elapsed 
 81.354   3.867  60.825

For R 4.4, v2.8.0

prep.input <- function (test, ref,
    assay.type.test = "logcounts", assay.type.ref = "logcounts",
    check.missing = TRUE, num.threads = bpnworkers(BPPARAM),
    BNPARAM = NULL, BPPARAM = SerialParam()) {
    if (!bpisup(BPPARAM) && !is(BPPARAM, "MulticoreParam")) {
        bpstart(BPPARAM)
        on.exit(bpstop(BPPARAM))
    }
    test <- SingleR:::.to_clean_matrix(test, assay.type.test, check.missing,
        msg = "test", BPPARAM = BPPARAM)
    tmp.ref <- ref
    if (!is.list(tmp.ref) || is.data.frame(tmp.ref)) {
        tmp.ref <- list(ref)
    }
    for (rr in tmp.ref) {
        keep <- rownames(test) %in% rownames(rr)
        if (!all(keep)) {
            test <- DelayedArray(test)[keep, , drop = FALSE]
        }
    }
    if (nrow(test) == 0) {
        stop("no common genes between 'test' and 'ref")
    }
    return(list(ref, test))
}

system.time({
    x <- prep.input(test = sce, ref = list(BP = bpen), assay.type.test = "logcounts", num.threads = 2)
})

   user  system elapsed 
 65.575   1.644  67.212

system.time({
    trained <- trainSingleR(x[[1]], labels = list(bpen$label.main), genes = "de", sd.thresh = 1,
        de.method = "classic", de.n = NULL,, de.args = list(),
        aggr.ref = FALSE, aggr.args = list(), recompute = TRUE,
        restrict = NULL, test.genes = rownames(x[[2]]), check.missing = TRUE,
        BNPARAM = NULL, num.threads = 2, BPPARAM = SerialParam())
})

   user  system elapsed 
  0.630   0.016   0.555

system.time({
    class <- classifySingleR(x[[2]], trained, quantile = 0.8, fine.tune = TRUE,
        tune.thresh = 0.05, prune = TRUE, check.missing = FALSE,
        num.threads = 2, BPPARAM = SerialParam())
})

   user  system elapsed 
316.810   5.407 213.873

ycl6 · 2025-01-03T21:04:23Z

OK, I've looked into the classifySingleR function and tested the time. As far as I can tell, the .classify_internals function and the C++ code (classify_single.cpp) associated with commit 74a6d59 is what drives the increase in time consumption, the rest make no difference. I don't know much about C language, so this is as far as I can do diagnosing this issue.

LTLA · 2025-01-03T22:40:32Z

For me:

R 4.4, BioC 3.20

library(SingleR)

library(celldex)
bp <- BlueprintEncodeData()

# Using Baron as it's human and it is large enough for some decent testing (~8k cells).
library(scRNAseq)
sce <- BaronPancreasData('human')

system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test=1))
##    user  system elapsed
##  29.179   0.081  29.259
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test=1, num.threads=2))
##    user  system elapsed
##  30.008   0.141  15.756
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test=1, num.threads=4))
##    user  system elapsed
##  30.981   0.051   8.795

Session information

R version 4.4.2 Patched (2024-11-18 r87348)
Platform: aarch64-apple-darwin22.6.0
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Users/luna/Software/R/R-4-4-branch/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-4-4-branch/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] scRNAseq_2.20.0             SingleCellExperiment_1.28.1
 [3] celldex_1.16.0              SingleR_2.8.0
 [5] SummarizedExperiment_1.36.0 Biobase_2.66.0
 [7] GenomicRanges_1.58.0        GenomeInfoDb_1.42.1
 [9] IRanges_2.40.1              S4Vectors_0.44.0
[11] BiocGenerics_0.52.0         MatrixGenerics_1.18.0
[13] matrixStats_1.4.1

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                 bitops_1.0-9
 [3] httr2_1.0.7               rlang_1.1.4
 [5] magrittr_2.0.3            gypsum_1.2.0
 [7] compiler_4.4.2            RSQLite_2.3.9
 [9] GenomicFeatures_1.58.0    DelayedMatrixStats_1.28.0
[11] png_0.1-8                 vctrs_0.6.5
[13] ProtGenerics_1.38.0       pkgconfig_2.0.3
[15] crayon_1.5.3              fastmap_1.2.0
[17] dbplyr_2.5.0              XVector_0.46.0
[19] Rsamtools_2.22.0          UCSC.utils_1.2.0
[21] bit_4.5.0.1               zlibbioc_1.52.0
[23] cachem_1.1.0              beachmat_2.22.0
[25] jsonlite_1.8.9            blob_1.2.4
[27] rhdf5filters_1.18.0       DelayedArray_0.32.0
[29] Rhdf5lib_1.28.0           BiocParallel_1.40.0
[31] irlba_2.3.5.1             parallel_4.4.2
[33] R6_2.5.1                  rtracklayer_1.66.0
[35] Rcpp_1.0.13-1             Matrix_1.7-1
[37] tidyselect_1.2.1          abind_1.4-8
[39] yaml_2.3.10               codetools_0.2-20
[41] curl_6.0.1                alabaster.sce_1.6.0
[43] lattice_0.22-6            tibble_3.2.1
[45] KEGGREST_1.46.0           BiocFileCache_2.14.0
[47] alabaster.schemas_1.6.0   ExperimentHub_2.14.0
[49] Biostrings_2.74.1         pillar_1.10.0
[51] BiocManager_1.30.25       filelock_1.0.3
[53] generics_0.1.3            RCurl_1.98-1.16
[55] BiocVersion_3.20.0        ensembldb_2.30.0
[57] sparseMatrixStats_1.18.0  alabaster.base_1.6.1
[59] glue_1.8.0                alabaster.ranges_1.6.0
[61] alabaster.matrix_1.6.1    lazyeval_0.2.2
[63] tools_4.4.2               AnnotationHub_3.14.0
[65] BiocIO_1.16.0             BiocNeighbors_2.0.1
[67] ScaledMatrix_1.14.0       GenomicAlignments_1.42.0
[69] XML_3.99-0.18             rhdf5_2.50.1
[71] grid_4.4.2                AnnotationDbi_1.68.0
[73] GenomeInfoDbData_1.2.13   BiocSingular_1.22.0
[75] HDF5Array_1.34.0          restfulr_0.0.15
[77] cli_3.6.3                 rsvd_1.0.5
[79] rappdirs_0.3.3            S4Arrays_1.6.0
[81] dplyr_1.1.4               AnnotationFilter_1.30.0
[83] alabaster.se_1.6.0        SparseArray_1.6.0
[85] rjson_0.2.23              memoise_2.0.1
[87] lifecycle_1.0.4           httr_1.4.7
[89] bit64_4.5.2

R 4.3, BioC 3.18

# Omitting the identical setup code here...

system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test=1))
##    user  system elapsed
##  37.935   0.156  38.186
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test=1, num.threads=2))
##    user  system elapsed
##  38.296   0.135  19.808
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test=1, num.threads=4))
##    user  system elapsed
##  39.354   0.120  11.058

Session information

R version 4.3.3 Patched (2024-04-09 r87519)
Platform: aarch64-apple-darwin23.6.0 (64-bit)
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Users/luna/Software/R/R-4-3-branch/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-4-3-branch/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] scRNAseq_2.16.0             SingleCellExperiment_1.24.0
 [3] celldex_1.12.0              SingleR_2.4.1
 [5] SummarizedExperiment_1.32.0 Biobase_2.62.0
 [7] GenomicRanges_1.54.1        GenomeInfoDb_1.38.8
 [9] IRanges_2.36.0              S4Vectors_0.40.2
[11] BiocGenerics_0.48.1         MatrixGenerics_1.14.0
[13] matrixStats_1.4.1

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                     bitops_1.0-9
 [3] biomaRt_2.58.2                rlang_1.1.4
 [5] magrittr_2.0.3                compiler_4.3.3
 [7] RSQLite_2.3.9                 GenomicFeatures_1.54.4
 [9] DelayedMatrixStats_1.24.0     png_0.1-8
[11] vctrs_0.6.5                   ProtGenerics_1.34.0
[13] stringr_1.5.1                 pkgconfig_2.0.3
[15] crayon_1.5.3                  fastmap_1.2.0
[17] dbplyr_2.5.0                  XVector_0.42.0
[19] Rsamtools_2.18.0              promises_1.3.2
[21] purrr_1.0.2                   bit_4.5.0.1
[23] zlibbioc_1.48.2               cachem_1.1.0
[25] beachmat_2.18.1               progress_1.2.3
[27] blob_1.2.4                    later_1.4.1
[29] DelayedArray_0.28.0           BiocParallel_1.36.0
[31] interactiveDisplayBase_1.40.0 irlba_2.3.5.1
[33] parallel_4.3.3                prettyunits_1.2.0
[35] R6_2.5.1                      stringi_1.8.4
[37] rtracklayer_1.62.0            Rcpp_1.0.13-1
[39] httpuv_1.6.15                 Matrix_1.6-5
[41] tidyselect_1.2.1              abind_1.4-8
[43] yaml_2.3.10                   codetools_0.2-20
[45] curl_6.0.1                    lattice_0.22-6
[47] tibble_3.2.1                  shiny_1.10.0
[49] withr_3.0.2                   KEGGREST_1.42.0
[51] BiocFileCache_2.10.2          xml2_1.3.6
[53] ExperimentHub_2.10.0          Biostrings_2.70.3
[55] pillar_1.10.0                 BiocManager_1.30.25
[57] filelock_1.0.3                generics_0.1.3
[59] RCurl_1.98-1.16               BiocVersion_3.18.1
[61] ensembldb_2.26.0              hms_1.1.3
[63] sparseMatrixStats_1.14.0      xtable_1.8-4
[65] glue_1.8.0                    lazyeval_0.2.2
[67] tools_4.3.3                   AnnotationHub_3.10.1
[69] BiocIO_1.12.0                 ScaledMatrix_1.10.0
[71] GenomicAlignments_1.38.2      XML_3.99-0.18
[73] grid_4.3.3                    AnnotationDbi_1.64.1
[75] GenomeInfoDbData_1.2.11       BiocSingular_1.18.0
[77] restfulr_0.0.15               cli_3.6.3
[79] rsvd_1.0.5                    rappdirs_0.3.3
[81] S4Arrays_1.2.1                dplyr_1.1.4
[83] AnnotationFilter_1.26.0       digest_0.6.37
[85] SparseArray_1.2.4             rjson_0.2.23
[87] memoise_2.0.1                 htmltools_0.5.8.1
[89] lifecycle_1.0.4               httr_1.4.7
[91] mime_0.12                     bit64_4.5.2

As we can see, SingleR 2.8.0 is a little bit faster than 2.4.1 on my machine (Mac M2). If you can't share your dataset, try to reproduce your timings with a dataset in scRNAseq. If you're still getting worse timings for 2.8.0 on BaronPancreasData, then I don't have any more ideas.

Another observation is that, for me, SingleR scales nicely as we double the number of threads from 1 to 2, and then 2 to 4. If you're not observing this, it suggests that there is a serial processing bottleneck - specifically related to your test dataset, given that classifySingleR() is most affected. My best guess is that your test dataset uses a matrix representation that is not amenable to parallel data extraction - for example, HDF5-based arrays only allow serial reading within a process, and unsupported delayed operations force the underlying beachmat code to call R from C++ in order to extract matrix contents. (If such "exotic" matrices are involved, the speed of data extraction depends on other packages like HDF5Array and DelayedArray, and so any differences in timings are largely outside of my control. This can be somewhat circumvented in the HDF5 case by using beachmat.hdf5 but the serial nature of HDF5 is unavoidable here.)

I can say that the slight differences in results are expected and are discussed in the NEWS for 2.8.0, see

SingleR/inst/NEWS.Rd

Lines 34 to 38 in c2922b3

    
           \item The introduction of \code{test.genes=} means that we no longer need to explicitly subset the rows of the reference dataset (to match the test features) in \code{SingleR()}. 
        
           This saves memory by avoiding an unnecessary copy of the reference dataset, but may also slightly alter the marker selection as ties are broken in a different way. 
        
           Namely, if the top X genes are used as markers, and the X-th and (X+1)-th gene have the same log-fold change,  
        
           tie breaking will be based on the ordering of the rows in the reference matrix - which is no longer the same as in the previous version of \pkg{SingleR}. 
        
           This results in some slight differences in the markers that propagate down to the classification results.

ycl6 · 2025-01-05T15:37:12Z

Hi @LTLA

Thanks for your explanation. I was indeed using an object that has its assays stored in DelayedMatrix format (loaded the CR data from the h5 file). I was using a demo dataset from 10x Genomics, but I think the the datasets from TENxPBMCData offers examples in DelayedMatrix too, so will be easier to replicate.

Here I am using pbmc33k that has 32,738 genes and 33,148 cells. I didn't run this through my workflow to clean-up or add other bits of info as I did with my previous datasets that seems to slow things down a lot when running SingleR. Here the old version in R 4.3 is only slightly faster. But as you said, there is no significant gain using more num.threads if the assay matrix is a HDF5-based array.

library(SingleR)
library(scran)
library(scuttle)

library(TENxPBMCData)
sce <- TENxPBMCData(dataset = "pbmc33k")
rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ENSEMBL_ID, rowData(sce)$Symbol_TENx)
colnames(sce) <- sce$Barcode

set.seed(1000)
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster=clusters)
sce <- logNormCounts(sce)

bp <- celldex::BlueprintEncodeData()

R 4.4, BioC 3.20

system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test='logcounts'))
##   user  system elapsed 
##126.988   0.317 127.297 
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test='logcounts', num.threads=2))
##   user  system elapsed 
##169.576   0.241 107.422 
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test='logcounts', num.threads=4))
##   user  system elapsed 
##169.480   0.185  79.117

R 4.3, BioC 3.18

system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test='logcounts'))
##   user  system elapsed 
##158.698   0.121 158.803 
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test='logcounts', num.threads=2))
##   user  system elapsed 
##170.560   0.148 101.268 
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test='logcounts', num.threads=4))
##   user  system elapsed 
##181.603   0.191  72.212

I am now tweaking my workflow to create dgCMatrix assays that can be use for calculations that'll benefit from parallelisation with a BiocParallel interface (not just SingleR) and keep the main object still HDF5-based. It's great to re-work and optimise the workflow using what I learn here. Thanks!

ycl6 closed this as completed Jan 5, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed issue with SingleR v2.8.0 in BioC 3.20 #292

Speed issue with SingleR v2.8.0 in BioC 3.20 #292

ycl6 commented Jan 3, 2025 •

edited

Loading

ycl6 commented Jan 3, 2025 •

edited

Loading

ycl6 commented Jan 3, 2025

LTLA commented Jan 3, 2025

ycl6 commented Jan 5, 2025

Speed issue with SingleR v2.8.0 in BioC 3.20 #292

Speed issue with SingleR v2.8.0 in BioC 3.20 #292

Comments

ycl6 commented Jan 3, 2025 • edited Loading

ycl6 commented Jan 3, 2025 • edited Loading

For R 4.3, v2.4.1

For R 4.4, v2.8.0

ycl6 commented Jan 3, 2025

LTLA commented Jan 3, 2025

R 4.4, BioC 3.20

R 4.3, BioC 3.18

ycl6 commented Jan 5, 2025

R 4.4, BioC 3.20

R 4.3, BioC 3.18

ycl6 commented Jan 3, 2025 •

edited

Loading

ycl6 commented Jan 3, 2025 •

edited

Loading