Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed issue with SingleR v2.8.0 in BioC 3.20 #292

Closed
ycl6 opened this issue Jan 3, 2025 · 4 comments
Closed

Speed issue with SingleR v2.8.0 in BioC 3.20 #292

ycl6 opened this issue Jan 3, 2025 · 4 comments

Comments

@ycl6
Copy link

ycl6 commented Jan 3, 2025

Hi,

I am testing my single-cell workflow using the new R-4.4 (v4.4.2) these couple of days and noticed SingleR (v2.8.0) is running a lot slower compared with my other environment that has R-4.3 and SingleR v2.4.1.

From the commit history, I don't see what might have caused the drop in speed between the two versions.
I am wondering if anyone has any insight into this and how I can run the new version of SingleR as fast as before (is there any parameter I can tweak)?

Below is how I run SingleR and my sce has 9,188 genes and 16,601 cells.

library(SingleR)

bpen <- celldex::BlueprintEncodeData()

sce <- readRDS("sce.rds")

system.time({
x <- SingleR(test = sce, assay.type.test = "logcounts", de.method = "classic", 
             ref = list(BP = bpen), labels = list(bpen$label.main), num.threads = 2)
})

table(x$labels)

With R-4.3 and SingleR v2.4.1, I have

   user  system elapsed 
192.798  29.163 199.104 

     B-cells CD4+ T-cells CD8+ T-cells  Eosinophils Erythrocytes          HSC    Monocytes 
        1503         7341         3475            6            1           10         3104 
    NK cells 
        1161 

and with R-4.4 and SingleR v2.8.0, I have

   user  system elapsed 
384.427   8.220 283.556 

     B-cells CD4+ T-cells CD8+ T-cells Erythrocytes          HSC    Monocytes     NK cells 
        1506         7336         3500            1           10         3107         1141 

When I change num.threads to 4 (no speed gain in both, new version still runs slower):

With R-4.3 and SingleR v2.4.1, I have

   user  system elapsed 
187.586  22.508 187.168 

and with R-4.4 and SingleR v2.8.0, I have

   user  system elapsed 
380.693   8.094 280.565 

p.s. The predicted cell type labels are a little different, but I am guessing its due to updates with the celldex package. Although, I did notice the assays structure of the reference I am using (BlueprintEncodeData) is a little different.

This is celldex_1.12.0 (loaded and attached):

> str(bpen)
Formal class 'SummarizedExperiment' [package "SummarizedExperiment"] with 5 slots
  ..@ colData        :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : chr [1:259] "mature.neutrophil" "CD14.positive..CD16.negative.classical.monocyte" "mature.neutrophil.1" "megakaryocyte.erythroid.progenitor.cell" ...
  .. .. ..@ nrows          : int 259
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  .. .. ..@ listData       :List of 3
  .. .. .. ..$ label.main: chr [1:259] "Neutrophils" "Monocytes" "Neutrophils" "HSC" ...
  .. .. .. ..$ label.fine: chr [1:259] "Neutrophils" "Monocytes" "Neutrophils" "MEP" ...
  .. .. .. ..$ label.ont : chr [1:259] "CL:0000775" "CL:0000576" "CL:0000775" "CL:0000050" ...
  ..@ assays         :Formal class 'SimpleAssays' [package "SummarizedExperiment"] with 1 slot
  .. .. ..@ data:Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
  .. .. .. .. ..@ listData       :List of 1
  .. .. .. .. .. ..$ logcounts: num [1:19859, 1:259] 0 0 4.24 2.59 0 ...
  .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. .. .. ..$ : chr [1:19859] "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
  .. .. .. .. .. .. .. ..$ : chr [1:259] "mature.neutrophil" "CD14.positive..CD16.negative.classical.monocyte" "mature.neutrophil.1" "megakaryocyte.erythroid.progenitor.cell" ...
  .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  ..@ NAMES          : chr [1:19859] "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
  ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : NULL
  .. .. ..@ nrows          : int 19859
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  .. .. ..@ listData       : Named list()
  ..@ metadata       : list()

And this is celldex_1.16.0 (loaded, not attached):

> str(bpen)
Formal class 'SummarizedExperiment' [package "SummarizedExperiment"] with 5 slots
  ..@ colData        :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : chr [1:259] "mature.neutrophil" "CD14.positive..CD16.negative.classical.monocyte" "mature.neutrophil.1" "megakaryocyte.erythroid.progenitor.cell" ...
  .. .. ..@ nrows          : int 259
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  .. .. ..@ listData       :List of 3
  .. .. .. ..$ label.main: chr [1:259] "Neutrophils" "Monocytes" "Neutrophils" "HSC" ...
  .. .. .. ..$ label.fine: chr [1:259] "Neutrophils" "Monocytes" "Neutrophils" "MEP" ...
  .. .. .. ..$ label.ont : chr [1:259] "CL:0000775" "CL:0000576" "CL:0000775" "CL:0000050" ...
  ..@ assays         :Formal class 'SimpleAssays' [package "SummarizedExperiment"] with 1 slot
  .. .. ..@ data:Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
  .. .. .. .. ..@ listData       :List of 1
  .. .. .. .. .. ..$ logcounts: num [1:19859, 1:259] 0 0 4.24 2.59 0 ...
  .. .. .. .. ..@ elementType    : chr "ANY"
  .. .. .. .. ..@ elementMetadata: NULL
  .. .. .. .. ..@ metadata       : list()
  ..@ NAMES          : chr [1:19859] "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
  ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
  .. .. ..@ rownames       : NULL
  .. .. ..@ nrows          : int 19859
  .. .. ..@ elementType    : chr "ANY"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  .. .. ..@ listData       : Named list()
  ..@ metadata       : list()
R v4.3.3 and v4.4.2 session info
> sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /home/ihsuan/miniconda3/envs/r4.3/lib/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8       
 [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] HDF5Array_1.30.1            rhdf5_2.46.1                DelayedArray_0.28.0        
 [4] SparseArray_1.2.4           S4Arrays_1.2.1              abind_1.4-8                
 [7] Matrix_1.6-5                celldex_1.12.0              SingleR_2.4.1              
[10] SummarizedExperiment_1.32.0 Biobase_2.62.0              GenomicRanges_1.54.1       
[13] GenomeInfoDb_1.38.8         IRanges_2.36.0              S4Vectors_0.40.2           
[16] BiocGenerics_0.48.1         MatrixGenerics_1.14.0       matrixStats_1.4.1          

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1              dplyr_1.1.4                   blob_1.2.4                   
 [4] filelock_1.0.3                Biostrings_2.70.3             bitops_1.0-9                 
 [7] SingleCellExperiment_1.24.0   fastmap_1.2.0                 RCurl_1.98-1.16              
[10] BiocFileCache_2.10.2          promises_1.3.2                rsvd_1.0.5                   
[13] digest_0.6.37                 mime_0.12                     lifecycle_1.0.4              
[16] KEGGREST_1.42.0               interactiveDisplayBase_1.40.0 RSQLite_2.3.7                
[19] magrittr_2.0.3                compiler_4.3.3                rlang_1.1.4                  
[22] tools_4.3.3                   utf8_1.2.4                    yaml_2.3.10                  
[25] bit_4.5.0.1                   curl_6.0.1                    BiocParallel_1.36.0          
[28] withr_3.0.2                   purrr_1.0.2                   grid_4.3.3                   
[31] fansi_1.0.6                   ExperimentHub_2.10.0          beachmat_2.18.1              
[34] xtable_1.8-4                  Rhdf5lib_1.24.2               cli_3.6.3                    
[37] crayon_1.5.3                  generics_0.1.3                httr_1.4.7                   
[40] DelayedMatrixStats_1.24.0     scuttle_1.12.0                DBI_1.2.3                    
[43] cachem_1.1.0                  zlibbioc_1.48.2               parallel_4.3.3               
[46] AnnotationDbi_1.64.1          BiocManager_1.30.25           XVector_0.42.0               
[49] vctrs_0.6.5                   BiocSingular_1.18.0           bit64_4.5.2                  
[52] irlba_2.3.5.1                 glue_1.8.0                    codetools_0.2-20             
[55] BiocVersion_3.18.1            later_1.3.2                   ScaledMatrix_1.10.0          
[58] tibble_3.2.1                  pillar_1.9.0                  rhdf5filters_1.14.1          
[61] rappdirs_0.3.3                htmltools_0.5.8.1             GenomeInfoDbData_1.2.11      
[64] R6_2.5.1                      dbplyr_2.5.0                  sparseMatrixStats_1.14.0     
[67] shiny_1.10.0                  lattice_0.22-6                AnnotationHub_3.10.1         
[70] png_0.1-8                     memoise_2.0.1                 httpuv_1.6.15                
[73] Rcpp_1.0.13                   pkgconfig_2.0.3              

> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-conda-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /home/ihsuan/miniconda3/envs/r4.4/lib/libopenblasp-r0.3.28.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8       
 [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] SingleR_2.8.0               SummarizedExperiment_1.36.0 Biobase_2.66.0             
 [4] GenomicRanges_1.58.0        GenomeInfoDb_1.42.1         IRanges_2.40.1             
 [7] S4Vectors_0.44.0            BiocGenerics_0.52.0         MatrixGenerics_1.18.0      
[10] matrixStats_1.4.1          

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1            alabaster.se_1.6.0          dplyr_1.1.4                
 [4] blob_1.2.4                  filelock_1.0.3              Biostrings_2.74.1          
 [7] SingleCellExperiment_1.28.1 fastmap_1.2.0               BiocFileCache_2.14.0       
[10] rsvd_1.0.5                  lifecycle_1.0.4             alabaster.matrix_1.6.1     
[13] KEGGREST_1.46.0             alabaster.base_1.6.1        RSQLite_2.3.9              
[16] magrittr_2.0.3              compiler_4.4.2              rlang_1.1.4                
[19] tools_4.4.2                 yaml_2.3.10                 S4Arrays_1.6.0             
[22] bit_4.5.0.1                 curl_6.0.1                  DelayedArray_0.32.0        
[25] abind_1.4-8                 BiocParallel_1.40.0         HDF5Array_1.34.0           
[28] gypsum_1.2.0                grid_4.4.2                  ExperimentHub_2.14.0       
[31] beachmat_2.22.0             Rhdf5lib_1.28.0             cli_3.6.3                  
[34] crayon_1.5.3                generics_0.1.3              httr_1.4.7                 
[37] DelayedMatrixStats_1.28.0   scuttle_1.16.0              DBI_1.2.3                  
[40] cachem_1.1.0                rhdf5_2.50.1                zlibbioc_1.52.0            
[43] parallel_4.4.2              AnnotationDbi_1.68.0        BiocManager_1.30.25        
[46] XVector_0.46.0              alabaster.schemas_1.6.0     vctrs_0.6.5                
[49] Matrix_1.7-1                jsonlite_1.8.9              BiocSingular_1.22.0        
[52] BiocNeighbors_2.0.1         bit64_4.5.2                 irlba_2.3.5.1              
[55] alabaster.ranges_1.6.0      glue_1.8.0                  codetools_0.2-20           
[58] BiocVersion_3.20.0          UCSC.utils_1.2.0            ScaledMatrix_1.14.0        
[61] tibble_3.2.1                pillar_1.10.0               rappdirs_0.3.3             
[64] rhdf5filters_1.18.0         GenomeInfoDbData_1.2.13     R6_2.5.1                   
[67] dbplyr_2.5.0                httr2_1.0.7                 sparseMatrixStats_1.18.0   
[70] lattice_0.22-6              AnnotationHub_3.14.0        png_0.1-8                  
[73] celldex_1.16.0              memoise_2.0.1               Rcpp_1.0.13-1              
[76] SparseArray_1.6.0           pkgconfig_2.0.3            
@ycl6
Copy link
Author

ycl6 commented Jan 3, 2025

OK, so I decided to divide the SingleR function into sections and time each to find out which step might have contributed to the slower speed.

I am using the same sce as before. There is a change in how the test and ref are prepped in the two versions and I have adapted accordingly.

As you can see below, the new v2.8.0 is faster in this step. The trainSingleR() step takes less than a second to complete in both versions. However, classifySingleR() is where the new v2.8.0 loses out in terms of speed.

I'll dig a little into it and see what I can find.


library(BiocParallel)
library(DelayedArray)
library(SingleR)

bpen <- celldex::BlueprintEncodeData()
sce <- readRDS("sce.rds")

For R 4.3, v2.4.1

prep.input <- function (test, ref,
    assay.type.test = "logcounts", assay.type.ref = "logcounts",
    check.missing = TRUE, num.threads = bpnworkers(BPPARAM),
    BNPARAM = NULL, BPPARAM = SerialParam()) {
    if (!bpisup(BPPARAM) && !is(BPPARAM, "MulticoreParam")) {
        bpstart(BPPARAM)
        on.exit(bpstop(BPPARAM))
    }
    test <- SingleR:::.to_clean_matrix(test, assay.type.test, check.missing,
        msg = "test", BPPARAM = BPPARAM)
    if (single.ref <- !SingleR:::.is_list(ref)) {
        ref <- list(ref)
    }
    ref <- lapply(ref, FUN = SingleR:::.to_clean_matrix, assay.type = assay.type.ref,
        check.missing = check.missing, msg = "ref", BPPARAM = BPPARAM)
    refnames <- Reduce(intersect, lapply(ref, rownames))
    keep <- intersect(rownames(test), refnames)
    if (length(keep) == 0) {
        stop("no common genes between 'test' and 'ref'")
    }
    if (!identical(keep, rownames(test))) {
        test <- test[keep, ]
    }
    for (i in seq_along(ref)) {
        if (!identical(keep, rownames(ref[[i]]))) {
            ref[[i]] <- ref[[i]][keep, , drop = FALSE]
        }
    }
    if (single.ref) {
        ref <- ref[[1]]
    }
    return(list(ref, test))
}
system.time({
    x <- prep.input(test = sce, ref = list(BP = bpen), assay.type.test = "logcounts", num.threads = 2)
})
   user  system elapsed 
116.086  26.208 142.294 
system.time({
    trained <- trainSingleR(x[[1]], labels = list(bpen$label.main), genes = "de", sd.thresh = 1,
        de.method = "classic", de.n = NULL,, de.args = list(),
        aggr.ref = FALSE, aggr.args = list(), recompute = TRUE,
        restrict = NULL, check.missing = TRUE,
        BNPARAM = NULL, num.threads = 2, BPPARAM = SerialParam())
})
   user  system elapsed 
  0.492   0.000   0.399 
system.time({
    class <- classifySingleR(x[[2]], trained, quantile = 0.8, fine.tune = TRUE,
        tune.thresh = 0.05, prune = TRUE, check.missing = FALSE,
        num.threads = 2, BPPARAM = SerialParam())
})
   user  system elapsed 
 81.354   3.867  60.825 

For R 4.4, v2.8.0

prep.input <- function (test, ref,
    assay.type.test = "logcounts", assay.type.ref = "logcounts",
    check.missing = TRUE, num.threads = bpnworkers(BPPARAM),
    BNPARAM = NULL, BPPARAM = SerialParam()) {
    if (!bpisup(BPPARAM) && !is(BPPARAM, "MulticoreParam")) {
        bpstart(BPPARAM)
        on.exit(bpstop(BPPARAM))
    }
    test <- SingleR:::.to_clean_matrix(test, assay.type.test, check.missing,
        msg = "test", BPPARAM = BPPARAM)
    tmp.ref <- ref
    if (!is.list(tmp.ref) || is.data.frame(tmp.ref)) {
        tmp.ref <- list(ref)
    }
    for (rr in tmp.ref) {
        keep <- rownames(test) %in% rownames(rr)
        if (!all(keep)) {
            test <- DelayedArray(test)[keep, , drop = FALSE]
        }
    }
    if (nrow(test) == 0) {
        stop("no common genes between 'test' and 'ref")
    }
    return(list(ref, test))
}
system.time({
    x <- prep.input(test = sce, ref = list(BP = bpen), assay.type.test = "logcounts", num.threads = 2)
})
   user  system elapsed 
 65.575   1.644  67.212 
system.time({
    trained <- trainSingleR(x[[1]], labels = list(bpen$label.main), genes = "de", sd.thresh = 1,
        de.method = "classic", de.n = NULL,, de.args = list(),
        aggr.ref = FALSE, aggr.args = list(), recompute = TRUE,
        restrict = NULL, test.genes = rownames(x[[2]]), check.missing = TRUE,
        BNPARAM = NULL, num.threads = 2, BPPARAM = SerialParam())
})
   user  system elapsed 
  0.630   0.016   0.555 
system.time({
    class <- classifySingleR(x[[2]], trained, quantile = 0.8, fine.tune = TRUE,
        tune.thresh = 0.05, prune = TRUE, check.missing = FALSE,
        num.threads = 2, BPPARAM = SerialParam())
})
   user  system elapsed 
316.810   5.407 213.873 

@ycl6
Copy link
Author

ycl6 commented Jan 3, 2025

OK, I've looked into the classifySingleR function and tested the time. As far as I can tell, the .classify_internals function and the C++ code (classify_single.cpp) associated with commit 74a6d59 is what drives the increase in time consumption, the rest make no difference. I don't know much about C language, so this is as far as I can do diagnosing this issue.

@LTLA
Copy link
Member

LTLA commented Jan 3, 2025

For me:

R 4.4, BioC 3.20

library(SingleR)

library(celldex)
bp <- BlueprintEncodeData()

# Using Baron as it's human and it is large enough for some decent testing (~8k cells).
library(scRNAseq)
sce <- BaronPancreasData('human')

system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test=1))
##    user  system elapsed
##  29.179   0.081  29.259
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test=1, num.threads=2))
##    user  system elapsed
##  30.008   0.141  15.756
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test=1, num.threads=4))
##    user  system elapsed
##  30.981   0.051   8.795
Session information
R version 4.4.2 Patched (2024-11-18 r87348)
Platform: aarch64-apple-darwin22.6.0
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Users/luna/Software/R/R-4-4-branch/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-4-4-branch/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] scRNAseq_2.20.0             SingleCellExperiment_1.28.1
 [3] celldex_1.16.0              SingleR_2.8.0
 [5] SummarizedExperiment_1.36.0 Biobase_2.66.0
 [7] GenomicRanges_1.58.0        GenomeInfoDb_1.42.1
 [9] IRanges_2.40.1              S4Vectors_0.44.0
[11] BiocGenerics_0.52.0         MatrixGenerics_1.18.0
[13] matrixStats_1.4.1

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                 bitops_1.0-9
 [3] httr2_1.0.7               rlang_1.1.4
 [5] magrittr_2.0.3            gypsum_1.2.0
 [7] compiler_4.4.2            RSQLite_2.3.9
 [9] GenomicFeatures_1.58.0    DelayedMatrixStats_1.28.0
[11] png_0.1-8                 vctrs_0.6.5
[13] ProtGenerics_1.38.0       pkgconfig_2.0.3
[15] crayon_1.5.3              fastmap_1.2.0
[17] dbplyr_2.5.0              XVector_0.46.0
[19] Rsamtools_2.22.0          UCSC.utils_1.2.0
[21] bit_4.5.0.1               zlibbioc_1.52.0
[23] cachem_1.1.0              beachmat_2.22.0
[25] jsonlite_1.8.9            blob_1.2.4
[27] rhdf5filters_1.18.0       DelayedArray_0.32.0
[29] Rhdf5lib_1.28.0           BiocParallel_1.40.0
[31] irlba_2.3.5.1             parallel_4.4.2
[33] R6_2.5.1                  rtracklayer_1.66.0
[35] Rcpp_1.0.13-1             Matrix_1.7-1
[37] tidyselect_1.2.1          abind_1.4-8
[39] yaml_2.3.10               codetools_0.2-20
[41] curl_6.0.1                alabaster.sce_1.6.0
[43] lattice_0.22-6            tibble_3.2.1
[45] KEGGREST_1.46.0           BiocFileCache_2.14.0
[47] alabaster.schemas_1.6.0   ExperimentHub_2.14.0
[49] Biostrings_2.74.1         pillar_1.10.0
[51] BiocManager_1.30.25       filelock_1.0.3
[53] generics_0.1.3            RCurl_1.98-1.16
[55] BiocVersion_3.20.0        ensembldb_2.30.0
[57] sparseMatrixStats_1.18.0  alabaster.base_1.6.1
[59] glue_1.8.0                alabaster.ranges_1.6.0
[61] alabaster.matrix_1.6.1    lazyeval_0.2.2
[63] tools_4.4.2               AnnotationHub_3.14.0
[65] BiocIO_1.16.0             BiocNeighbors_2.0.1
[67] ScaledMatrix_1.14.0       GenomicAlignments_1.42.0
[69] XML_3.99-0.18             rhdf5_2.50.1
[71] grid_4.4.2                AnnotationDbi_1.68.0
[73] GenomeInfoDbData_1.2.13   BiocSingular_1.22.0
[75] HDF5Array_1.34.0          restfulr_0.0.15
[77] cli_3.6.3                 rsvd_1.0.5
[79] rappdirs_0.3.3            S4Arrays_1.6.0
[81] dplyr_1.1.4               AnnotationFilter_1.30.0
[83] alabaster.se_1.6.0        SparseArray_1.6.0
[85] rjson_0.2.23              memoise_2.0.1
[87] lifecycle_1.0.4           httr_1.4.7
[89] bit64_4.5.2

R 4.3, BioC 3.18

# Omitting the identical setup code here...

system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test=1))
##    user  system elapsed
##  37.935   0.156  38.186
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test=1, num.threads=2))
##    user  system elapsed
##  38.296   0.135  19.808
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test=1, num.threads=4))
##    user  system elapsed
##  39.354   0.120  11.058
Session information
R version 4.3.3 Patched (2024-04-09 r87519)
Platform: aarch64-apple-darwin23.6.0 (64-bit)
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Users/luna/Software/R/R-4-3-branch/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-4-3-branch/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] scRNAseq_2.16.0             SingleCellExperiment_1.24.0
 [3] celldex_1.12.0              SingleR_2.4.1
 [5] SummarizedExperiment_1.32.0 Biobase_2.62.0
 [7] GenomicRanges_1.54.1        GenomeInfoDb_1.38.8
 [9] IRanges_2.36.0              S4Vectors_0.40.2
[11] BiocGenerics_0.48.1         MatrixGenerics_1.14.0
[13] matrixStats_1.4.1

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                     bitops_1.0-9
 [3] biomaRt_2.58.2                rlang_1.1.4
 [5] magrittr_2.0.3                compiler_4.3.3
 [7] RSQLite_2.3.9                 GenomicFeatures_1.54.4
 [9] DelayedMatrixStats_1.24.0     png_0.1-8
[11] vctrs_0.6.5                   ProtGenerics_1.34.0
[13] stringr_1.5.1                 pkgconfig_2.0.3
[15] crayon_1.5.3                  fastmap_1.2.0
[17] dbplyr_2.5.0                  XVector_0.42.0
[19] Rsamtools_2.18.0              promises_1.3.2
[21] purrr_1.0.2                   bit_4.5.0.1
[23] zlibbioc_1.48.2               cachem_1.1.0
[25] beachmat_2.18.1               progress_1.2.3
[27] blob_1.2.4                    later_1.4.1
[29] DelayedArray_0.28.0           BiocParallel_1.36.0
[31] interactiveDisplayBase_1.40.0 irlba_2.3.5.1
[33] parallel_4.3.3                prettyunits_1.2.0
[35] R6_2.5.1                      stringi_1.8.4
[37] rtracklayer_1.62.0            Rcpp_1.0.13-1
[39] httpuv_1.6.15                 Matrix_1.6-5
[41] tidyselect_1.2.1              abind_1.4-8
[43] yaml_2.3.10                   codetools_0.2-20
[45] curl_6.0.1                    lattice_0.22-6
[47] tibble_3.2.1                  shiny_1.10.0
[49] withr_3.0.2                   KEGGREST_1.42.0
[51] BiocFileCache_2.10.2          xml2_1.3.6
[53] ExperimentHub_2.10.0          Biostrings_2.70.3
[55] pillar_1.10.0                 BiocManager_1.30.25
[57] filelock_1.0.3                generics_0.1.3
[59] RCurl_1.98-1.16               BiocVersion_3.18.1
[61] ensembldb_2.26.0              hms_1.1.3
[63] sparseMatrixStats_1.14.0      xtable_1.8-4
[65] glue_1.8.0                    lazyeval_0.2.2
[67] tools_4.3.3                   AnnotationHub_3.10.1
[69] BiocIO_1.12.0                 ScaledMatrix_1.10.0
[71] GenomicAlignments_1.38.2      XML_3.99-0.18
[73] grid_4.3.3                    AnnotationDbi_1.64.1
[75] GenomeInfoDbData_1.2.11       BiocSingular_1.18.0
[77] restfulr_0.0.15               cli_3.6.3
[79] rsvd_1.0.5                    rappdirs_0.3.3
[81] S4Arrays_1.2.1                dplyr_1.1.4
[83] AnnotationFilter_1.26.0       digest_0.6.37
[85] SparseArray_1.2.4             rjson_0.2.23
[87] memoise_2.0.1                 htmltools_0.5.8.1
[89] lifecycle_1.0.4               httr_1.4.7
[91] mime_0.12                     bit64_4.5.2

As we can see, SingleR 2.8.0 is a little bit faster than 2.4.1 on my machine (Mac M2). If you can't share your dataset, try to reproduce your timings with a dataset in scRNAseq. If you're still getting worse timings for 2.8.0 on BaronPancreasData, then I don't have any more ideas.

Another observation is that, for me, SingleR scales nicely as we double the number of threads from 1 to 2, and then 2 to 4. If you're not observing this, it suggests that there is a serial processing bottleneck - specifically related to your test dataset, given that classifySingleR() is most affected. My best guess is that your test dataset uses a matrix representation that is not amenable to parallel data extraction - for example, HDF5-based arrays only allow serial reading within a process, and unsupported delayed operations force the underlying beachmat code to call R from C++ in order to extract matrix contents. (If such "exotic" matrices are involved, the speed of data extraction depends on other packages like HDF5Array and DelayedArray, and so any differences in timings are largely outside of my control. This can be somewhat circumvented in the HDF5 case by using beachmat.hdf5 but the serial nature of HDF5 is unavoidable here.)

I can say that the slight differences in results are expected and are discussed in the NEWS for 2.8.0, see

SingleR/inst/NEWS.Rd

Lines 34 to 38 in c2922b3

\item The introduction of \code{test.genes=} means that we no longer need to explicitly subset the rows of the reference dataset (to match the test features) in \code{SingleR()}.
This saves memory by avoiding an unnecessary copy of the reference dataset, but may also slightly alter the marker selection as ties are broken in a different way.
Namely, if the top X genes are used as markers, and the X-th and (X+1)-th gene have the same log-fold change,
tie breaking will be based on the ordering of the rows in the reference matrix - which is no longer the same as in the previous version of \pkg{SingleR}.
This results in some slight differences in the markers that propagate down to the classification results.

@ycl6
Copy link
Author

ycl6 commented Jan 5, 2025

Hi @LTLA

Thanks for your explanation. I was indeed using an object that has its assays stored in DelayedMatrix format (loaded the CR data from the h5 file). I was using a demo dataset from 10x Genomics, but I think the the datasets from TENxPBMCData offers examples in DelayedMatrix too, so will be easier to replicate.

Here I am using pbmc33k that has 32,738 genes and 33,148 cells. I didn't run this through my workflow to clean-up or add other bits of info as I did with my previous datasets that seems to slow things down a lot when running SingleR. Here the old version in R 4.3 is only slightly faster. But as you said, there is no significant gain using more num.threads if the assay matrix is a HDF5-based array.

library(SingleR)
library(scran)
library(scuttle)

library(TENxPBMCData)
sce <- TENxPBMCData(dataset = "pbmc33k")
rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ENSEMBL_ID, rowData(sce)$Symbol_TENx)
colnames(sce) <- sce$Barcode

set.seed(1000)
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster=clusters)
sce <- logNormCounts(sce)

bp <- celldex::BlueprintEncodeData()

R 4.4, BioC 3.20

system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test='logcounts'))
##   user  system elapsed 
##126.988   0.317 127.297 
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test='logcounts', num.threads=2))
##   user  system elapsed 
##169.576   0.241 107.422 
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test='logcounts', num.threads=4))
##   user  system elapsed 
##169.480   0.185  79.117 

R 4.3, BioC 3.18

system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test='logcounts'))
##   user  system elapsed 
##158.698   0.121 158.803 
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test='logcounts', num.threads=2))
##   user  system elapsed 
##170.560   0.148 101.268 
system.time(res <- SingleR(sce, bp, labels=bp$label.main, assay.type.test='logcounts', num.threads=4))
##   user  system elapsed 
##181.603   0.191  72.212 

I am now tweaking my workflow to create dgCMatrix assays that can be use for calculations that'll benefit from parallelisation with a BiocParallel interface (not just SingleR) and keep the main object still HDF5-based. It's great to re-work and optimise the workflow using what I learn here. Thanks!

@ycl6 ycl6 closed this as completed Jan 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants