Commit 5ec76dc

Update documentation for all functions
1 parent b56ef90 commit 5ec76dc

19 files changed: +2264 -1107 lines changed

Diff for: DESCRIPTION

+2 -2
@@ -1,8 +1,8 @@
 Package: CHOIR
-Title: CHOIR - Clustering Hierachy Optimization by Iterative Random forests
+Title: CHOIR - Cluster Hierachy Optimization by Iterative Random forests
 Version: 0.2.0
 Authors@R:
-person("Cathrine", "Petersen", , "cathrine.petersen@gladstone.ucsf.edu", role = c("aut", "cre"),
+person("Cathrine", "Sant", , "cathrine.sant@gladstone.ucsf.edu", role = c("aut", "cre"),
 comment = c(ORCID = "0000-0002-5821-9828"))
 Description: CHOIR is a clustering algorithm for single-cell sequencing data. CHOIR applies a framework of permutation tests and random forest classifiers across a hierarchical clustering tree to statistically identify clusters that represent distinct populations.
 License: MIT + file LICENSE

Diff for: NAMESPACE

+1
@@ -5,6 +5,7 @@ export(CHOIR)
 export(CHOIRpalette)
 export(buildParentTree)
 export(buildTree)
+export(combineTrees)
 export(compareClusters)
 export(inferTree)
 export(plotCHOIR)

Diff for: R/CHOIR.R

+253 -137
Large diffs are not rendered by default.

Diff for: R/HelperUtils.R

+1 -2
@@ -674,8 +674,7 @@

 .getNewLabels <- function(merge_groups,
 level,
-compiled_labels,
-) {
+compiled_labels) {

 # Create new list
 merge_group_labels <- vector(mode = "list", length(merge_groups))

Diff for: R/PlottingUtils.R

+10 -8
@@ -59,17 +59,19 @@ CHOIRpalette <- function(n) {
 #' Simplifies running \code{Seurat::RunUMAP()} after CHOIR clustering by
 #' automatically fetching the pre-generated dimensionality reductions.
 #'
-#' @param object An object of class 'Seurat', 'SingleCellExperiment', or
-#' 'ArchRProject' that has undergone CHOIR clustering.
-#' @param key The name under which CHOIR-related data for this run is retrieved
-#' from the object. Defaults to 'CHOIR'.
+#' @param object An object of class \code{Seurat}, \code{SingleCellExperiment},
+#' or \code{ArchRProject} that has undergone CHOIR clustering. For multi-omic
+#' data, we recommend using \code{ArchRProject} objects.
+#' @param key The name under which CHOIR-related data for this run is stored in
+#' the object. Defaults to “CHOIR”.
 #' @param reduction A character vector indicating which CHOIR subtree
 #' dimensionality reduction to run UMAP on (e.g., 'P0_reduction',
 #' 'P1_reduction'). Default = \code{NULL} will run UMAP on all of the
-#' dimensionality reductions generated by CHOIR stored under the provided 'key'.
-#' @param verbose A boolean value indicating whether to use verbose output
-#' during the execution of this function. Can be set to \code{FALSE} for a
-#' cleaner output.
+#' dimensionality reductions generated by CHOIR stored under the provided
+#' \code{key}.
+#' @param verbose A Boolean value indicating whether to use verbose output
+#' during the execution of CHOIR. Defaults to \code{TRUE}, but can be set to
+#' \code{FALSE} for a cleaner output.
 #'
 #' @return Returns the object with the following added data stored under the
 #' provided key: \describe{

Diff for: R/buildTree.R

+230 -128
Large diffs are not rendered by default.

Diff for: R/combineTrees.R

+188 -33
@@ -6,45 +6,200 @@
 #' into a single compiled tree, which is then further pruned to standardize
 #' thresholds across each subtree.
 #'
-#' @param object An object of class 'Seurat', 'SingleCellExperiment', or
-#' 'ArchRProject', output after running buildParentTree.
+#' @param object An object of class \code{Seurat}, \code{SingleCellExperiment},
+#' or \code{ArchRProject} that was output from function \code{buildParentTree}.
+#' For multi-omic data, we recommend using \code{ArchRProject} objects.
 #' @param subtree_list A list containing the CHOIR records from each subtree.
 #' @param key The name under which CHOIR-related data for this run is stored in
-#' the object. Defaults to 'CHOIR'.
+#' the object. Defaults to CHOIR.
 #' @param alpha A numerical value indicating the significance level used for
 #' permutation test comparisons of cluster distinguishability. Defaults to 0.05.
-#' @param p_adjust A string indicating which multiple comparison
-#' adjustment to use. Permitted values are 'fdr', 'bonferroni', and 'none'.
-#' Defaults to 'bonferroni'.
-#' @param feature_set
-#' @param exclude_features
-#' @param n_iterations
-#' @param n_trees
-#' @param use_variance
-#' @param min_accuracy
-#' @param min_connections
-#' @param max_repeat_errors
-#' @param distance_approx
-#' @param distance_awareness
-#' @param collect_all_metrics
-#' @param sample_max
-#' @param downsampling_rate
-#' @param normalization_method
-#' @param batch_correction_method
-#' @param batch_labels
-#' @param use_assay
-#' @param input_matrix
-#' @param nn_matrix
-#' @param dist_matrix
-#' @param reduction
-#' @param n_cores
-#' @param random_seed
-#' @param verbose
+#' Decreasing the alpha value will yield more conservative clusters (fewer
+#' clusters) and will often decrease the computational time required, because
+#' fewer cluster comparisons may be needed.
+#' @param p_adjust A string indicating which multiple comparison adjustment
+#' method to use. Permitted values are “bonferroni”, “fdr”, and “none”. Defaults
+#' to “bonferroni”. Other correction methods may be less conservative,
+#' identifying more clusters, as CHOIR applies filters that reduce the total
+#' number of tests performed.
+#' @param feature_set A string indicating whether to train random forest
+#' classifiers on “all” features or only variable (“var”) features. Defaults to
+#' “var”. Computational time and memory required may increase if more features
+#' are used. Using all features instead of variable features may result in more
+#' conservative cluster calls.
+#' @param exclude_features A character vector indicating features that should be
+#' excluded from input to the random forest classifier. Defaults to \code{NULL},
+#' which means that no features will be excluded. This parameter can be used,
+#' for example, to exclude features correlated with cell quality, such as
+#' mitochondrial genes. Failure to exclude problematic features could result in
+#' clusters driven by cell quality, while over-exclusion of features could
+#' reduce the ability of CHOIR to distinguish cell populations that differ by
+#' those features.
+#' @param n_iterations A numerical value indicating the number of iterations run
+#' for each permutation test comparison. Increasing the number of iterations
+#' will approximately linearly increase the computational time required but
+#' provide a more accurate estimation of the significance of the permutation
+#' test. Decreasing the number of iterations runs the risk of leading to
+#' underclustering due to lack of statistical power. The default value, 100
+#' iterations, was selected because it avoids underclustering, while minimizing
+#' computational time and the diminishing returns from running CHOIR with
+#' additional iterations.
+#' @param n_trees A numerical value indicating the number of trees in each
+#' random forest. Defaults to 50. Increasing the number of trees is likely to
+#' increase the computational time required. Though not entirely predictable,
+#' increasing the number of trees up to a point may enable more nuanced
+#' distinctions, but is likely to provide diminishing returns.
+#' @param use_variance A Boolean value indicating whether to use the variance of
+#' the random forest accuracy scores as part of the permutation test threshold.
+#' Defaults to \code{TRUE}. Setting this parameter to \code{FALSE} will make
+#' CHOIR considerably less conservative, identifying more clusters, particularly
+#' on large datasets.
+#' @param min_accuracy A numerical value indicating the minimum accuracy
+#' required of the random forest classifier, below which clusters will be
+#' automatically merged. Defaults to 0.5, representing the random chance
+#' probability of assigning correct cluster labels; therefore, decreasing the
+#' minimum accuracy is not recommended. Increasing the minimum accuracy will
+#' lead to more conservative cluster assignments and will often decrease the
+#' computational time required, because fewer cluster comparisons may be needed.
+#' @param min_connections A numerical value indicating the minimum number of
+#' nearest neighbors between two clusters for those clusters to be considered
+#' adjacent. Non-adjacent clusters will not be merged. Defaults to 1. This
+#' threshold allows CHOIR to avoid running the full permutation test comparison
+#' for clusters that are highly likely to be distinct, saving computational
+#' time. Therefore, setting this parameter to 0 will increase the number of
+#' permutation test comparisons run and, thus, the computational time. The
+#' intent of this parameter is only to avoid running permutation test
+#' comparisons between clusters that are so different that they should not be
+#' merged. Therefore, we do not recommend increasing this parameter value
+#' beyond 10, as higher values may result in instances of overclustering.
+#' @param max_repeat_errors A numerical value indicating the maximum number of
+#' repeatedly mislabeled cells that will be taken into account during the
+#' permutation tests. This parameter is used to account for situations in which
+#' random forest classifier errors are concentrated among a few cells that are
+#' repeatedly misassigned. If set to 0, such repeat errors will not be
+#' evaluated. Defaults to 20. These situations are relatively infrequent, but
+#' setting this parameter to lower values (especially 0) may result in
+#' underclustering due to a small number of intermediate cells. Setting this
+#' parameter to higher values may lead to instances of overclustering and is not
+#' recommended.
+#' @param distance_approx A Boolean value indicating whether or not to use
+#' approximate distance calculations. Defaults to \code{TRUE}, which will use
+#' centroid-based distances. Setting distance approximation to \code{FALSE} will
+#' substantially increase the computational time and memory required,
+#' particularly for large datasets. Using approximated distances (\code{TRUE})
+#' rather than absolute distances (\code{FALSE}) is unlikely to have a
+#' meaningful effect on the distance thresholds imposed by CHOIR.
+#' @param distance_awareness A numerical value representing the distance
+#' threshold above which a cluster will not merge with another cluster and
+#' significance testing will not be used. Specifically, this value is a
+#' multiplier applied to the distance between a cluster and its closest
+#' distinguishable neighbor based on random forest comparison. Defaults to 2,
+#' which sets this threshold at a two-fold increase in distance over the closest
+#' distinguishable neighbor. This threshold allows CHOIR to avoid running the
+#' full permutation test comparison for clusters that are highly likely to be
+#' distinct, saving computational time. To omit all distance calculations and
+#' perform permutation testing on all comparisons, set this parameter to
+#' \code{FALSE}. Setting this parameter to \code{FALSE} or increasing the input
+#' value will increase the number of permutation test comparisons run and, thus,
+#' the computational time. In rare cases, very small distant clusters may be
+#' erroneously merged when distance thresholds are not used. The intent of this
+#' parameter is only to avoid running permutation test comparisons between
+#' clusters that are so different that they should not be merged. We do not
+#' recommend decreasing this parameter value below 1.5, as lower values may
+#' result in instances of overclustering.
+#' @param collect_all_metrics A Boolean value indicating whether to collect and
+#' save additional metrics from the random forest classifiers, including feature
+#' importances for every comparison. Defaults to \code{FALSE}. Setting this
+#' parameter to \code{TRUE} will slightly increase the computational time
+#' required. This parameter has no effect on the final cluster calls.
+#' @param sample_max A numerical value indicating the maximum number of cells to
+#' be sampled per cluster to train/test each random forest classifier. Defaults
+#' to \code{Inf} (infinity), which does not cap the number of cells used, so all
+#' cells will be used in all comparisons. Decreasing this parameter may decrease
+#' the computational time required, but may result in instances of
+#' underclustering. If input is provided to both the \code{downsampling_rate}
+#' and \code{sample_max} parameters, the minimum resulting cell number is
+#' calculated and used for each comparison.
+#' @param downsampling_rate A numerical value indicating the proportion of cells
+#' to be sampled per cluster to train/test each random forest classifier. For
+#' efficiency, the default value, "auto", sets the downsampling rate according
+#' to the dataset size. Decreasing this parameter may decrease the computational
+#' time required, but may also make the final cluster calls more conservative.
+#' If input is provided to both \code{downsampling_rate} and
+#' \code{sample_max parameters}, the minimum resulting cell number is calculated
+#' and used for each comparison.
+#' @param min_reads A numeric value used to filter out features prior to input
+#' to the random forest classifier. The default value, \code{NULL}, will filter
+#' out features with 0 counts for the current clusters being compared. Higher
+#' values should be used with caution, but may increase the signal-to-noise
+#' ratio encountered by the random forest classifiers.
+#' @param normalization_method A character string or vector indicating which
+#' normalization method to use. In general, input data should be supplied to
+#' CHOIR after normalization, except when the user wishes to use
+#' \code{Seurat SCTransform} normalization. Permitted values are “none” or
+#' “SCTransform”. Defaults to “none”. Because CHOIR has not been tested
+#' thoroughly with \code{SCTransform} normalization, we do not recommend this
+#' approach at this time. For multi-omic datasets, provide a vector with a value
+#' corresponding to each provided value of \code{use_assay} or
+#' \code{ArchR_matrix} in the same order.
+#' @param batch_correction_method A character string indicating which batch
+#' correction method to use. Permitted values are “Harmony” and “none”. Defaults
+#' to “none”. Batch correction should only be used when the different batches
+#' are not expected to also have unique cell types or cell states. Using batch
+#' correction would ensure that clusters do not originate from a single batch,
+#' thereby making the final cluster calls more conservative.
+#' @param batch_labels A character string that, if applying batch correction,
+#' specifies the name of the column in the input object metadata containing the
+#' batch labels. Defaults to \code{NULL}.
+#' @param use_assay For \code{Seurat} or \code{SingleCellExperiment} objects, a
+#' character string or vector indicating the assay(s) to use in the provided
+#' object. The default value, \code{NULL}, will choose the current active assay
+#' for \code{Seurat} objects and the \code{logcounts} assay for
+#' \code{SingleCellExperiment} objects.
+#' @param input_matrix An optional matrix containing the feature x cell data
+#' provided by the user, on which to train the random forest classifiers. By
+#' default, this parameter is set to \code{NULL}, and CHOIR will look for the
+#' feature x cell matri(ces) indicated by function \code{buildParentTree}. ##### TRUE??? #####
+#' @param nn_matrix An optional matrix containing the nearest neighbor adjacency
+#' of the cells, provided by the user. By default, this parameter is set to
+#' \code{NULL}, and CHOIR will look for the adjacency matri(ces) generated by
+#' function \code{buildParentTree}. ##### TRUE??? #####
+#' @param dist_matrix An optional distance matrix of cell to cell distances
+#' (based on dimensionality reduction cell embeddings), provided by the user. By
+#' default, this parameter is set to \code{NULL}, and CHOIR will look for the
+#' distance matri(ces) generated by function \code{buildParentTree}. ##### TRUE??? #####
+#' @param reduction An optional matrix of dimensionality reduction cell
+#' embeddings provided by the user for subsequent clustering steps. By default,
+#' this parameter is set to \code{NULL}, and CHOIR will look for the
+#' dimensionality reductions generated by function \code{buildParentTree()}.
+#' @param n_cores A numerical value indicating the number of cores to use for
+#' parallelization. By default, CHOIR will use the number of available cores
+#' minus 2. CHOIR is parallelized at the computation of permutation test
+#' iterations. Therefore, any number of cores up to the number of iterations
+#' will theoretically decrease the computational time required. In practice,
+#' 8–16 cores are recommended for datasets up to 500,000 cells.
+#' @param random_seed A numerical value indicating the random seed to be used.
+#' Defaults to 1. CHOIR uses randomization throughout the generation and pruning
+#' of the clustering tree. Therefore, changing the random seed may yield slight
+#' differences in the final cluster assignments.
+#' @param verbose A Boolean value indicating whether to use verbose output
+#' during the execution of CHOIR. Defaults to \code{TRUE}, but can be set to
+#' \code{FALSE} for a cleaner output.
 #'
-#' @return
-#' @export
 #'
-#' @examples
+#' ############ COUNTSPLIT??
+#'
+#'@return Returns the object with the following added data stored under the
+#' provided key: \describe{
+#' \item{clusters}{Final clusters, full hierarchical cluster tree, and
+#' stepwise cluster results for each progressive pruning step}
+#' \item{parameters}{Record of parameter values used}
+#' \item{records}{Metadata for decision points during hierarchical tree
+#' construction, all recorded permutation test comparisons, and feature
+#' importance scores from all comparisons}
+#' }
+#'
+#' @export
 combineTrees <- function(object,
 subtree_list,
 key = "CHOIR",
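
The `p_adjust`, `sample_max`, and `downsampling_rate` descriptions added in this commit can be illustrated with a short R sketch. It is not taken from the CHOIR source: the p-values are made up, `cells_per_cluster()` is a hypothetical helper, and rounding down with `floor()` is an assumption about how the sampled cell number might be derived. Only base R's `p.adjust()` is an existing, documented function.

```r
# Hypothetical p-values from four cluster comparisons (illustrative only).
p_values <- c(0.001, 0.012, 0.03, 0.2)
alpha <- 0.05

# Base R's p.adjust() accepts the three values the documentation permits
# for p_adjust: "bonferroni", "fdr", and "none".
p_bonferroni <- p.adjust(p_values, method = "bonferroni")
p_fdr <- p.adjust(p_values, method = "fdr")

# Count how many comparisons remain significant under each correction.
n_sig_bonferroni <- sum(p_bonferroni < alpha)
n_sig_fdr <- sum(p_fdr < alpha)

# Hypothetical helper (not a CHOIR function): when both limits are supplied,
# use the smaller of the two resulting cell numbers, as the documentation
# describes. The floor() rounding is an assumption.
cells_per_cluster <- function(n_cells, downsampling_rate = 1, sample_max = Inf) {
  min(floor(n_cells * downsampling_rate), sample_max)
}

cells_per_cluster(1000, downsampling_rate = 0.5, sample_max = 300)
cells_per_cluster(1000, downsampling_rate = 0.5)
```

With these illustrative values, Bonferroni leaves two of the four comparisons significant at alpha = 0.05 and FDR leaves three, in line with the documentation's note that other correction methods may be less conservative. The two helper calls return 300 and 500 cells, respectively.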
