|
6 | 6 | #' into a single compiled tree, which is then further pruned to standardize
|
7 | 7 | #' thresholds across each subtree.
|
8 | 8 | #'
|
9 |
| -#' @param object An object of class 'Seurat', 'SingleCellExperiment', or |
10 |
| -#' 'ArchRProject', output after running buildParentTree. |
| 9 | +#' @param object An object of class \code{Seurat}, \code{SingleCellExperiment}, |
| 10 | +#' or \code{ArchRProject} that was output from function \code{buildParentTree}. |
| 11 | +#' For multi-omic data, we recommend using \code{ArchRProject} objects. |
11 | 12 | #' @param subtree_list A list containing the CHOIR records from each subtree.
|
12 | 13 | #' @param key The name under which CHOIR-related data for this run is stored in
|
13 |
| -#' the object. Defaults to 'CHOIR'. |
| 14 | +#' the object. Defaults to “CHOIR”. |
14 | 15 | #' @param alpha A numerical value indicating the significance level used for
|
15 | 16 | #' permutation test comparisons of cluster distinguishability. Defaults to 0.05.
|
16 |
| -#' @param p_adjust A string indicating which multiple comparison |
17 |
| -#' adjustment to use. Permitted values are 'fdr', 'bonferroni', and 'none'. |
18 |
| -#' Defaults to 'bonferroni'. |
19 |
| -#' @param feature_set |
20 |
| -#' @param exclude_features |
21 |
| -#' @param n_iterations |
22 |
| -#' @param n_trees |
23 |
| -#' @param use_variance |
24 |
| -#' @param min_accuracy |
25 |
| -#' @param min_connections |
26 |
| -#' @param max_repeat_errors |
27 |
| -#' @param distance_approx |
28 |
| -#' @param distance_awareness |
29 |
| -#' @param collect_all_metrics |
30 |
| -#' @param sample_max |
31 |
| -#' @param downsampling_rate |
32 |
| -#' @param normalization_method |
33 |
| -#' @param batch_correction_method |
34 |
| -#' @param batch_labels |
35 |
| -#' @param use_assay |
36 |
| -#' @param input_matrix |
37 |
| -#' @param nn_matrix |
38 |
| -#' @param dist_matrix |
39 |
| -#' @param reduction |
40 |
| -#' @param n_cores |
41 |
| -#' @param random_seed |
42 |
| -#' @param verbose |
| 17 | +#' Decreasing the alpha value will yield more conservative clusters (fewer |
| 18 | +#' clusters) and will often decrease the computational time required, because |
| 19 | +#' fewer cluster comparisons may be needed. |
| 20 | +#' @param p_adjust A string indicating which multiple comparison adjustment |
| 21 | +#' method to use. Permitted values are “bonferroni”, “fdr”, and “none”. Defaults |
| 22 | +#' to “bonferroni”. Other correction methods may be less conservative, |
| 23 | +#' identifying more clusters, as CHOIR applies filters that reduce the total |
| 24 | +#' number of tests performed. |
| 25 | +#' @param feature_set A string indicating whether to train random forest |
| 26 | +#' classifiers on “all” features or only variable (“var”) features. Defaults to |
| 27 | +#' “var”. Computational time and memory required may increase if more features |
| 28 | +#' are used. Using all features instead of variable features may result in more |
| 29 | +#' conservative cluster calls. |
| 30 | +#' @param exclude_features A character vector indicating features that should be |
| 31 | +#' excluded from input to the random forest classifier. Defaults to \code{NULL}, |
| 32 | +#' which means that no features will be excluded. This parameter can be used, |
| 33 | +#' for example, to exclude features correlated with cell quality, such as |
| 34 | +#' mitochondrial genes. Failure to exclude problematic features could result in |
| 35 | +#' clusters driven by cell quality, while over-exclusion of features could |
| 36 | +#' reduce the ability of CHOIR to distinguish cell populations that differ by |
| 37 | +#' those features. |
| 38 | +#' @param n_iterations A numerical value indicating the number of iterations run |
| 39 | +#' for each permutation test comparison. Increasing the number of iterations |
| 40 | +#' will approximately linearly increase the computational time required but |
| 41 | +#' provide a more accurate estimation of the significance of the permutation |
| 42 | +#' test. Decreasing the number of iterations runs the risk of leading to |
| 43 | +#' underclustering due to lack of statistical power. The default value, 100 |
| 44 | +#' iterations, was selected because it avoids underclustering, while minimizing |
| 45 | +#' computational time and the diminishing returns from running CHOIR with |
| 46 | +#' additional iterations. |
| 47 | +#' @param n_trees A numerical value indicating the number of trees in each |
| 48 | +#' random forest. Defaults to 50. Increasing the number of trees is likely to |
| 49 | +#' increase the computational time required. Though not entirely predictable, |
| 50 | +#' increasing the number of trees up to a point may enable more nuanced |
| 51 | +#' distinctions, but is likely to provide diminishing returns. |
| 52 | +#' @param use_variance A Boolean value indicating whether to use the variance of |
| 53 | +#' the random forest accuracy scores as part of the permutation test threshold. |
| 54 | +#' Defaults to \code{TRUE}. Setting this parameter to \code{FALSE} will make |
| 55 | +#' CHOIR considerably less conservative, identifying more clusters, particularly |
| 56 | +#' on large datasets. |
| 57 | +#' @param min_accuracy A numerical value indicating the minimum accuracy |
| 58 | +#' required of the random forest classifier, below which clusters will be |
| 59 | +#' automatically merged. Defaults to 0.5, representing the random chance |
| 60 | +#' probability of assigning correct cluster labels; therefore, decreasing the |
| 61 | +#' minimum accuracy is not recommended. Increasing the minimum accuracy will |
| 62 | +#' lead to more conservative cluster assignments and will often decrease the |
| 63 | +#' computational time required, because fewer cluster comparisons may be needed. |
| 64 | +#' @param min_connections A numerical value indicating the minimum number of |
| 65 | +#' nearest neighbors between two clusters for those clusters to be considered |
| 66 | +#' adjacent. Non-adjacent clusters will not be merged. Defaults to 1. This |
| 67 | +#' threshold allows CHOIR to avoid running the full permutation test comparison |
| 68 | +#' for clusters that are highly likely to be distinct, saving computational |
| 69 | +#' time. Therefore, setting this parameter to 0 will increase the number of |
| 70 | +#' permutation test comparisons run and, thus, the computational time. The |
| 71 | +#' intent of this parameter is only to avoid running permutation test |
| 72 | +#' comparisons between clusters that are so different that they should not be |
| 73 | +#' merged. Therefore, we do not recommend increasing this parameter value |
| 74 | +#' beyond 10, as higher values may result in instances of overclustering. |
| 75 | +#' @param max_repeat_errors A numerical value indicating the maximum number of |
| 76 | +#' repeatedly mislabeled cells that will be taken into account during the |
| 77 | +#' permutation tests. This parameter is used to account for situations in which |
| 78 | +#' random forest classifier errors are concentrated among a few cells that are |
| 79 | +#' repeatedly misassigned. If set to 0, such repeat errors will not be |
| 80 | +#' evaluated. Defaults to 20. These situations are relatively infrequent, but |
| 81 | +#' setting this parameter to lower values (especially 0) may result in |
| 82 | +#' underclustering due to a small number of intermediate cells. Setting this |
| 83 | +#' parameter to higher values may lead to instances of overclustering and is not |
| 84 | +#' recommended. |
| 85 | +#' @param distance_approx A Boolean value indicating whether or not to use |
| 86 | +#' approximate distance calculations. Defaults to \code{TRUE}, which will use |
| 87 | +#' centroid-based distances. Setting distance approximation to \code{FALSE} will |
| 88 | +#' substantially increase the computational time and memory required, |
| 89 | +#' particularly for large datasets. Using approximated distances (\code{TRUE}) |
| 90 | +#' rather than absolute distances (\code{FALSE}) is unlikely to have a |
| 91 | +#' meaningful effect on the distance thresholds imposed by CHOIR. |
| 92 | +#' @param distance_awareness A numerical value representing the distance |
| 93 | +#' threshold above which a cluster will not merge with another cluster and |
| 94 | +#' significance testing will not be used. Specifically, this value is a |
| 95 | +#' multiplier applied to the distance between a cluster and its closest |
| 96 | +#' distinguishable neighbor based on random forest comparison. Defaults to 2, |
| 97 | +#' which sets this threshold at a two-fold increase in distance over the closest |
| 98 | +#' distinguishable neighbor. This threshold allows CHOIR to avoid running the |
| 99 | +#' full permutation test comparison for clusters that are highly likely to be |
| 100 | +#' distinct, saving computational time. To omit all distance calculations and |
| 101 | +#' perform permutation testing on all comparisons, set this parameter to |
| 102 | +#' \code{FALSE}. Setting this parameter to \code{FALSE} or increasing the input |
| 103 | +#' value will increase the number of permutation test comparisons run and, thus, |
| 104 | +#' the computational time. In rare cases, very small distant clusters may be |
| 105 | +#' erroneously merged when distance thresholds are not used. The intent of this |
| 106 | +#' parameter is only to avoid running permutation test comparisons between |
| 107 | +#' clusters that are so different that they should not be merged. We do not |
| 108 | +#' recommend decreasing this parameter value below 1.5, as lower values may |
| 109 | +#' result in instances of overclustering. |
| 110 | +#' @param collect_all_metrics A Boolean value indicating whether to collect and |
| 111 | +#' save additional metrics from the random forest classifiers, including feature |
| 112 | +#' importances for every comparison. Defaults to \code{FALSE}. Setting this |
| 113 | +#' parameter to \code{TRUE} will slightly increase the computational time |
| 114 | +#' required. This parameter has no effect on the final cluster calls. |
| 115 | +#' @param sample_max A numerical value indicating the maximum number of cells to |
| 116 | +#' be sampled per cluster to train/test each random forest classifier. Defaults |
| 117 | +#' to \code{Inf} (infinity), which does not cap the number of cells used, so all |
| 118 | +#' cells will be used in all comparisons. Decreasing this parameter may decrease |
| 119 | +#' the computational time required, but may result in instances of |
| 120 | +#' underclustering. If input is provided to both the \code{downsampling_rate} |
| 121 | +#' and \code{sample_max} parameters, the minimum resulting cell number is |
| 122 | +#' calculated and used for each comparison. |
| 123 | +#' @param downsampling_rate A numerical value indicating the proportion of cells |
| 124 | +#' to be sampled per cluster to train/test each random forest classifier. For |
| 125 | +#' efficiency, the default value, "auto", sets the downsampling rate according |
| 126 | +#' to the dataset size. Decreasing this parameter may decrease the computational |
| 127 | +#' time required, but may also make the final cluster calls more conservative. |
| 128 | +#' If input is provided to both \code{downsampling_rate} and |
| 129 | +#' \code{sample_max} parameters, the minimum resulting cell number is calculated |
| 130 | +#' and used for each comparison. |
| 131 | +#' @param min_reads A numeric value used to filter out features prior to input |
| 132 | +#' to the random forest classifier. The default value, \code{NULL}, will filter |
| 133 | +#' out features with 0 counts for the current clusters being compared. Higher |
| 134 | +#' values should be used with caution, but may increase the signal-to-noise |
| 135 | +#' ratio encountered by the random forest classifiers. |
| 136 | +#' @param normalization_method A character string or vector indicating which |
| 137 | +#' normalization method to use. In general, input data should be supplied to |
| 138 | +#' CHOIR after normalization, except when the user wishes to use |
| 139 | +#' \code{Seurat SCTransform} normalization. Permitted values are “none” or |
| 140 | +#' “SCTransform”. Defaults to “none”. Because CHOIR has not been tested |
| 141 | +#' thoroughly with \code{SCTransform} normalization, we do not recommend this |
| 142 | +#' approach at this time. For multi-omic datasets, provide a vector with a value |
| 143 | +#' corresponding to each provided value of \code{use_assay} or |
| 144 | +#' \code{ArchR_matrix} in the same order. |
| 145 | +#' @param batch_correction_method A character string indicating which batch |
| 146 | +#' correction method to use. Permitted values are “Harmony” and “none”. Defaults |
| 147 | +#' to “none”. Batch correction should only be used when the different batches |
| 148 | +#' are not expected to also have unique cell types or cell states. Using batch |
| 149 | +#' correction would ensure that clusters do not originate from a single batch, |
| 150 | +#' thereby making the final cluster calls more conservative. |
| 151 | +#' @param batch_labels A character string that, if applying batch correction, |
| 152 | +#' specifies the name of the column in the input object metadata containing the |
| 153 | +#' batch labels. Defaults to \code{NULL}. |
| 154 | +#' @param use_assay For \code{Seurat} or \code{SingleCellExperiment} objects, a |
| 155 | +#' character string or vector indicating the assay(s) to use in the provided |
| 156 | +#' object. The default value, \code{NULL}, will choose the current active assay |
| 157 | +#' for \code{Seurat} objects and the \code{logcounts} assay for |
| 158 | +#' \code{SingleCellExperiment} objects. |
| 159 | +#' @param input_matrix An optional matrix containing the feature x cell data |
| 160 | +#' provided by the user, on which to train the random forest classifiers. By |
| 161 | +#' default, this parameter is set to \code{NULL}, and CHOIR will look for the |
| 162 | +#' feature x cell matri(ces) indicated by function \code{buildParentTree}. ##### TRUE??? ##### |
| 163 | +#' @param nn_matrix An optional matrix containing the nearest neighbor adjacency |
| 164 | +#' of the cells, provided by the user. By default, this parameter is set to |
| 165 | +#' \code{NULL}, and CHOIR will look for the adjacency matri(ces) generated by |
| 166 | +#' function \code{buildParentTree}. ##### TRUE??? ##### |
| 167 | +#' @param dist_matrix An optional distance matrix of cell to cell distances |
| 168 | +#' (based on dimensionality reduction cell embeddings), provided by the user. By |
| 169 | +#' default, this parameter is set to \code{NULL}, and CHOIR will look for the |
| 170 | +#' distance matri(ces) generated by function \code{buildParentTree}. ##### TRUE??? ##### |
| 171 | +#' @param reduction An optional matrix of dimensionality reduction cell |
| 172 | +#' embeddings provided by the user for subsequent clustering steps. By default, |
| 173 | +#' this parameter is set to \code{NULL}, and CHOIR will look for the |
| 174 | +#' dimensionality reductions generated by function \code{buildParentTree}. |
| 175 | +#' @param n_cores A numerical value indicating the number of cores to use for |
| 176 | +#' parallelization. By default, CHOIR will use the number of available cores |
| 177 | +#' minus 2. CHOIR is parallelized at the computation of permutation test |
| 178 | +#' iterations. Therefore, any number of cores up to the number of iterations |
| 179 | +#' will theoretically decrease the computational time required. In practice, |
| 180 | +#' 8–16 cores are recommended for datasets up to 500,000 cells. |
| 181 | +#' @param random_seed A numerical value indicating the random seed to be used. |
| 182 | +#' Defaults to 1. CHOIR uses randomization throughout the generation and pruning |
| 183 | +#' of the clustering tree. Therefore, changing the random seed may yield slight |
| 184 | +#' differences in the final cluster assignments. |
| 185 | +#' @param verbose A Boolean value indicating whether to use verbose output |
| 186 | +#' during the execution of CHOIR. Defaults to \code{TRUE}, but can be set to |
| 187 | +#' \code{FALSE} for a cleaner output. |
43 | 188 | #'
|
44 |
| -#' @return |
45 |
| -#' @export |
46 | 189 | #'
|
47 |
| -#' @examples |
| 190 | +#' ############ COUNTSPLIT?? |
| 191 | +#' |
| 192 | +#' @return Returns the object with the following added data stored under the |
| 193 | +#' provided key: \describe{ |
| 194 | +#' \item{clusters}{Final clusters, full hierarchical cluster tree, and |
| 195 | +#' stepwise cluster results for each progressive pruning step} |
| 196 | +#' \item{parameters}{Record of parameter values used} |
| 197 | +#' \item{records}{Metadata for decision points during hierarchical tree |
| 198 | +#' construction, all recorded permutation test comparisons, and feature |
| 199 | +#' importance scores from all comparisons} |
| 200 | +#' } |
| 201 | +#' |
| 202 | +#' @export |
48 | 203 | combineTrees <- function(object,
|
49 | 204 | subtree_list,
|
50 | 205 | key = "CHOIR",
|
|
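The `sample_max` and `downsampling_rate` descriptions above both say that, when the two are supplied together, the smaller resulting cell number is used for each comparison. A minimal sketch of that rule follows. The helper `cells_to_sample` is hypothetical and is not part of CHOIR, and rounding down is an assumption made only for this illustration.

```r
# Hypothetical helper, not part of the CHOIR API: combines the two caps by
# taking the smaller resulting cell number, as the parameter docs describe.
cells_to_sample <- function(n_cells, downsampling_rate = 1, sample_max = Inf) {
  # Rounding down is an assumption made for this sketch.
  by_rate <- floor(n_cells * downsampling_rate)
  min(by_rate, sample_max)
}

# Downsampling rate alone: half of a 1000-cell cluster.
cells_to_sample(1000, downsampling_rate = 0.5)

# Both supplied: the sample_max cap is smaller, so it wins.
cells_to_sample(1000, downsampling_rate = 0.5, sample_max = 200)
```

With the default `sample_max = Inf`, the cap never applies, which matches the documented behavior that all cells are used unless a limit is set.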
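The `p_adjust` description states that corrections other than Bonferroni may be less conservative. The sketch below illustrates this with base R's `stats::p.adjust` on a made-up vector of p-values. It does not reproduce how CHOIR applies the correction internally, and the p-values are illustrative only.

```r
# Illustrative p-values from five hypothetical cluster comparisons.
alpha <- 0.05
p_values <- c(0.001, 0.012, 0.02, 0.045, 0.3)

# The three values that the p_adjust parameter permits.
methods <- c("bonferroni", "fdr", "none")

# Count the comparisons that remain significant under each adjustment.
n_significant <- sapply(methods, function(m) {
  sum(p.adjust(p_values, method = m) < alpha)
})

print(n_significant)
```

For these values, Bonferroni keeps the fewest comparisons significant and "none" keeps the most, which is the ordering the parameter description implies.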