From b014faf91a4d76a9057a4e183270fb35a23de014 Mon Sep 17 00:00:00 2001
From: Tom White
Date: Wed, 2 Oct 2019 10:14:17 +0100
Subject: [PATCH] Document reproducibility guarantees. Update 'Is there GPU or multicore-CPU support?' FAQ

---
 doc/faq.rst   | 48 ++++++++++++++++++++++++++++++------------------
 umap/umap_.py |  6 ++++--
 2 files changed, 34 insertions(+), 20 deletions(-)

diff --git a/doc/faq.rst b/doc/faq.rst
index 34635aa7..1338f55b 100644
--- a/doc/faq.rst
+++ b/doc/faq.rst
@@ -95,26 +95,38 @@ issues.
 Is there GPU or multicore-CPU support?
 --------------------------------------
 
-Not at this time. The bottlenecks in the code are the
-(approximate) nearest neighbor search and the optimization
-of the low dimensional representation. The first of these
-(ANN) is performed by a random projection forest and
-nearest-neighbor-descent. Both of those are, at the least,
-parellelisable in principle, and could be converted to
-support multicore (at the cost of single core performance).
-The optimization is performed via a (slightly custom)
-stochastic gradient descent. SGD is both parallelisable
-and amenable to GPUs. This means that in principle UMAP
-could support multicore and use GPUs for optimization.
-In practice this would involve GPU expertise and would
-potentially hurt single core performance, and so has
-been deferred for now. If you have expertise in GPU
-programming with Numba and would be interested in
-adding GPU support we would welcome your contributions.
-
 There is a UMAP implementation for GPU available in
 the NVIDIA RAPIDS cuML library, so if you need GPU
-support that is currently the best palce to go.
+support that is currently the best place to go.
+
+For multicore CPU, the two main bottlenecks in the code are the
+(approximate) nearest neighbor search and the optimization of the low
+dimensional representation. The first of these has a multicore implementation
+in the pynndescent library, which is used by UMAP if it is installed.
+Otherwise UMAP uses its own version of nearest neighbor search, which is not
+multicore. The second bottleneck, the optimization of the low dimensional
+representation is performed via a (slightly custom) stochastic gradient
+descent. SGD in UMAP can take advantage of multicore, but only if
+`random_state` is set to `None`, which is the default (as explained in the
+next question).
+
+Is the output of UMAP reproducible?
+-----------------------------------
+
+Yes, but not by default. The random seed used by UMAP is not set by default
+(`random_state` is set to `None`), so the resulting output embedding will
+change if run repeatedly on the same input. UMAP is a stochastic algorithm,
+so it is advisable to run it several times with no random seed set to confirm
+that the conclusions you draw from the output are not affected by the
+randomness in the algorithm. (Credit to Vito Zanotelli for this suggestion.)
+Then once you are happy with the results, fix the seed to ensure the output is
+reproducible. Having reproducible visual output is very useful to identically
+reproduce an image for a paper, or to provide others with code that will
+exactly reproduce your results.
+
+When `random_state` is `None` the algorithm runs faster since it can take
+advantage of multiple cores for some parts of the algorithm. This optimization
+is not possible in the current implementation when a seed is set.
 
 Can I add a custom loss function?
 ---------------------------------
diff --git a/umap/umap_.py b/umap/umap_.py
index 4de3949d..e7cefdb6 100644
--- a/umap/umap_.py
+++ b/umap/umap_.py
@@ -960,7 +960,7 @@ def simplicial_set_embedding(
 
     parallel: bool (optional, default False)
         Whether to run the computation using numba parallel.
-        Running in parallel is non-deterministic, and is not used
+        Running in parallel is non-deterministic, and should not be used
         if a random seed has been set, to ensure reproducibility.
 
     verbose: bool (optional, default False)
@@ -1257,7 +1257,9 @@ class UMAP(BaseEstimator):
         If int, random_state is the seed used by the random number generator;
         If RandomState instance, random_state is the random number generator;
         If None, the random number generator is the RandomState instance used
-        by `np.random`.
+        by `np.random`. Furthermore, when set to None, UMAP can make use
+        of multiple cores for some parts of the algorithm that means it runs
+        faster, but at the expense of reproducibility.
 
     metric_kwds: dict (optional, default None)
         Arguments to pass on to the metric, such as the ``p`` value for