Document reproducibility (#298) #304

Open: wants to merge 1 commit into base: 0.4dev

48 changes: 30 additions & 18 deletions doc/faq.rst
@@ -95,26 +95,38 @@ issues.
 Is there GPU or multicore-CPU support?
 --------------------------------------

-Not at this time. The bottlenecks in the code are the
-(approximate) nearest neighbor search and the optimization
-of the low dimensional representation. The first of these
-(ANN) is performed by a random projection forest and
-nearest-neighbor-descent. Both of those are, at the least,
-parellelisable in principle, and could be converted to
-support multicore (at the cost of single core performance).
-The optimization is performed via a (slightly custom)
-stochastic gradient descent. SGD is both parallelisable
-and amenable to GPUs. This means that in principle UMAP
-could support multicore and use GPUs for optimization.
-In practice this would involve GPU expertise and would
-potentially hurt single core performance, and so has
-been deferred for now. If you have expertise in GPU
-programming with Numba and would be interested in
-adding GPU support we would welcome your contributions.
-
 There is a UMAP implementation for GPU available in
 the NVIDIA RAPIDS cuML library, so if you need GPU
-support that is currently the best palce to go.
+support that is currently the best place to go.

+For multicore CPU, the two main bottlenecks in the code are the
+(approximate) nearest neighbor search and the optimization of the low
+dimensional representation. The first of these has a multicore implementation
+in the pynndescent library, which is used by UMAP if it is installed.
+Otherwise UMAP uses its own version of nearest neighbor search, which is not
+multicore. The second bottleneck, the optimization of the low dimensional
+representation is performed via a (slightly custom) stochastic gradient
+descent. SGD in UMAP can take advantage of multicore, but only if
+`random_state` is set to `None`, which is the default (as explained in the
+next question).
+
+Is the output of UMAP reproducible?
+-----------------------------------
+
+Yes, but not by default. The random seed used by UMAP is not set by default
+(`random_state` is set to `None`), so the resulting output embedding will
+change if run repeatedly on the same input. UMAP is a stochastic algorithm,
+so it is advisable to run it several times with no random seed set to confirm
+that the conclusions you draw from the output are not affected by the
+randomness in the algorithm. (Credit to Vito Zanotelli for this suggestion.)
+Then once you are happy with the results, fix the seed to ensure the output is
+reproducible. Having reproducible visual output is very useful to identically
+reproduce an image for a paper, or to provide others with code that will
+exactly reproduce your results.
+
+When `random_state` is `None` the algorithm runs faster since it can take
+advantage of multiple cores for some parts of the algorithm. This optimization
+is not possible in the current implementation when a seed is set.
+
 Can I add a custom loss function?
 ---------------------------------
6 changes: 4 additions & 2 deletions umap/umap_.py
@@ -960,7 +960,7 @@ def simplicial_set_embedding(

 parallel: bool (optional, default False)
 Whether to run the computation using numba parallel.
-Running in parallel is non-deterministic, and is not used
+Running in parallel is non-deterministic, and should not be used
 if a random seed has been set, to ensure reproducibility.

 verbose: bool (optional, default False)
@@ -1257,7 +1257,9 @@ class UMAP(BaseEstimator):
 If int, random_state is the seed used by the random number generator;
 If RandomState instance, random_state is the random number generator;
 If None, the random number generator is the RandomState instance used
-by `np.random`.
+by `np.random`. Furthermore, when set to None, UMAP can make use
+of multiple cores for some parts of the algorithm that means it runs
+faster, but at the expense of reproducibility.
sleighsoft marked this conversation as resolved.

 metric_kwds: dict (optional, default None)
 Arguments to pass on to the metric, such as the ``p`` value for
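
The workflow the FAQ change recommends (run several times unseeded, then fix the seed) can be illustrated with a small standard-library sketch. This is not UMAP code: `toy_sgd_embedding` is a hypothetical stand-in for a stochastic layout optimizer, written only to show how a `random_state` argument of `None` versus an integer affects repeatability. With UMAP itself the analogous choice is the `random_state` parameter documented in the diff above.

```python
import random


def toy_sgd_embedding(points, n_epochs=50, random_state=None):
    """Hypothetical 1-D stochastic layout (illustration only, not UMAP)."""
    # random_state=None seeds the generator from system entropy,
    # so every call starts from a different random state.
    rng = random.Random(random_state)

    # Random initial positions, one per input point.
    embedding = [rng.uniform(-1.0, 1.0) for _ in points]

    for epoch in range(n_epochs):
        lr = 0.1 * (1 - epoch / n_epochs)  # linearly decaying step size
        # Sample one random pair per epoch (the stochastic part).
        i, j = rng.sample(range(len(points)), 2)
        target = abs(points[i] - points[j])
        diff = embedding[i] - embedding[j]
        direction = 1.0 if diff >= 0 else -1.0
        # Move the pair so its embedded distance approaches the target.
        step = lr * (abs(diff) - target) * direction
        embedding[i] -= step
        embedding[j] += step

    return embedding


data = [0.0, 1.0, 2.0, 5.0, 9.0]

# Exploration: unseeded runs generally differ from one another.
unseeded_a = toy_sgd_embedding(data)
unseeded_b = toy_sgd_embedding(data)

# Final figure: a fixed seed gives the same result on every run.
seeded_a = toy_sgd_embedding(data, random_state=42)
seeded_b = toy_sgd_embedding(data, random_state=42)
print("seeded runs identical:", seeded_a == seeded_b)
print("unseeded runs identical:", unseeded_a == unseeded_b)
```

The sketch runs on a single thread, so it only covers the seeding half of the trade-off. The speed-up from multiple cores when no seed is set is specific to UMAP's implementation and is not modeled here.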