Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lets umap run in parallel #3295

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 10 additions & 1 deletion src/scanpy/tools/_umap.py
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ def umap(
key_added: str | None = None,
neighbors_key: str = "neighbors",
copy: bool = False,
parallel: bool = False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have the n_jobs: int | None convention for this (with None meaning sc.settings.N_JOBS), but simplicial_set_embedding just passes parallel on to numba.

We should think about how the two integrate before we add this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is numba parallel, we can only set it to use everything you got or nothing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that’s why we should talk about the parameter name. IIRC that would be the first parallelization parameter not called n_jobs.

) -> AnnData | None:
"""\
Embed the neighborhood graph using UMAP :cite:p:`McInnes2018`.
Expand Down Expand Up @@ -146,7 +147,8 @@ def umap(
:attr:`~anndata.AnnData.obsp`\\ ``[.uns[neighbors_key]['connectivities_key']]`` for connectivities.
copy
Return a copy instead of writing to adata.

parallel
Copy link
Contributor

@ilan-gold ilan-gold Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been tested? It seems like if the underlying implementation would error out every time a random seed is set, and we set a random seed, then setting this argument as true won't work?

I don't know where I got that this should error, but then we should definitely warn users about this sort of thing. Reproducibility is very important

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand:

sc.tl.umap(adata, parallel=True, random_state=42)

works so I think this needs a test + updated comment to reflect whatever is supposed to be going on here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As does

sc.tl.umap(adata, parallel=True, random_state=np.random.RandomState(42))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test to see if it errors (it doesn't)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we should also warn users about this random state business. And check that the warning is raised every time parallel is True because this is very important!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! But then we should definitely warn in that case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a warning

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok there is in issue with check_random_state as it transforms None into a seed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok the function is bugged on the umap side. If you force None as the random seed, it errors. I'll make an issue with umap for clarification. I'll but the PR on ice while i wait for an update.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether to run the computation using numba parallel. Running in parallel is non-deterministic.
Returns
-------
Returns `None` if `copy=False`, else returns an `AnnData` object. Sets the following fields:
Expand Down Expand Up @@ -214,6 +216,12 @@ def umap(
# for the init condition in the UMAP embedding
default_epochs = 500 if neighbors["connectivities"].shape[0] <= 10000 else 200
n_epochs = default_epochs if maxiter is None else maxiter
if parallel and random_state is not None:
warnings.warn(
"Parallel execution was expected to be disabled when both `parallel=True` and `random_state` are set, "
"to ensure reproducibility. However, parallel execution still seems to occur, which may lead to "
"non-deterministic results."
)
Comment on lines +219 to +224
Copy link
Contributor

@ilan-gold ilan-gold Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link to the umap repo to give user more context. Something like "UMAP reports that parallel execution should error with a random seed to ensure reproducibility. Parallel execution may occur nonetheless, which can lead to non-deterministic results." with a link to their docs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then make sure this warning is raised in tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update this once i have clarification from lmcinnes/umap#1155

X_umap, _ = simplicial_set_embedding(
data=X,
graph=neighbors["connectivities"].tocoo(),
Expand All @@ -232,6 +240,7 @@ def umap(
densmap_kwds={},
output_dens=False,
verbose=settings.verbosity > 3,
parallel=parallel,
)
elif method == "rapids":
msg = (
Expand Down
6 changes: 6 additions & 0 deletions tests/test_embedding.py
Original file line number Diff line number Diff line change
Expand Up @@ -88,3 +88,9 @@ def test_diffmap():
sc.tl.diffmap(pbmc, random_state=1234)
d3 = pbmc.obsm["X_diffmap"].copy()
assert_raises(AssertionError, assert_array_equal, d1, d3)


def test_umap_parallel_randomstate():
pbmc = pbmc68k_reduced()[:100, :].copy()
sc.tl.umap(pbmc, parallel=True, random_state=42)
sc.tl.umap(pbmc, parallel=True, random_state=np.random.RandomState(42))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one action per test. This should be parametrized. Also, we don't need both of these given what I said above. We should just check for the warning with random_state as None or not-None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I updated the test already to catch the warning

Loading