A question about the running time of UMAP #975

LaJollaClustering · 2023-02-27T22:21:46Z


Hi there:

I am trying to read the UMAP algorithm carefully because the idea is really great!

I have a quick question about the running time of UMAP. The paper, https://arxiv.org/pdf/1802.03426.pdf, claims that the empirical running time is N^{1.14}, which is due to the construction of the kNN graph.
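For readers unfamiliar with how an empirical exponent like N^{1.14} is obtained: a power law t = c * N^p is a straight line in log-log space, so p can be estimated as the slope of a linear fit. The sketch below uses synthetic timings generated from that exact law (they are stand-ins, not measurements from the paper or from UMAP), so it only illustrates the fitting step.

```python
import numpy as np

# Synthetic timings generated from t = c * N**1.14; stand-ins, not real measurements.
sizes = np.array([1_000, 10_000, 100_000, 1_000_000], dtype=float)
timings = 2e-5 * sizes ** 1.14

# log t = log c + p * log N, so the slope of a degree-1 fit estimates p.
slope, intercept = np.polyfit(np.log(sizes), np.log(timings), deg=1)
print(f"estimated exponent p = {slope:.2f}")
```

With real timings the points would scatter around the line, and the fitted slope would depend on the range of N that was measured.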

However, Algorithm 4 (spectral embedding) has to calculate the eigenvectors of an N x N matrix. I assume that A, D, and L are all N x N matrices. Did I make a mistake? If A, D, and L are N x N matrices, then the running time would be N^2, right?

jc-healy · 2023-02-27T23:13:07Z

Collaborator

Hi LaJollaClustering, the actual SpectralEmbedding we are doing in UMAP is on the centroids of the connected components of the UMAP complex (the rescaled k-nearest-neighbour graph). Its main advantage is that it initializes these disconnected components into half-reasonable locations within our space. Given that there are typically very few connected components, the computational cost is minimal. I hope that helps.
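A minimal sketch of the idea, assuming a brute-force kNN graph on toy data (this is an illustration, not UMAP's actual code, and all variable names are made up for the example): find the connected components of the graph and reduce each to its centroid, so that any spectral step on the centroids involves only as many points as there are components.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
# Two far-apart blobs, so the kNN graph has no edges between them.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
               rng.normal(100.0, 1.0, size=(100, 5))])
n_points, k = X.shape[0], 10

# Brute-force kNN (fine for a toy example; assumed here in place of an approximate search).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
np.fill_diagonal(sq_dists, np.inf)  # exclude each point from its own neighbour list
neighbors = np.argpartition(sq_dists, k, axis=1)[:, :k]

# Sparse adjacency matrix with one entry per (point, neighbour) pair.
rows = np.repeat(np.arange(n_points), k)
graph = coo_matrix((np.ones(n_points * k), (rows, neighbors.ravel())),
                   shape=(n_points, n_points)).tocsr()

# Treat edges as undirected and label each point with its component.
n_comp, labels = connected_components(graph, directed=False)
centroids = np.vstack([X[labels == c].mean(axis=0) for c in range(n_comp)])
print(n_points, "points ->", n_comp, "component centroids")
```

Any eigenvector computation on `centroids` is then an `n_comp` x `n_comp` problem rather than an N x N one.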


1 reply

LaJollaClustering Feb 28, 2023
Author

Thank you John. It is very helpful.

lmcinnes · 2023-02-27T23:16:59Z

Maintainer

The initialization via a spectral approach uses a sparse matrix -- and the expected number of non-zeros is roughly O(kN), so the complexity is lower. It is also worth noting that, in practical terms, the spectral initialization is almost free in comparison to the nearest neighbor and layout optimization phases.
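To make the O(kN) count concrete, here is a small sketch, assuming a brute-force kNN graph on random data and SciPy's `eigsh` (it is not UMAP's own implementation). The symmetrized adjacency matrix has at most 2kN non-zeros, and the smallest eigenvectors of the normalized Laplacian L = I - D^{-1/2} A D^{-1/2} are obtained as the largest eigenvectors of D^{-1/2} A D^{-1/2}.

```python
import numpy as np
from scipy.sparse import coo_matrix, diags
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(0)
n_points, dim, k, n_components = 300, 10, 15, 2
X = rng.normal(size=(n_points, dim))

# Brute-force kNN (toy-sized; an assumption made for the example).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
np.fill_diagonal(sq_dists, np.inf)
neighbors = np.argpartition(sq_dists, k, axis=1)[:, :k]

rows = np.repeat(np.arange(n_points), k)
A = coo_matrix((np.ones(n_points * k), (rows, neighbors.ravel())),
               shape=(n_points, n_points)).tocsr()
A = A.maximum(A.T)  # symmetrize: at most 2 * k * N stored entries
print(A.nnz, "non-zeros, bound", 2 * k * n_points, ", dense size", n_points ** 2)

# Normalized adjacency S = D^{-1/2} A D^{-1/2}; every degree is >= k, so no division by zero.
deg = np.asarray(A.sum(axis=1)).ravel()
D_inv_sqrt = diags(1.0 / np.sqrt(deg))
S = D_inv_sqrt @ A @ D_inv_sqrt

# Largest eigenpairs of S correspond to the smallest eigenpairs of L = I - S.
vals, vecs = eigsh(S, k=n_components + 1, which="LA", v0=rng.normal(size=n_points))
order = np.argsort(-vals)
# Drop the top eigenvector (eigenvalue 1), which is trivial when the graph is connected.
embedding = vecs[:, order[1:n_components + 1]]
```

Each Lanczos step costs one sparse matrix-vector product, i.e. work proportional to the number of non-zeros. For N = 10^7 and k = 15 the bound is 3 x 10^8 stored entries, against 10^14 for a dense N x N matrix.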

5 replies

LaJollaClustering Feb 28, 2023
Author

Thank you Leland. If I understand it correctly, the running time then depends on the speed of calculating eigenvectors for sparse matrices.

As far as I know, there is no theoretical analysis in this setting. And I believe the worst-case running time should be N^2. However, in practice, it seems that scipy is very good for many datasets.

I am asking because we are designing dimensionality reduction algorithms for really large-scale datasets. Really large means at least 10 million vectors. In this regime, we really care about the dependence on N.

lmcinnes Feb 28, 2023
Maintainer

I think the worst case might well be N^2, but I haven't looked at that in a long time, and there may well be constraints for the particular kinds of sparse structures we have that make it possible to get a lower worst case. If nothing else, I have done tens of millions of vectors, and while the spectral initialization cost is non-trivial by that point, it's still quite possible to constrain it. In particular, since most approaches for merely obtaining the top few eigenvectors (and we only need n_components many) are power-method based (block Lanczos or randomized versions), it is quite practical to simply cap the power iterations and take what you get. Since we only need this for initialization, an imperfect solution is fine. The code may actually even do this. Regardless, in that setting you need only do as much work as you wish.
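A minimal sketch of "cap the iterations and take what you get", written as plain subspace (block power) iteration in NumPy. This is a simpler relative of the block Lanczos and randomized methods mentioned above, not the routine UMAP calls; the function name and the toy matrix are made up for the example.

```python
import numpy as np

def capped_subspace_iteration(M, n_vectors, n_iter, seed=0):
    """Approximate the top eigenpairs of a symmetric M with a fixed iteration budget."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(M.shape[0], n_vectors)))
    for _ in range(n_iter):          # hard cap: no convergence test at all
        Q, _ = np.linalg.qr(M @ Q)   # multiply, then re-orthonormalize
    # Rayleigh-Ritz step: best eigenpair estimates within the current subspace.
    ritz_vals, W = np.linalg.eigh(Q.T @ (M @ Q))
    return ritz_vals[::-1], (Q @ W)[:, ::-1]  # largest first

# Toy symmetric matrix with known eigenvalues 10, 9, then values in [0.1, 1].
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.normal(size=(50, 50)))
true_vals = np.concatenate([[10.0, 9.0], np.linspace(1.0, 0.1, 48)])
M = (U * true_vals) @ U.T

rough_vals, rough_vecs = capped_subspace_iteration(M, n_vectors=2, n_iter=2)
good_vals, good_vecs = capped_subspace_iteration(M, n_vectors=2, n_iter=50)
print("2 iterations: ", rough_vals)
print("50 iterations:", good_vals)
```

The cost is `n_iter` matrix products regardless of how close the answer is, and the output is always an orthonormal set of vectors, which is all an initialization needs. With `scipy.sparse.linalg.eigsh` the comparable knobs are `maxiter` and `tol`; note that `eigsh` raises `ArpackNoConvergence` when the cap is hit rather than silently returning a partial result.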

LaJollaClustering Feb 28, 2023
Author

Thank you Leland for these detailed responses! BTW, I have also spent a lot of time on single-cell analysis using UMAP. From my experience, it seems that the performance of PCA + UMAP is much better than UMAP itself. If we use PCA to project the data down to (roughly) 10 - 50 PCs, UMAP then performs much better. I checked many single-cell datasets, and the observation is very consistent.

jlmelville Mar 1, 2023
Collaborator

"From my experience, it seems that the performance of PCA + UMAP is much better than UMAP itself."

Assuming you mean "faster" by better performance: the nearest neighbor search is the performance bottleneck in UMAP and in turn the nearest neighbor search is dominated by the distance calculation, which in most cases is O(D) where D is the dimensionality of the input space. So reducing the initial dimensionality by PCA will always pay off in terms of speed for any realistically-sized, high-dimensional dataset.

Also I know I pop up to say this fairly regularly but it bears repeating: arbitrarily reducing the initial dimensionality to 10-50 PCs is a very dangerous game to play with your data.
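As a rough illustration of both points, here is a NumPy-only PCA sketch on synthetic data (the helper `pca_reduce` and the data are assumptions made for the example). It shows the per-distance cost dropping from D coordinates to n_components coordinates, and reports the fraction of variance the retained PCs keep. That fraction is only one possible sanity check before settling on a number of PCs, not a guarantee that the neighbour structure is preserved.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components; also return the variance fraction kept."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return Xc @ Vt[:n_components].T, explained[:n_components].sum()

rng = np.random.default_rng(0)
# Synthetic data: 500 samples, 2000 features, signal in a 20-dimensional subspace plus noise.
latent = rng.normal(size=(500, 20))
X = latent @ rng.normal(size=(20, 2000)) + 0.1 * rng.normal(size=(500, 2000))

X_reduced, kept = pca_reduce(X, n_components=20)
print(X.shape[1], "->", X_reduced.shape[1], "coordinates per distance;",
      f"{kept:.1%} of variance kept")
```

On this synthetic data almost all variance survives because the signal was constructed to be 20-dimensional. On real data the kept fraction can be much lower, which is when an arbitrary cut at 10-50 PCs becomes risky.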

LaJollaClustering Mar 1, 2023
Author

I meant "higher accuracy" by better performance. I played with many single-cell RNA-seq datasets.

For some complicated datasets, UMAP itself mixed clusters a lot, but PCA+UMAP separated them VERY WELL!!! In general, PCA + UMAP (using 20-30 PCs) usually achieved better accuracy. Sometimes it is much better. I have the figures on my computer, but I do not know how to share them here.

One of the most popular single-cell RNA analysis toolkits (Seurat) also runs PCA before UMAP by default. I think this is a common belief in the single-cell RNA-seq area.

See the Seurat documentation here (https://satijalab.org/seurat/reference/runumap), which says, "reduction: Which dimensional reduction (PCA or ICA) to use for the UMAP input. Default is PCA"
