I've been using UMAP for years now and it's a truly great library, but there is a general lack of tips for practical usage. I often see people make poor implementation decisions because they lack understanding of, or experience with, the class parameters, which severely undermines their work.
Of course, a lot depends on the specifics of the dataset. My primary use cases have been working with large and complex datasets in very high dimensional spaces.
Dimension Reduction
When the data manifold is complex and non-linear, UMAP for dimension reduction significantly outperforms traditional linear dimension reduction techniques. It's able to untangle these non-linearities and structure the reduced data much more usefully for downstream use cases like modeling, clustering, etc. The main question is how to choose n_components. I've found that using PCA to estimate the appropriate number of reduced dimensions works very well. Specifically, something like:
import numpy as np
from sklearn.decomposition import PCA
from umap import UMAP

P = PCA().fit(X)
n = np.where(np.cumsum(P.explained_variance_ratio_) >= 0.9)[0][0] + 1
U = UMAP(n_components=n).fit(X)
This uses PCA to find the number of dimensions needed to account for 90% of the dataset variance and sets this as the output dimensionality for UMAP. The exact number of components generally doesn't matter here; you just need to be in the right ballpark, and UMAP does a good job of accommodating whatever you pick.
Visualization
Using the technique above, I've found that a two-step dimension reduction process (first from high dimension to an intermediate dimensionality, and then again down to 2D) often produces better visualizations than a one-step process (from high dimension directly to 2D). Whether the final visualization is better depends a lot on the structure of the data and what information you're trying to emphasize, but in my experience it is better more often than not.
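A minimal sketch of that two-step pipeline, reusing the PCA-estimated n from the previous section as the intermediate dimensionality (that choice is just my assumption; any reasonable intermediate value works):

from umap import UMAP

U_mid = UMAP(n_components=n).fit_transform(X)      # high dimension -> intermediate
U_2d = UMAP(n_components=2).fit_transform(U_mid)   # intermediate -> 2D for plotting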
Scale n_neighbors
The default value of n_neighbors=15 is usually only useful for small datasets (tens of thousands of samples at most). Beyond that, UMAP will completely fail to grasp the broader global distribution and will only capture very local topology. This is especially true if your data is clumped into separate regions. Larger values of n_neighbors retain a better balance of local and global topology. Ideally you should scale n_neighbors proportionally with your dataset size (while trading off against the increased computation cost). I routinely use something like n_neighbors = int(0.001 * X.shape[0]) to set the number of neighbors to 0.1% of the dataset size. This handles the scaling automatically and ensures that UMAP produces comparable results even as the size of the dataset changes.
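For example, a minimal sketch of wiring that into UMAP (the 0.1% ratio and the floor at the library default of 15 are my own heuristics, not anything built into the library):

from umap import UMAP

n_neighbors = max(15, int(0.001 * X.shape[0]))  # 0.1% of samples, floored at the default of 15
U = UMAP(n_neighbors=n_neighbors).fit(X)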
Any other tips?