I've been using UMAP for years now and it's a truly great library, but there is a general lack of tips for practical usage. I often see people make poor implementation decisions because they lack understanding of, or experience with, the class parameters, which severely undermines their work.
Of course, a lot depends on the specifics of the dataset. My primary use cases have been working with large and complex datasets in very high dimensional spaces.
Dimension Reduction
When the data manifold is complex and non-linear, UMAP for dimension reduction significantly outperforms traditional linear dimension reduction techniques. It's able to untangle these non-linearities and structure the reduced data much more usefully for downstream use cases like modeling, clustering, etc. The main question is how to choose n_components. I've found that using PCA to estimate the appropriate number of reduced dimensions works very well. Specifically, something like:
import numpy as np
from sklearn.decomposition import PCA
from umap import UMAP

P = PCA().fit(X)
n = np.where(np.cumsum(P.explained_variance_ratio_) >= 0.9)[0][0] + 1
U = UMAP(n_components=n).fit(X)
This uses PCA to find the number of dimensions needed to account for 90% of the dataset variance and sets this as the output dimensionality for UMAP. The exact number of components generally doesn't matter here; you just need to be in the right ballpark, and UMAP does a good job of accommodating whatever you pick.
Visualization
Using the technique above, I've found that a two-step dimension reduction process (first from high dimension to an intermediate dimensionality, and then again down to 2D) often produces better visualizations than a one-step process (from high dimension directly to 2D). Whether the final visualization is better depends a lot on the structure of the data and what information you're trying to emphasize, but in my experience it is better more often than not.
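A minimal sketch of that two-step pipeline, reusing the PCA-estimated n from the previous section as the intermediate dimensionality (that choice is just my assumption; any reasonable intermediate value works):

from umap import UMAP

U_mid = UMAP(n_components=n).fit_transform(X)      # high dimension -> intermediate
U_2d = UMAP(n_components=2).fit_transform(U_mid)   # intermediate -> 2D for plotting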
Scale n_neighbors
The default value of n_neighbors=15 is usually only useful for small datasets (tens of thousands of samples at most). Beyond that, UMAP will completely fail to grasp the broader global distribution and will only capture very local topology. This is especially true if your data is clumped into separate regions. Larger values of n_neighbors retain a better balance of local and global topology. Ideally you should scale n_neighbors proportionally with your dataset size (while trading off against the increased computation cost). I routinely use something like n_neighbors = int(0.001 * X.shape[0]) to set the number of neighbors to 0.1% of the dataset size. This handles the scaling automatically and ensures that UMAP produces comparable results even as the size of the dataset changes.
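For example, a minimal sketch of wiring that into UMAP (the 0.1% ratio and the floor at the library default of 15 are my own heuristics, not anything built into the library):

from umap import UMAP

n_neighbors = max(15, int(0.001 * X.shape[0]))  # 0.1% of samples, floored at the default of 15
U = UMAP(n_neighbors=n_neighbors).fit(X)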
Any other tips?