I am looking for some help setting parameters for a large dataset.
I have a large dataset of ~1.1M 768-dimensional vectors coming from a pre-trained language model (BERT), and I want to reduce them to 2 dimensions. When I run UMAP on a small subset of this data, I get very reasonable, intuitive-looking clusters, and this seems to hold no matter how the parameters are set.
However, when I scale up to the full dataset, I lose all discernible structure: everything collapses into one big blob. I have tried all combinations of the following (sketch below), with everything else left at its defaults:
n_neighbors = [15, 50, 100, 200]
min_dist = [0.1, 0.5, 0.9]
metric = ["euclidean", "cosine"]
I also tried one or two runs setting disconnection_distance to cut out outliers, but it didn't help (in those runs I also set init="random").
In general, I am wondering whether there are other parameters that can or should be explored for such a large dataset.
Does it make sense to increase n_neighbors proportionally with the size of the dataset?
You shouldn't have to increase n_neighbors proportionally to dataset size; settings that work for subsets should work reasonably well for the whole dataset. If that isn't the case there may be other issues. One possibility is that the plotting simply can't handle the full dataset. I don't know what you are using to examine/plot the data, but overplotting can easily produce an apparently uniform blob with that many datapoints. I would recommend umap.plot, or using datashader directly, for your plotting. If that isn't possible, then significantly decreasing the point size and using a very low alpha value may help a little. If plotting isn't the problem then it is perhaps a bug or other issue with UMAP itself, which may be harder to track down.
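For example (a rough sketch, not tested on your data; vectors again stands in for your embedding matrix):

import umap
import umap.plot  # needs the optional plotting extras (pandas, matplotlib, datashader, bokeh, holoviews)
import matplotlib.pyplot as plt

# umap.plot works from the fitted model rather than a raw coordinate
# array, and it rasterizes with datashader, so overplotting is handled
# sensibly even with millions of points.
mapper = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine").fit(vectors)
umap.plot.points(mapper)
plt.show()

# If you have to stick with plain matplotlib, shrink the markers and use
# a very low alpha so 1.1M points don't merge into a solid blob.
emb = mapper.embedding_
plt.scatter(emb[:, 0], emb[:, 1], s=0.1, alpha=0.01)
plt.show()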