I am looking for some help setting parameters for a large dataset.
I have a large dataset of ~1.1M 768-dimensional vectors coming from a pre-trained language model (BERT), and I want to reduce them to 2 dimensions. When I run UMAP on a small subset of this data, I get very reasonable, intuitive-looking clusters, and this seems to hold no matter how the parameters are set.
However, when I scale up to the full dataset, I lose all discernible structure: everything collapses into one big blob. I have tried all combinations of the following (sketch below), with everything else left at its defaults:
n_neighbors = [15, 50, 100, 200]
min_dist = [0.1, 0.5, 0.9]
metric = ["euclidean", "cosine"]
I also tried one or two runs setting disconnection_distance to cut out outliers, but it didn't help (in those runs I also set init="random").
In general, I am wondering whether there are other parameters that can or should be explored for such a large dataset.
Does it make sense to increase n_neighbors proportionally with the size of the dataset?
You shouldn't have to increase n_neighbors proportionally to dataset size; settings that work for subsets should work reasonably well for the whole dataset. If that isn't the case there may be other issues. One possibility is that the plotting simply can't handle the full dataset. I don't know what you are using to examine/plot the data, but overplotting can easily produce an apparently uniform blob with that many datapoints. I would recommend umap.plot, or using datashader directly, for your plotting. If that isn't possible, then significantly decreasing the point size and using a very low alpha value may help a little. If plotting isn't the problem then it is perhaps a bug or other issue with UMAP itself, which may be harder to track down.
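For example (a rough sketch, not tested on your data; vectors again stands in for your embedding matrix):

import umap
import umap.plot  # needs the optional plotting extras (pandas, matplotlib, datashader, bokeh, holoviews)
import matplotlib.pyplot as plt

# umap.plot works from the fitted model rather than a raw coordinate
# array, and it rasterizes with datashader, so overplotting is handled
# sensibly even with millions of points.
mapper = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine").fit(vectors)
umap.plot.points(mapper)
plt.show()

# If you have to stick with plain matplotlib, shrink the markers and use
# a very low alpha so 1.1M points don't merge into a solid blob.
emb = mapper.embedding_
plt.scatter(emb[:, 0], emb[:, 1], s=0.1, alpha=0.01)
plt.show()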