Does my clustering pipeline make sense? #1068

daniel-a-diaz · 2023-10-13T22:20:16Z

I am building a surrogate model to predict the output of a micromechanical simulation of 3D pore defects produced in additive manufacturing. The simulation is too time-intensive to run on my entire dataset of pore volumes, so I plan to run it on a much smaller training set, train the surrogate model on those results, and then use the trained model to predict the stress fields for the rest of the dataset.

To build that smaller training set, I am using clustering followed by stratified sampling on the cluster labels, so that the training set is balanced and covers all the characteristic pore morphologies present. I am also using the clustering to analyze pore morphologies in general, and some kind of visualization would be useful for mapping values such as stress concentration factors onto the cluster embeddings.

The features I feed into the clustering pipeline come from an autoencoder. It is a 3D autoencoder, so the feature vectors are fairly large (512 dimensions). I first pass them through PCA retaining 99% of the variance, which reduces them to 115 components, and then on to UMAP and HDBSCAN. So the pipeline is:
AE (512 features) => StandardScaler() => PCA(0.99) => UMAP(n_components=?) => HDBSCAN.
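
A minimal sketch of the pipeline as described, for readers who want something concrete to discuss. It assumes the autoencoder features are already available as an array of shape (n_pores, 512) and that scikit-learn, umap-learn and hdbscan are installed. The function names `reduce_with_pca` and `embed_and_cluster` are made up for illustration, and `n_components=10` is only a placeholder, since that value is the open question here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def reduce_with_pca(features, variance=0.99):
    """Standardize each feature, then keep enough PCs to explain `variance`."""
    features = np.asarray(features, dtype=float)
    # A float n_components in (0, 1) with svd_solver="full" tells PCA to keep
    # the smallest number of components that reaches that variance fraction.
    reducer = make_pipeline(
        StandardScaler(),
        PCA(n_components=variance, svd_solver="full"),
    )
    return reducer.fit_transform(features)


def embed_and_cluster(reduced, n_components=10, n_neighbors=250,
                      min_cluster_size=1000, min_samples=400):
    """UMAP embedding followed by HDBSCAN; returns (embedding, labels)."""
    # Imported here so the PCA step above works without these two packages.
    import hdbscan
    import umap

    embedding = umap.UMAP(
        n_neighbors=n_neighbors,
        min_dist=0.0,
        n_components=n_components,
    ).fit_transform(reduced)
    labels = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size,
        min_samples=min_samples,
    ).fit_predict(embedding)
    return embedding, labels  # label -1 marks points HDBSCAN treats as noise
```

Calling `embed_and_cluster(reduce_with_pca(features))` would reproduce the chain above; the neighbor and cluster-size defaults are the values quoted later in this post.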

Does this clustering pipeline make sense starting from autoencoder feature vectors, or how could I improve it?

Also, I had a few other questions that would be great to get some advice on:

Is using StandardScaler() appropriate here?
The UMAP documentation suggests using n_components=10-20 for clustering rather than the 2 you would use for visualization. Is there a particular metric that can help me zero in on the right number of components? I have seen silhouette score mentioned; would that be useful here?
If I do use a larger number of components for the clustering, is there still a good way to get a visualization that I can map values onto?
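
For reference, one way the silhouette score mentioned above could be computed on HDBSCAN output is sketched below. It is a hedged illustration, not a recommendation: the helper name `silhouette_without_noise` is hypothetical, it assumes noise points carry the label -1, and it subsamples because the score relies on pairwise distances. Silhouette tends to reward compact, roughly convex clusters, so it may undervalue irregularly shaped density-based clusters.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def silhouette_without_noise(embedding, labels, sample_size=10_000, seed=0):
    """Silhouette score over the points that were assigned to a cluster."""
    embedding = np.asarray(embedding)
    labels = np.asarray(labels)
    keep = labels != -1  # HDBSCAN uses -1 for noise points
    if np.unique(labels[keep]).size < 2:
        return float("nan")  # silhouette needs at least two clusters
    # Subsample: the score needs pairwise distances, which is too costly
    # to compute over hundreds of thousands of points at once.
    n_kept = int(keep.sum())
    return silhouette_score(
        embedding[keep],
        labels[keep],
        sample_size=min(sample_size, n_kept),
        random_state=seed,
    )
```

Comparing this value across a few candidate n_components settings gives a rough guide at best, because each setting produces a different embedding space.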

This is what I get using my pipeline with n_components=2. The clusters seem mostly well defined, with decent separation. The dataset is 516,517 pores, embedded with UMAP(n_neighbors=250, min_dist=0, n_components=2) and clustered with HDBSCAN(min_cluster_size=1000, min_samples=400). Should I be happy with this result, or would it still be better to explore a larger number of components?

lmcinnes · 2023-10-15T14:13:18Z

Maintainer

Your pipeline makes some sense. Some comments: I am not sure you need the standard scaler, depending on exactly what your autoencoder is doing. You certainly don't need the PCA; UMAP should potentially work fine on 512 dimensions. Given that you are using an autoencoder, and are in a high dimensional case, it would be beneficial to use metric="cosine" for UMAP if you aren't already.

In terms of the number of components: if you want a good rule of thumb, the n_neighbors value is a decent starting point when clustering is your goal. In your case that's pretty high, so you could just start at 10. As for how to visualize: you can do two separate UMAP runs, one to a large n_components followed by clustering, and a second to 2 dimensions for visualization. You can then use the cluster label vector to color the points in the 2D visualization. This can also highlight whether there is structure that is getting squashed in 2D.
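
The two-run idea above can be sketched as follows. This is an illustration under assumptions rather than a definitive recipe: `cluster_then_visualize` is a hypothetical helper that accepts any objects exposing `fit_transform` and `fit_predict`, and the commented usage reuses the parameter values from the original question together with metric="cosine".

```python
import numpy as np


def cluster_then_visualize(features, cluster_reducer, viz_reducer, clusterer):
    """Cluster in one embedding, then return a separate 2-D embedding plus labels.

    `cluster_reducer` and `viz_reducer` need a `fit_transform` method and
    `clusterer` needs `fit_predict` (umap.UMAP and hdbscan.HDBSCAN qualify).
    """
    features = np.asarray(features, dtype=float)
    high_dim = cluster_reducer.fit_transform(features)  # run 1: for clustering
    labels = clusterer.fit_predict(high_dim)
    coords_2d = viz_reducer.fit_transform(features)     # run 2: plotting only
    if coords_2d.shape[1] != 2:
        raise ValueError("viz_reducer must produce 2 components")
    return coords_2d, labels


# Intended use with the real libraries (not executed here):
#   import umap, hdbscan
#   import matplotlib.pyplot as plt
#   coords, labels = cluster_then_visualize(
#       features,
#       umap.UMAP(n_components=10, n_neighbors=250, min_dist=0.0, metric="cosine"),
#       umap.UMAP(n_components=2, n_neighbors=250, min_dist=0.0, metric="cosine"),
#       hdbscan.HDBSCAN(min_cluster_size=1000, min_samples=400),
#   )
#   plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=1, cmap="Spectral")
```

Because the labels come from the higher-dimensional run, clusters that overlap in the 2D plot hint at structure the 2D projection has squashed together.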

Nonetheless, given the visualization you have and the granularity of clustering you want, I would say this looks like a pretty good result. If you have the compute for a few more experiments, it might be worth exploring the options to check that this kind of clustering is relatively stable. But if you just want to get to work on the further tasks, I see no reason this would not provide a reasonable baseline for your stratified sampling.
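
The stratified sampling step that these cluster labels would feed could look roughly like the sketch below. It is an assumption-laden example: the thread does not say how samples are allocated, so this version splits the budget in proportion to cluster size with a floor per cluster, and skips noise points (label -1) by default. The name `stratified_sample` and its parameters are hypothetical.

```python
import numpy as np


def stratified_sample(labels, n_total, min_per_cluster=1, seed=0,
                      include_noise=False):
    """Pick about `n_total` indices, split across clusters by their size."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    clusters = [c for c in np.unique(labels) if include_noise or c != -1]
    if not clusters:
        return np.array([], dtype=int)
    n_eligible = sum(int((labels == c).sum()) for c in clusters)
    chosen = []
    for c in clusters:
        members = np.flatnonzero(labels == c)
        # Proportional share, with a floor so small clusters stay represented
        # and a cap so we never ask for more points than the cluster has.
        quota = max(min_per_cluster, round(n_total * members.size / n_eligible))
        quota = min(quota, members.size)
        chosen.append(rng.choice(members, size=quota, replace=False))
    return np.sort(np.concatenate(chosen))
```

The returned indices select the pores to simulate; because of the floor and the rounding, the total can differ slightly from `n_total`.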


daniel-a-diaz Oct 17, 2023
Author

Thanks for the quick reply. This is exactly what I was looking for. One question I have about your response is on n_neighbors for UMAP. Did you mean 100, or indeed just 10 for my dataset of 500,000+ pores? And for HDBSCAN, is it worth exploring other metrics, or should the default 'euclidean' be just fine for my problem?
