Does my clustering pipeline make sense? #1068
-
Your pipeline makes some sense. Some comments:

I am not sure you need the standard scaler, depending on exactly what your autoencoder is doing. You certainly don't need the PCA; UMAP should work fine on 512 dimensions. Given that you are using an autoencoder, and are in a high dimensional case, it would be beneficial to use

In terms of the number of components: as a rule of thumb, setting n_components to the n_neighbors value is a decent starting point if clustering is your goal. In your case that's pretty high (250), so you could just start at 10.

As for how to visualize: you can do two separate UMAP runs, one to a large n_components followed by clustering, and a second to 2 dimensions for visualization. You can then use the cluster label vector to color the points in the 2D visualization. This can also highlight whether there is structure that is getting squashed in 2D.

Nonetheless, given the visualization you have and the granularity of clustering you want, I would say this looks like a pretty good result. If you have the compute for a few more experiments, it might be worth exploring the options to check that this kind of clustering is relatively stable. If you just want to get to work on the further tasks, I see no reason why this would not provide a reasonable baseline for your stratified sampling.
-
I am working on a surrogate model that will predict the output of a micromechanical simulation on 3D pore defects produced in Additive Manufacturing. The simulation is time intensive and impractical to run on my entire large dataset of pore volumes, so I run it on a much smaller training dataset, train the surrogate model on that, and then use the trained model to predict the stress field outputs for the rest of the dataset.

To create that smaller training dataset I am using clustering and stratified sampling based on cluster labels, to make sure I get a balanced training dataset covering all the different characteristic pore morphologies present. I'm also using the clustering to analyze pore morphologies in general, and some type of visualization would be useful for mapping values like stress concentration factors onto the cluster embeddings.

To get the features for my clustering pipeline I am using an autoencoder. It is a 3D autoencoder, so the feature vectors are fairly large at 512 dimensions. I first pass them through PCA retaining 99% of the variance, which brings them down to 115 components, and then on to UMAP and HDBSCAN. So the pipeline is:
AE(512 components) => StandardScaler() => PCA(0.99) => UMAP(n_components=?) => HDBSCAN.
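For reference, the StandardScaler and PCA(0.99) stages of this pipeline might look like the sketch below in scikit-learn. The helper name `scale_and_reduce` and the synthetic input are assumptions for illustration; the real autoencoder codes are not reproduced here, so the component count printed will not match the 115 reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def scale_and_reduce(features, variance=0.99):
    """Standardize, then keep the fewest principal components whose
    cumulative explained variance reaches `variance`."""
    pipe = make_pipeline(
        StandardScaler(),
        # A float in (0, 1) lets PCA pick the component count itself.
        # The full solver is requested explicitly so the behavior is the
        # same across scikit-learn versions.
        PCA(n_components=variance, svd_solver="full"),
    )
    reduced = pipe.fit_transform(features)
    pca = pipe.named_steps["pca"]
    return reduced, pca.n_components_, float(pca.explained_variance_ratio_.sum())


# Synthetic stand-in for 512-dimensional autoencoder codes (not real data).
rng = np.random.default_rng(0)
codes = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 512))
codes += 0.05 * rng.normal(size=codes.shape)

reduced, n_kept, explained = scale_and_reduce(codes)
print(n_kept, round(explained, 4))
```

Checking `n_kept` and `explained` on the real feature vectors is a quick way to see how much the PCA stage is actually discarding before UMAP.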
Does this clustering pipeline make sense starting from autoencoder feature vectors, or how could I improve it?
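The stratified sampling step described earlier, drawing a balanced training set from the cluster labels, could be sketched like this. It assumes equal allocation (the same number of pores from every cluster, capped by cluster size) and that HDBSCAN noise points carry the label -1; the function name and arguments are hypothetical.

```python
import numpy as np


def stratified_sample(labels, n_per_cluster, include_noise=False, seed=0):
    """Return sorted indices with up to n_per_cluster points from each cluster."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for lab in np.unique(labels):
        if lab == -1 and not include_noise:
            continue  # skip points HDBSCAN left unclustered
        members = np.flatnonzero(labels == lab)
        k = min(n_per_cluster, members.size)  # small clusters contribute all they have
        chosen.append(rng.choice(members, size=k, replace=False))
    if not chosen:
        return np.array([], dtype=int)
    return np.sort(np.concatenate(chosen))
```

Proportional allocation (sampling each cluster in proportion to its size) is the other common choice; equal allocation is used here because the stated goal is coverage of every morphology rather than matching the population frequencies.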
Also, I had a few other questions that would be great to get some advice on:
This is what I get using my pipeline with n_components=2. The clusters seem mostly well defined and there is decent separation. The dataset has 516,517 pores, and I used UMAP(n_neighbors=250, min_dist=0, n_components=2) and HDBSCAN(min_cluster_size=1000, min_samples=400). Should I be happy with this result, or would it still be better to explore a larger number of components?
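One possible way to approach that last question is to rerun the pipeline with a different n_components (or a different random seed) and measure how much the two label vectors agree. The sketch below uses the adjusted Rand index from scikit-learn on the points that both runs assigned to a cluster; the function name and the choice to exclude noise points from the comparison are assumptions.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def compare_clusterings(labels_a, labels_b):
    """Compare two label vectors for the same points (noise labeled -1)."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    if a.shape != b.shape:
        raise ValueError("label vectors must describe the same points")

    # Only compare points that both runs placed in some cluster.
    both = (a != -1) & (b != -1)
    ari = adjusted_rand_score(a[both], b[both]) if both.any() else float("nan")
    return {
        "ari_clustered": float(ari),
        "n_compared": int(both.sum()),
        "noise_frac_a": float(np.mean(a == -1)),
        "noise_frac_b": float(np.mean(b == -1)),
    }
```

An index close to 1 means the two runs group the pores almost identically (cluster IDs themselves may differ), which would suggest the 2-component result is not losing much. The noise fractions are reported separately because a run can look stable on clustered points while discarding many more pores as noise.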