Transform method: why is it using hash #637
aarondbaron started this conversation in General
In the transform method, there is this code:
Why is this code here at all? If one were checking how the transform method performs compared to fit_transform, a simple strategy would be to apply transform to the original training samples; doing that reveals that the transform method actually takes quite a long time to compute. Having this hash check does not seem right, and its purpose here is unclear.
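For context, the check being asked about appears to be a cache: the model stores a hash of the training data at fit time, and if transform later receives that exact data, it returns the stored embedding without recomputing anything. A minimal sketch of that pattern (the class, attribute names, and toy "embedding" here are illustrative assumptions, not umap-learn's actual implementation):

```python
import hashlib
import pickle

class Embedder:
    """Toy model illustrating a cached-transform pattern (illustrative only)."""

    def fit_transform(self, X):
        # Remember a hash of the training data alongside its embedding.
        self._input_hash = hashlib.sha256(pickle.dumps(X)).hexdigest()
        self._embedding = [[sum(row)] for row in X]  # stand-in for the real embedding
        return self._embedding

    def transform(self, X):
        # If transform() is called on the exact training data, the hash matches
        # and the stored embedding is returned immediately -- so timing
        # transform() on the training set would measure nothing.
        if hashlib.sha256(pickle.dumps(X)).hexdigest() == self._input_hash:
            return self._embedding
        return [[sum(row)] for row in X]  # stand-in for the expensive path
```

Under this reading, benchmarking transform against fit_transform on the training samples only exercises the cached branch, which may be why the check seems confusing.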
More generally, the transform method doesn't seem all that useful when trying to scale to larger datasets. It appears that, similar to the approach van der Maaten described for t-SNE, the best way to implement a transform is to train a neural network to learn the embedding space, which also seems to be what Parametric UMAP does. Is there any point in having transform at all if it isn't faster than simply calling fit_transform? Should it be deprecated in favor of Parametric UMAP?