[ENH] Added R-Clustering clusterer to aeon #2382
Conversation
Thank you for contributing to aeon!
Hi, thanks for this, but if we include this clusterer we want it to use our version of the Rocket transformers, which are optimised with numba.
Sure, I will try to reimplement it using the aeon Rocket transformers.
…_branch
# Conflicts:
#	aeon/clustering/_r_cluster.py
@MatthewMiddlehurst I've resolved the PCA issue, and all test cases are now passing. I also added random_state to the test's default parameters, similar to other clustering models in aeon, to fix the test failure where estimator.labels_ did not match estimator.predict(data). If you have some time, could you review the code and let me know if any improvements are needed?
Hi @MatthewMiddlehurst, just checking in and kindly following up on this PR when you have a moment.
Hi, this is a complex PR and the project is currently very busy, so it is unlikely this will be in soon. I have left a few comments, but I don't imagine they will be the last.
I see you linked some results above at some point, but this seems to be for just one of the estimators? I'm not sure if that is this implementation or the original code. One of the things I am going to ask for before merging is a comparison of both this and the original, so that is also something you can do.
```python
def check_params(self, X):
    """
    Check and adjust parameters related to multiprocessing.

    Parameters
    ----------
    X : np.ndarray
        Input data.

    Returns
    -------
    np.ndarray
        Processed input data with float32 type.
    """
    X = X.astype(np.float32)
    if self.n_jobs < 1 or self.n_jobs > multiprocessing.cpu_count():
        n_jobs = multiprocessing.cpu_count()
    else:
        n_jobs = self.n_jobs
    set_num_threads(n_jobs)
    return X
```
I do not think this is required. Use the `check_n_jobs` utility. If you are setting numba threads, make sure to set it back to the original value when done.
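For reference, a minimal sketch of what is being suggested, assuming `check_n_jobs` is importable from `aeon.utils.validation` and using numba's thread controls; the method body and attribute names here are illustrative, not the PR's actual code:

```python
import numpy as np
from numba import get_num_threads, set_num_threads

from aeon.utils.validation import check_n_jobs  # assumed import location


def _check_params(self, X):
    """Cast input to float32 and configure numba threading."""
    X = X.astype(np.float32)
    # check_n_jobs resolves values like -1 to a valid thread count
    self._n_jobs = check_n_jobs(self.n_jobs)
    # remember the caller's numba thread count so it can be restored later
    self._prev_threads = get_num_threads()
    set_num_threads(self._n_jobs)
    return X


# ...and once the numba-accelerated work in fit/predict is finished:
# set_num_threads(self._prev_threads)
```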
What changed? This looks the same.
```python
from aeon.clustering.feature_based._r_cluster import RClusterer
from aeon.datasets import load_gunpoint

X_ = [
```
If this is randomly generated data, use the data generation utility in the individual test instead. If it is not, what is the source?
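For example, assuming the utility meant here is aeon's test data generator (names and shapes are illustrative):

```python
from aeon.testing.data_generation import make_example_3d_numpy

# generate a small random collection inside the test itself,
# instead of hard-coding an X_ array in the test module
X, y = make_example_3d_numpy(
    n_cases=10, n_channels=1, n_timepoints=32, random_state=0
)
```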
The results compare how similar the original implementation's output is to this one's. And for some reason, if I don't set random_state the test fails at `assert np.array_equal(estimator.labels_, estimator.predict(data))`; I am not sure why. By the way, thanks for taking the time to review this.
I see the image you posted, but it only has one column of ARI scores. I'm not sure what those scores are for: the clusters produced by your implementation or by the original. We would want them for both so we can compare. The test failure seems legit.
The ARI scores are calculated between the output produced by this implementation and the original one (i.e. between the original model's output and this model's output). The test cases only fail when random_state is not provided; as you can see, the estimator was passed without a random_state.
It should not fail with no random_state, or at least we should know why and that it is unsolvable. So you took the cluster predictions from both and used those to calculate ARI? If so, that is not how you evaluate these algorithms.
Without specifying a random state, the transformed data (from _get_transformed_data) will be different even when using the same input, which results in differences between the predictions and labels_ (as we are calling _get_transformed_data in both fit and predict). I thought ARI is typically used to assess the similarity between two clustering outputs, such as between the original model and our implementation. However, if you'd prefer that I evaluate the similarity of each model's output against the true y values instead, I'm happy to do that.
Yes, why is this happening?
Yes, please do that. I am more interested in performance against the labels. Your previous results do show that there are large differences between the clusterers for some datasets, it looks like?
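Something along these lines is presumably what is being asked for: score each implementation's predictions against the ground-truth labels rather than against each other (variable names are placeholders):

```python
from sklearn.metrics import adjusted_rand_score

# labels_new / labels_original are the cluster assignments from each
# implementation; y_true holds the dataset's ground-truth class labels
ari_new = adjusted_rand_score(y_true, labels_new)
ari_original = adjusted_rand_score(y_true, labels_original)
```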
_get_parameterised_data uses

```python
quantiles = random_state.permutation(quantiles)
```

so without setting a random state, the quantiles might change between fit and predict. Similarly,

```python
biases = _fit_biases(
    X,
    n_channels_per_combination,
    channel_indices,
    dilations,
    num_features_per_dilation,
    quantiles,
    self.indices,
    self.random_state,
)
```

also depends on the random state, so without it, the parameters can end up different even for the same input.
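A quick illustration of the failure mode (not the PR's code): with no seed, fit and predict effectively draw from different RNGs, so the permuted quantiles differ:

```python
import numpy as np

quantiles = np.linspace(0.1, 0.9, 8)
# two unseeded RandomState instances, as in separate fit and predict calls;
# the permutations will almost surely differ, giving different kernels
print(np.random.RandomState().permutation(quantiles))
print(np.random.RandomState().permutation(quantiles))
```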
It also aligns with the original results: https://github.com/jorgemarcoes/R-Clustering/blob/main/results/benchmark_UCR_results.csv |
Since those kernels depend on the input data, I figured it made sense to generate them based on the given data for predict. The original source code only had fit_predict, so that's what led me to think this way. Do you think it's reasonable to use the same parameters that were used during fitting for predict too?
No, a new kernel is essentially a completely new feature. The feature set you are creating in predict is completely different from the one you are generating in fit.
Should I use the same parameters created in fit for predict too?
@MatthewMiddlehurst I have updated the predict function to use the parameters from fit, and it's now passing all the test cases. Feel free to take a look when you get a chance.
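A hedged sketch of the resolution, using the private method names mentioned in this thread (the exact signatures in the PR may differ): fit derives and stores the kernel parameters once, and predict reuses them so train and test features live in the same space:

```python
def _fit(self, X, y=None):
    X = self._check_params(X)
    # derive kernel parameters once and keep them for later calls
    self._parameters = self._get_parameterised_data(X)
    X_trans = self._get_transformed_data(X, self._parameters)
    self._estimator.fit(X_trans)
    return self


def _predict(self, X):
    X = self._check_params(X)
    # reuse the fitted parameters instead of regenerating kernels,
    # so the feature space matches the one used during fit
    X_trans = self._get_transformed_data(X, self._parameters)
    return self._estimator.predict(X_trans)
```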
Reference Issues/PRs
#2132
What does this implement/fix? Explain your changes.
Added the R-Clustering model for aeon.
Does your contribution introduce a new dependency? If yes, which one?
No.
Any other comments?
PR checklist
For all contributions
For new estimators and functions
I have added myself as `__maintainer__` at the top of relevant files and want to be contacted regarding their maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.
For developers with write access