
[ENH] Added R-Clustering clusterer to aeon #2382


Open · wants to merge 140 commits into main

Conversation

@Ramana-Raja (Contributor) commented Nov 22, 2024

Reference Issues/PRs

#2132

What does this implement/fix? Explain your changes.

Added the R-Clustering model to aeon.

Does your contribution introduce a new dependency? If yes, which one?

No.

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors. Alternatively, you can use the @all-contributors bot to do this for you.
  • The PR title starts with either [ENH], [MNT], [DOC], [BUG], [REF], [DEP] or [GOV] indicating whether the PR topic is related to enhancement, maintenance, documentation, bugs, refactoring, deprecation or governance.
For new estimators and functions
  • I've added the estimator to the online API documentation.
  • (OPTIONAL) I've added myself as a __maintainer__ at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.
For developers with write access
  • (OPTIONAL) I've updated aeon's CODEOWNERS to receive notifications about future changes to these files.

@aeon-actions-bot (bot) added the clustering (Clustering package) and enhancement (New feature, improvement request or other non-bug code enhancement) labels Nov 22, 2024
@aeon-actions-bot (bot) commented:

Thank you for contributing to aeon

I have added the following labels to this PR based on the title: [ $\color{#FEF1BE}{\textsf{enhancement}}$ ].
I have added the following labels to this PR based on the changes made: [ $\color{#4011F3}{\textsf{clustering}}$ ]. Feel free to change these if they do not properly represent the PR.

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Slack channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

  • Run pre-commit checks for all files
  • Run mypy typecheck tests
  • Run all pytest tests and configurations
  • Run all notebook example tests
  • Run numba-disabled codecov tests
  • Stop automatic pre-commit fixes (always disabled for drafts)
  • Disable numba cache loading
  • Push an empty commit to re-run CI checks

@Ramana-Raja changed the title from "[ENH] Added R-Clustering clusterer to aeon for issue #2132" to "[ENH] Added R-Clustering clusterer to aeon #2132" Nov 22, 2024
@Ramana-Raja changed the title from "[ENH] Added R-Clustering clusterer to aeon #2132" to "[ENH] Added R-Clustering clusterer to aeon" Nov 22, 2024
@TonyBagnall (Contributor) commented:

Hi, thanks for this, but if we include this clusterer we want it to use our version of the Rocket transformers, which are optimised for numba.

@Ramana-Raja (Contributor, Author) commented:

Hi, thanks for this, but if we include this clusterer we want it to use our version of the Rocket transformers, which are optimised for numba.

Sure, I will try to reimplement it using the aeon Rocket transformers.

@Ramana-Raja (Contributor, Author) commented:

@MatthewMiddlehurst I've resolved the PCA issue, and all test cases are now passing. I also added random_state to the test-case default parameters, similar to other clustering models in aeon, to fix the test failure where estimator.labels_ was not matching estimator.predict(data). If you have some time, could you review the code and let me know if any improvements are needed?

@Ramana-Raja (Contributor, Author) commented:


Hi @MatthewMiddlehurst, just checking in and kindly following up on this PR when you have a moment.

@MatthewMiddlehurst (Member) left a comment


Hi, this is a complex PR and the project is currently very busy so it is unlikely this will be in soon. I have left a few comments but I don't imagine they will be the last.

I see you linked some results above at some point, but this seems to be for just one of the estimators? Not sure if that is this or the original code. One of the things I am going to ask for before merging is a comparison for both this and the original, so that is also something you can do.

Comment on lines +409 to +429
def check_params(self, X):
    """
    Check and adjust parameters related to multiprocessing.

    Parameters
    ----------
    X : np.ndarray
        Input data.

    Returns
    -------
    np.ndarray
        Processed input data with float32 type.
    """
    X = X.astype(np.float32)
    if self.n_jobs < 1 or self.n_jobs > multiprocessing.cpu_count():
        n_jobs = multiprocessing.cpu_count()
    else:
        n_jobs = self.n_jobs
    set_num_threads(n_jobs)
    return X
Member review comment:

I do not think this is required. Use the check_n_jobs utility. If you are setting numba threads, make sure to set them back to the original value when done.
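A minimal sketch of what the reviewer describes, assuming check_n_jobs is importable from aeon.utils.validation and using numba's get_num_threads/set_num_threads; the _run_with_n_threads helper name is hypothetical:

from numba import get_num_threads, set_num_threads

from aeon.utils.validation import check_n_jobs  # assumed import path


def _run_with_n_threads(requested_n_jobs, work):
    """Toy helper: run `work()` with a temporary numba thread count."""
    n_jobs = check_n_jobs(requested_n_jobs)  # resolves -1 / out-of-range values
    prev_threads = get_num_threads()         # remember the caller's numba setting
    set_num_threads(n_jobs)
    try:
        return work()                        # the numba-parallel transform goes here
    finally:
        set_num_threads(prev_threads)        # restore the original thread count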

Member review comment:

What changed? This looks the same.

from aeon.clustering.feature_based._r_cluster import RClusterer
from aeon.datasets import load_gunpoint

X_ = [
Member review comment:

If this is randomly generated data, use the data generation utility in the individual test instead. If it is not, what is the source?
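For reference, a sketch of what using the testing utility might look like, assuming make_example_3d_numpy from aeon.testing.data_generation is the intended helper:

from aeon.testing.data_generation import make_example_3d_numpy

# Small reproducible random collection: 10 univariate series of length 32.
X, _ = make_example_3d_numpy(n_cases=10, n_channels=1, n_timepoints=32, random_state=0)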

@Ramana-Raja (Contributor, Author) commented Apr 4, 2025

Hi, this is a complex PR and the project is currently very busy so it is unlikely this will be in soon. I have left a few comments but I don't imagine they will be the last.

I see you linked some results above at some point, but this seems to be for just one of the estimators? Not sure if that is this or the original code. One of the things I am going to ask for before merging is a comparison for both this and the original, so that is also something you can do.

The results compare how similar the original implementation's output is to this one's. For some reason, if I don't set a random state the test case fails at "assert np.array_equal(estimator.labels_, estimator.predict(data))", and I am not sure why. By the way, thanks for taking the time to review this.

@MatthewMiddlehurst (Member) commented:

I see the image you posted, but it only has one column of ARI scores. I'm not sure what those scores are for: the clusters produced by yours or by the original. We would want them for both so we can compare.

Test failure seems legit. labels_ and the predict output should be the same.

@Ramana-Raja (Contributor, Author) commented Apr 5, 2025

I see the image you posted, but it only has one column of ARI scores. I'm not sure what those scores are for: the clusters produced by yours or by the original. We would want them for both so we can compare.

Test failure seems legit. labels_ and the predict output should be the same.

The ARI scores are calculated between the output produced by this implementation and the original one (i.e. between the original model's output and this model's output). The test cases only fail when random_state is not provided; as you can see, the estimator was passed without a random_state.

[image: screenshot of the failing test]

@MatthewMiddlehurst (Member) commented:

It should not fail with no random_state, or at least we should know why and that it is unsolvable.

So you took the cluster predictions from both and used those to calculate ARI? That is not how you evaluate these algorithms if so.

@Ramana-Raja (Contributor, Author) commented Apr 6, 2025

It should not fail with no random_state, or at least we should know why and that it is unsolvable.

So you took the cluster predictions from both and used those to calculate ARI? That is not how you evaluate these algorithms if so.

Without specifying a random state, the transformed data (from _get_transformed_data) will be different even when using the same input, which results in differences between the predictions and labels_ (as we are calling _get_transformed_data in both fit and predict). I thought ARI is typically used to assess the similarity between two clustering outputs, such as between the original model and our implementation. However, if you'd prefer that I evaluate the similarity of each model's output against the true y values instead, I'm happy to do that.

@MatthewMiddlehurst (Member) commented:

Without specifying a random state, the transformed data (from _get_transformed_data) will be different even when using the same input, which results in differences between the predictions and labels_ (as we are calling _get_transformed_data in both fit and predict).

Yes, why is this happening?

I thought ARI is typically used to assess the similarity between two clustering outputs, such as between the original model and our implementation. However, if you'd prefer that I evaluate the similarity of each model's output against the true y values instead, I'm happy to do that.

Yes, please do that. I am more interested in performance against the labels. Your previous results do show that there are large differences between the clusterers for some datasets, it looks like?

@Ramana-Raja (Contributor, Author) commented Apr 11, 2025

Without specifying a random state, the transformed data (from _get_transformed_data) will be different even when using the same input, which results in differences between the predictions and labels_ (as we are calling _get_transformed_data in both fit and predict).

Yes, why is this happening?

_get_parameterised_data uses

quantiles = random_state.permutation(quantiles)

so without setting a random state the quantiles might change between fit and predict. Similarly, _fit_biases

biases = _fit_biases(
            X,
            n_channels_per_combination,
            channel_indices,
            dilations,
            num_features_per_dilation,
            quantiles,
            self.indices,
            self.random_state,
        )

also depends on the random state, so without it the parameters can end up different even for the same input.
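To illustrate the point with a toy example (not the actual aeon code): two unseeded generators will almost surely permute the quantiles differently, which is exactly why the parameters drawn in fit and predict diverge.

import numpy as np

quantiles = np.linspace(0.1, 0.9, 9)
rng_fit = np.random.default_rng()      # unseeded generator, as in fit without random_state
rng_predict = np.random.default_rng()  # a second unseeded generator, as in predict

# Almost always False: the permutations differ, so the derived parameters differ too.
print(np.array_equal(rng_fit.permutation(quantiles), rng_predict.permutation(quantiles)))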

I thought ARI is typically used to assess the similarity between two clustering outputs, such as between the original model and our implementation. However, if you'd prefer that I evaluate the similarity of each model's output against the true y values instead, I'm happy to do that.

Yes, please do that. I am more interested in performance against the labels. Your previous results do show that there are large differences between the clusterers for some datasets, it looks like?

Here is the result:
[image: table of ARI results per dataset]

It also aligns with the original results: https://github.com/jorgemarcoes/R-Clustering/blob/main/results/benchmark_UCR_results.csv
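For context, the evaluation against ground-truth labels can be sketched as below, using scikit-learn's adjusted_rand_score; the RClusterer import path comes from the test snippet above, load_gunpoint stands in for any UCR dataset, and the constructor arguments are assumptions rather than the PR's exact API:

from sklearn.metrics import adjusted_rand_score

from aeon.clustering.feature_based._r_cluster import RClusterer
from aeon.datasets import load_gunpoint

X, y = load_gunpoint(split="train")
clusterer = RClusterer(n_clusters=2, random_state=0)  # parameter names assumed
labels = clusterer.fit_predict(X)
print(adjusted_rand_score(y, labels))  # ARI of the predicted clustering vs. the true labels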

@MatthewMiddlehurst (Member) commented:

Is _get_parameterised_data generating the Rocket kernels? If so, why are we generating new kernels in predict?

@Ramana-Raja (Contributor, Author) commented Apr 11, 2025

Is _get_parameterised_data generating the Rocket kernels? If so, why are we generating new kernels in predict?

Since those kernels depend on the input data, I figured it made sense to generate them from the given data in predict as well. The original source code only had fit_predict, so that's what led me to think this way. Do you think it's reasonable to use the same parameters from fitting in predict too?

@MatthewMiddlehurst (Member) commented:

No, a new kernel is a completely new feature essentially. The feature set you are creating in predict is completely different from the one you are generating in fit.

@Ramana-Raja (Contributor, Author) commented:

No, a new kernel is a completely new feature essentially. The feature set you are creating in predict is completely different from the one you are generating in fit.

Should I use the same parameters created in fit for predict too?

@Ramana-Raja (Contributor, Author) commented Apr 12, 2025

@MatthewMiddlehurst I have updated the predict function to utilize the parameters from fit, and it’s now passing all the test cases. Feel free to take a look when you get a chance.
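As a rough, self-contained illustration of that pattern (a toy analogue, not the PR code): the random "kernels" are drawn once in fit, stored on the estimator, and reused in predict, so labels_ and predict on the same data agree even without a fixed random_state.

import numpy as np
from sklearn.cluster import KMeans


class FitOnceReuseSketch:
    """Toy clusterer: draw a random projection once in fit and reuse it in predict."""

    def __init__(self, n_kernels=100, n_clusters=2, random_state=None):
        self.n_kernels = n_kernels
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X):
        rng = np.random.default_rng(self.random_state)
        # Draw the random "kernels" once and store them for later use.
        self.kernels_ = rng.normal(size=(X.shape[1], self.n_kernels))
        self._km = KMeans(
            n_clusters=self.n_clusters, n_init=10, random_state=self.random_state
        )
        self.labels_ = self._km.fit_predict(X @ self.kernels_)
        return self

    def predict(self, X):
        # Reuse the stored kernels so fit and predict share the same feature space.
        return self._km.predict(X @ self.kernels_)

With this structure, predict on the training data matches labels_ regardless of the seed, which is the property the failing test checks.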

Labels: clustering (Clustering package), enhancement (New feature, improvement request or other non-bug code enhancement)

4 participants