You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copied from: scikit-learn/scikit-learn#27848
In the comments, it was suggested that scikit-learn-extras may be a better fit for this algorithm.
Describe the workflow you want to enable
RASTER is a very fast clustering algorithm that runs in linear time, uses constant memory, and only requires a single pass. The relevant package is cluster.
Describe your proposed solution
RASTER has been shown to be faster than all other clustering algorithms that are part of the cluster package (see comparative results in the "alternatives" field). A detailed description of the algorithm is in this paper. The key idea is that data points are projected onto a grid. This helper data structure that allows us to cluster data points at the desired level of precision and at a speed much faster than any other clustering algorithm we encountered in the literature. The closest comparison we were made aware of was CLIQUE, but RASTER is more efficient and, in fact, many orders of magnitude faster, which we have also shown experimentally, see Appendix B in the paper above.
Describe alternatives you've considered, if relevant
We compare RASTER to 10 other clustering algorithms, and have found that it outperforms them. RASTER is not only faster, it is also able to process greater amounts of data, ceteris paribus. Here is a summary of the results of our research:
Additional context
RASTER was discovered in the context of an industrial research project "FUMA - Fleet telematics big data analytics for vehicle Usage Modeling and Analysis" (description, final report), which was administered by the Swedish research agency VINNOVA. It was a collaboration between Scania, a Volkswagen subsidiary, and the Fraunhofer-Chalmers Center for Industrial Mathematics. This research project ran from 2016 to 2019.
A practical result of RASTER was that it made it possible to process TBs of real-world geo-spatial data on a local workstation instead of a data center, leading to significant cost and time-savings. This also implies that we could eliminate security risks as we could keep this highly confidential dataset in-house. In fact, due to the single-pass nature of RASTER, we could generate results locally much faster than we could have gotten them via a data center.
The text was updated successfully, but these errors were encountered:
Hello @gregorulm , thanks this seem interesting and could definitely be included into scikit-learn-extra.
Could you do a Pull Request from a fork, with a link to this issue.
For inclusion, you will need to add tests as well as examples and docstrings for your code, if you are upt for it you can do a PR and we can discuss there.
Copied from: scikit-learn/scikit-learn#27848
In the comments, it was suggested that scikit-learn-extras may be a better fit for this algorithm.
Describe the workflow you want to enable
RASTER is a very fast clustering algorithm that runs in linear time, uses constant memory, and only requires a single pass. The relevant package is
cluster
.Describe your proposed solution
RASTER has been shown to be faster than all other clustering algorithms that are part of the
cluster
package (see comparative results in the "alternatives" field). A detailed description of the algorithm is in this paper. The key idea is that data points are projected onto a grid. This helper data structure that allows us to cluster data points at the desired level of precision and at a speed much faster than any other clustering algorithm we encountered in the literature. The closest comparison we were made aware of was CLIQUE, but RASTER is more efficient and, in fact, many orders of magnitude faster, which we have also shown experimentally, see Appendix B in the paper above.Plots with comparisons:
Example of adjusting the precision parameter:
Pseudo-code:
Implementation:
https://github.com/FraunhoferChalmersCentre/raster/tree/master
The algorithm is furthermore parallelizable.
Describe alternatives you've considered, if relevant
We compare RASTER to 10 other clustering algorithms, and have found that it outperforms them. RASTER is not only faster, it is also able to process greater amounts of data, ceteris paribus. Here is a summary of the results of our research:
Additional context
RASTER was discovered in the context of an industrial research project "FUMA - Fleet telematics big data analytics for vehicle Usage Modeling and Analysis" (description, final report), which was administered by the Swedish research agency VINNOVA. It was a collaboration between Scania, a Volkswagen subsidiary, and the Fraunhofer-Chalmers Center for Industrial Mathematics. This research project ran from 2016 to 2019.
A practical result of RASTER was that it made it possible to process TBs of real-world geo-spatial data on a local workstation instead of a data center, leading to significant cost and time-savings. This also implies that we could eliminate security risks as we could keep this highly confidential dataset in-house. In fact, due to the single-pass nature of RASTER, we could generate results locally much faster than we could have gotten them via a data center.
The text was updated successfully, but these errors were encountered: