Why Correlation Approximation?

The correlation approximation engine is a Spark-based implementation of the better-known Google Correlate.

When analyzing a new time series, you may want to compare it against a bank of existing time series to discover possible relationships in the data. You may also want to compare every time series in a data set against every other to find correlations.

Time Series of IP Address Counts
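
As a point of reference, the direct comparison described above can be sketched in a few lines of Python. This is only an illustration of exact, brute-force Pearson correlation against a small bank; the function names (pearson, rank_against_bank) and the sample host series are hypothetical and are not taken from this project.

```python
import math

def pearson(x, y):
    """Exact Pearson correlation of two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A constant series has no variance; report 0.0 rather than dividing by zero.
    return num / den if den else 0.0

def rank_against_bank(query, bank):
    """Return (name, correlation) pairs, most correlated first."""
    scores = [(name, pearson(query, series)) for name, series in bank.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical bank of per-host counts (made-up values for illustration).
bank = {
    "host_a": [1, 2, 3, 4, 5],
    "host_b": [5, 4, 3, 2, 1],
    "host_c": [2, 2, 3, 2, 2],
}
ranked = rank_against_bank([10, 20, 30, 40, 50], bank)
for name, score in ranked:
    print(name, round(score, 3))
```

Each query touches every series in the bank at full length, so the cost grows with both the number of series and the length of each vector.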

Directly comparing each time series against every other series may work for small or moderately sized data sets, but with large data sets and long vectors the operation can take longer than a user is willing to wait. A scalable approximation technique trades a small amount of accuracy for speed, so these correlation queries can be answered quickly even on very large data sets.
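
One generic way to approximate correlation is to normalize each series and project it onto a small number of random directions, then compare the short sketches instead of the full vectors. The Python sketch below illustrates that idea only; it is an assumption for illustration and not a description of how this engine is implemented. All names (normalize, make_projection, project, approx_correlation) and the synthetic sine-wave data are hypothetical.

```python
import math
import random

def normalize(series):
    """Center and scale to unit length, so a dot product equals Pearson correlation."""
    mean = sum(series) / len(series)
    centered = [v - mean for v in series]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered] if norm else centered

def make_projection(input_dim, output_dim, seed=0):
    """Random Gaussian matrix, scaled so dot products are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(output_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(input_dim)]
            for _ in range(output_dim)]

def project(series, projection):
    """Reduce a series to a short sketch (one value per projection row)."""
    unit = normalize(series)
    return [sum(w * v for w, v in zip(row, unit)) for row in projection]

def approx_correlation(sketch_a, sketch_b):
    """Estimate correlation from two sketches; the error shrinks as sketches grow."""
    return sum(a * b for a, b in zip(sketch_a, sketch_b))

# Synthetic example: 200-point series reduced to 128-value sketches.
noise = random.Random(1)
base = [math.sin(i / 10.0) for i in range(200)]
similar = [v + noise.gauss(0.0, 0.05) for v in base]
opposite = [-v for v in base]

projection = make_projection(input_dim=200, output_dim=128)
sketches = {name: project(s, projection)
            for name, s in [("base", base), ("similar", similar), ("opposite", opposite)]}

approx_similar = approx_correlation(sketches["base"], sketches["similar"])
approx_opposite = approx_correlation(sketches["base"], sketches["opposite"])
print("base vs similar :", approx_similar)
print("base vs opposite:", approx_opposite)
```

The sketches can be computed once per series and stored, so a query compares short vectors rather than full-length ones. The estimate is noisy, which is why a system built this way would typically use the approximation to shortlist candidates and, if needed, rescore them exactly.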

For more information on the origins of correlation approximation, see Google Correlate:

Google Correlate

Google Correlate Comic

Google Correlate White Paper

Implementation notes

This implementation is currently much simpler than Google Correlate. We've started with a simple system that reads input from local or HDFS files and writes correlation results to local or HDFS files. We've also included a simple interactive command-line interface.
