Why Correlation Approximation?

Justin Gawrilow edited this page Mar 19, 2014 · 7 revisions

The correlation approximation engine is a Spark-based implementation of the better-known Google Correlate.

When analyzing a new time series, you may want to compare it against a bank of existing time series data to discover possible relationships in the data. You may also want to compare all time series in a data set to one another to find correlations.

Time Series of IP Address Counts

Directly comparing a time series against every other series may work for small or moderately sized data sets, but with large data sets and long vectors the operation can take longer than a user is willing to wait. A scalable approximation technique lets you answer these correlation queries on very large data sets quickly.
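To make the trade-off concrete, here is a minimal sketch in plain Python. It contrasts a direct Pearson-correlation scan with a reduced-dimension scan based on random projection. Random projection is an assumption chosen for illustration; this page does not say which approximation technique the engine uses. The function names (`exact_top_k`, `make_projection`, `project`, `approx_top_k`) and the sample data are hypothetical and are not part of the engine's API.

```python
import math
import random


def standardize(series):
    """Center a series and scale it to unit length, so that the dot product
    of two standardized series equals their Pearson correlation."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        return [0.0] * n  # constant series: correlation is undefined, use 0
    return [c / norm for c in centered]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query, bank, k):
    """Direct comparison: one full-length dot product per series in the bank."""
    q = standardize(query)
    scores = [(dot(q, standardize(s)), name) for name, s in bank.items()]
    return sorted(scores, reverse=True)[:k]


def make_projection(length, dims, seed=0):
    """Random Gaussian matrix (dims x length), scaled so that dot products of
    projected vectors match the original dot products in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dims)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(length)]
            for _ in range(dims)]


def project(series, projection):
    """Standardize a series, then reduce it to len(projection) dimensions."""
    z = standardize(series)
    return [dot(row, z) for row in projection]


def approx_top_k(query, reduced_bank, projection, k):
    """Approximate comparison: the bank is projected once ahead of time, so
    each query costs one short dot product per series."""
    q = project(query, projection)
    scores = [(dot(q, r), name) for name, r in reduced_bank.items()]
    return sorted(scores, reverse=True)[:k]


# Hypothetical sample data, for illustration only.
bank = {
    "scaled": [10, 20, 30, 40, 50, 60, 70, 80],
    "reversed": [8, 7, 6, 5, 4, 3, 2, 1],
    "noisy": [1, 3, 2, 5, 4, 6, 8, 7],
}
query = [1, 2, 3, 4, 5, 6, 7, 8]

projection = make_projection(len(query), dims=4)
reduced_bank = {name: project(s, projection) for name, s in bank.items()}

print(exact_top_k(query, bank, 2))
print(approx_top_k(query, reduced_bank, projection, 2))
```

The approximate scores are estimates, and their error shrinks as `dims` grows. A typical design uses the reduced scan to shortlist candidates and then recomputes exact correlations for the shortlist only.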

For more information on the origins of correlation approximation, see Google Correlate:

Google Correlate

Google Correlate Comic

Google Correlate White Paper

Implementation notes

This implementation is currently much simpler than Google Correlate. We've started with a simple system that reads input from local or HDFS files and writes correlation results to local or HDFS files. We've also included a simple interactive command-line interface.
