In statistics, the Pearson product-moment correlation coefficient (or simply "the correlation coefficient") is a standard measure of the strength and direction with which two variables move together. Ranging over [-1, 1], where 1 implies perfect positive correlation and -1 implies perfect negative correlation, the statistic is the ratio of the two variables' covariance (the numerator) to the product of their standard deviations (the denominator).
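In symbols, using the standard definition for random variables X and Y with standard deviations sigma_X and sigma_Y:

\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}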
Equation 1: Pearson Product-Moment Correlation Coefficient
A key assumption of this statistic is that the underlying relationship between the two variables is linear. However, this assumption of linearity is often not borne out in reality. Imagine we are assessing the relationship between the amount of money spent on ads targeting visitors of a given website and the rate of conversion from visitor to paying customer. We could easily imagine a scenario where, up to a certain point, more resources spent on ads tends to increase conversion. However, there may come a point where the prevalence of ads is so great that it is actually off-putting to the consumer, accomplishing the opposite of its intended purpose. This scenario is not merely theoretical, but has been validated by survey data. The implication is that while ad spend may relate intimately to conversion, the correlation coefficient between these two variables is likely to be small, to the point of approaching zero.
The scatterplots below illustrate how, when the relationship between two variables involves a change in direction, the Pearson Product-Moment Correlation Coefficient fails to report the true degree of dependence between variables.
Image 1: Sets of Pearson Correlation Coefficients
SOURCE: https://commons.wikimedia.org/wiki/File:Correlation_examples2.svg
In 2007, Gábor J. Székely and his co-authors called attention to this important limitation of the correlation coefficient and introduced the concept of 'distance correlation' as part of the broader framework of 'E-statistics', statistics concerning the energy distance between probability distributions. Within this framework, Székely reformulated many classical statistical concepts, such as 'distance variance' versus variance, 'distance standard deviation' versus standard deviation, and 'distance covariance' versus covariance. Using these, the definition of the correlation coefficient can be rewritten in such a way that a value of zero occurs if, and only if, the two variables are statistically independent.
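In symbols, the distance correlation is the distance covariance normalized by the distance variances (where dVar(X) is simply the distance covariance of X with itself), and it is defined as zero whenever the denominator is zero:

\operatorname{dCor}(X, Y) = \frac{\operatorname{dCov}(X, Y)}{\sqrt{\operatorname{dVar}(X)\,\operatorname{dVar}(Y)}}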
Equation 2: Distance Correlation
Image 2: Sets of Distance Correlation Coefficients
SOURCE: https://commons.wikimedia.org/wiki/File:Distance_Correlation_Examples.svg
For example, let's create some data using R:
x <- c(0, 1, 2, 3, 4)
y <- c(2, 1, 0, 1, 2)
Next, we derive a matrix for each variable containing the pairwise distances for that variable. For the purposes of calculating the distance covariance, we use the Euclidean distance. If we were exploring two-dimensional observations (for example, points on the Cartesian plane), the appropriate formulation of the Euclidean distance would be as follows:
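For two points p = (p_1, p_2) and q = (q_1, q_2), that distance is the familiar:

d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}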
However, in the example below X and Y are each univariate, and so the Euclidean distance reduces to the absolute value of the differences between observations.
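In our case, that means the (j, k) entry of each pairwise distance matrix is simply:

a_{jk} = |x_j - x_k| \qquad \text{and} \qquad b_{jk} = |y_j - y_k|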
This can be done in R by calling the 'dist' function and specifying "euclidean" as the distance method.
x_mat <- dist(x, method = "euclidean", diag = TRUE, upper = TRUE, p = 2)
y_mat <- dist(y, method = "euclidean", diag = TRUE, upper = TRUE, p = 2)
We will also need the row and column means of these distance matrices, as well as the grand mean, in order to 'doubly center' each matrix (more on this below). If you were to derive the doubly centered matrices manually, you might use a function like the following:
take_doubly_centered_distances <- function(x_mat) {
  # x_mat is a 'dist' object (or a square distance matrix); reshape2 is used
  # to work with the pairwise distances in long format
  library(reshape2)
  x_df <- melt(as.matrix(x_mat), varnames = c("row", "col"))

  # mean of each row of the distance matrix
  x_row_means <- aggregate(x_df, list(x_df$row), mean)
  x_row_means <- subset(x_row_means, select = -c(Group.1, col))
  names(x_row_means) <- c("row", "row_mean")
  x_df <- merge(x = x_df, y = x_row_means, by = "row")

  # mean of each column of the distance matrix
  x_col_means <- aggregate(x_df, list(x_df$col), mean)
  x_col_means <- subset(x_col_means, select = -c(Group.1, row, row_mean))
  names(x_col_means) <- c("col", "col_mean")
  x_df <- merge(x = x_df, y = x_col_means, by = "col")

  # grand mean of all pairwise distances
  x_df$grand_mean <- mean(c(x_row_means$row_mean, x_col_means$col_mean))

  # doubly center: subtract the row and column means, add back the grand mean
  x_df$X <- x_df$value - x_df$row_mean - x_df$col_mean + x_df$grand_mean

  # reassemble the doubly centered values into a matrix
  x_df <- x_df[with(x_df, order(col, row)), ]
  myList <- list()
  for (i in unique(x_df[["col"]])) {
    myList[[length(myList) + 1]] <- x_df[x_df$col == i, ]$X
  }
  output <- matrix(unlist(myList), ncol = length(unique(x_df[["col"]])), byrow = TRUE)
  return(output)
}
...resulting in the following:
X Pair-Wise Distances

|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 0 | 1 | 2 | 3 | 4 |
| 2 | 1 | 0 | 1 | 2 | 3 |
| 3 | 2 | 1 | 0 | 1 | 2 |
| 4 | 3 | 2 | 1 | 0 | 1 |
| 5 | 4 | 3 | 2 | 1 | 0 |

Y Pair-Wise Distances

|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 0 | 1 | 2 | 1 | 0 |
| 2 | 1 | 0 | 1 | 0 | 1 |
| 3 | 2 | 1 | 0 | 1 | 2 |
| 4 | 1 | 0 | 1 | 0 | 1 |
| 5 | 0 | 1 | 2 | 1 | 0 |
Tables 1 & 2: Pair-Wise Distances
We need to doubly center these distance matrices. 'Doubly' in this context means we first subtract from each element its row mean, then subtract its column mean, and finally add the grand mean back to each element. The resulting matrices should have all rows and all columns sum to zero.
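In symbols, if a_{jk} denotes the pairwise distance between observations j and k, the doubly centered entry is:

A_{jk} = a_{jk} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}

where \bar{a}_{j\cdot} is the j-th row mean, \bar{a}_{\cdot k} is the k-th column mean, and \bar{a}_{\cdot\cdot} is the grand mean.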
X Doubly Centered Distances

|   | 1    | 2    | 3    | 4    | 5    |
|---|------|------|------|------|------|
| 1 | -2.4 | -0.8 |  0.4 |  1.2 |  1.6 |
| 2 | -0.8 | -1.2 |  0.0 |  0.8 |  1.2 |
| 3 |  0.4 |  0.0 | -0.8 |  0.0 |  0.4 |
| 4 |  1.2 |  0.8 |  0.0 | -1.2 | -0.8 |
| 5 |  1.6 |  1.2 |  0.4 | -0.8 | -2.4 |

Y Doubly Centered Distances

|   | 1    | 2    | 3    | 4    | 5    |
|---|------|------|------|------|------|
| 1 | -0.8 |  0.4 |  0.8 |  0.4 | -0.8 |
| 2 |  0.4 | -0.4 |  0.0 | -0.4 |  0.4 |
| 3 |  0.8 |  0.0 | -1.6 |  0.0 |  0.8 |
| 4 |  0.4 | -0.4 |  0.0 | -0.4 |  0.4 |
| 5 | -0.8 |  0.4 |  0.8 |  0.4 | -0.8 |
Tables 3 & 4: Distance Matrices After Doubly Centering
Next, we need to take the arithmetic average of the element-wise products of the two doubly centered matrices. The sum of these products is also referred to as the Frobenius inner product, which we then multiply by 1 over n squared to yield the arithmetic average.
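In symbols, with A_{jk} and B_{jk} the doubly centered distances for X and Y:

\operatorname{dCov}^2_n(X, Y) = \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} A_{jk}\, B_{jk}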
Equation 3: Squared Sample Distance Covariance
We can do this manually in R with the help of the 'matrixcalc' library:
arithmetic_average_of_products <- function(x_mat, y_mat) {
  library(matrixcalc)
  # the two doubly centered matrices must have identical dimensions
  if ((nrow(x_mat) == nrow(y_mat)) & (ncol(x_mat) == ncol(y_mat))) {
    # Frobenius inner product: the sum of the element-wise products
    val <- frobenius.prod(x_mat, y_mat)
    # divide by n^2 to obtain the arithmetic average
    return(val * (1 / nrow(x_mat)^2))
  }
}
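One detail worth making explicit: the matrices passed to this function must be the doubly centered matrices, not the raw output of 'dist'. To keep the variable names consistent with the call shown below, we can simply overwrite x_mat and y_mat with their doubly centered counterparts:

x_mat <- take_doubly_centered_distances(x_mat)
y_mat <- take_doubly_centered_distances(y_mat)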
Finally, we take the square root of this result to get the sample distance covariance. If we compare the result against R's 'energy' package, we see that the two agree:
> arithmetic_average_of_products(x_mat, y_mat)^(1/2)
0.438178
>
> library(energy)
> dcov.test(x, y, index = 1.0, R = NULL)
Specify the number of replicates R (R > 0) for an independence test
data: index 1, replicates 0
nV^2 = 0.96, p-value = NA
sample estimates:
dCov
0.438178
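To go from the distance covariance to the distance correlation itself, we divide the distance covariance by the square root of the product of dVar(X) and dVar(Y), each of which is just the distance covariance of a variable with itself (Equation 2). A minimal sketch using the helper functions above (with x_mat and y_mat still holding the doubly centered matrices), followed by the 'energy' package's dcor function for comparison:

dvar_x <- arithmetic_average_of_products(x_mat, x_mat)^(1/2)
dvar_y <- arithmetic_average_of_products(y_mat, y_mat)^(1/2)
dcov_xy <- arithmetic_average_of_products(x_mat, y_mat)^(1/2)
# distance correlation: distance covariance normalized by dVar(X) and dVar(Y)
dcov_xy / sqrt(dvar_x * dvar_y)
# the same value should be returned by the energy package
library(energy)
dcor(x, y)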
- Böttcher, B., Keller-Ressel, M., and Schilling, R.L. (2019), Distance Multivariance: New Dependence Measures for Random Vectors, The Annals of Statistics, Vol. 47, No. 5, pp. 2757-2789. https://projecteuclid.org/euclid.aos/1564797863
- Székely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007), Measuring and Testing Dependence by Correlation of Distances, The Annals of Statistics, Vol. 35, No. 6, pp. 2769-2794. http://dx.doi.org/10.1214/009053607000000505
- Székely, G.J. and Rizzo, M.L. (2009), Brownian Distance Covariance, The Annals of Applied Statistics, Vol. 3, No. 4, pp. 1236-1265. http://dx.doi.org/10.1214/09-AOAS312
- Székely, G.J. and Rizzo, M.L. (2009), Rejoinder: Brownian Distance Covariance, The Annals of Applied Statistics, Vol. 3, No. 4, pp. 1303-1308. https://projecteuclid.org/euclid.aoas/1267453941