notebooks/3-corresp-analysis.Rmd

---
author: "Maciej Beręsewicz"
title: "Correspondence Analysis"
output: html_notebook
---

# Setting for notebook

```{r}
library(reticulate)
use_python("/usr/local/anaconda3/bin/python")
```

```{r}
knitr::opts_chunk$set(engine.path = list(
  python = "/usr/local/anaconda3/bin/python",
  julia = "/Applications/Julia-1.3.app/Contents/Resources/julia/bin/"
))
```

# Correspondence analysis

## Correspondence analysis in R

```{r}
install.packages("ca")
library(ca)
library(vcd)
```

Data for the lecture

```{r}
data_vec <- c(1933, 1175, 1186, 646, 579, 671, 707, 780, 767, 768, 962, 1126)
data_mat <- matrix(data = data_vec, nrow = 4, ncol = 3, byrow = T)
rownames(data_mat) <- c("Rural", "city 20k", "city 20k-100k", "city 100k+")
colnames(data_mat) <- c("Realised","Refusals","Errors")
data_mat
```

In order to verify whether the correspondence analysis may be applied one should follow the steps:

1. verify relationship (using $\chi^2$ test)
2. verify correlation (using Cramers' V statistic)
3. proceed with correspondence analysis

```{r}
chisq.test(data_mat)
chisq.test(data_mat, correct = T) ## Yates' correction
```

```{r}
assocstats(data_mat)
```

```{r}
prop.table(data_mat, 1)
```

To conduct the correspondence analysis in R we can use function `ca` from the package `ca`.

```{r}
result <- ca(data_mat)
plot(result)
```

```{r}
result
```

```{r}
summary(result)
```

Read the data about shops and factors that drove respondents to select these shops


```{r}
shops <- readRDS("../data/data_ca_lecture.rds")
shops
```

```{r}
shops_tab <- xtabs(~ Sklep + Powod, data = shops)
shops_tab
```

```{r}
chisq.test(shops_tab, correct = T)
```

```{r}
assocstats(shops_tab)
```

1. why there is dependence in the data?
2. what factors correspond with which shops?
3. are there cluster of shops?
4. what are the main factors that drove respondents to select given shop?

```{r}
shops_ca <- ca(shops_tab)
plot(shops_ca)
```

```{r}
summary(shops_ca)
```


Follow the example about ages and tv stations

```{r}
vec <- c(26, 10, 40, 44, 18, 14, 15, 12, 21)
tab <- matrix(data = vec, nrow = 3, byrow = T)
colnames(tab) <- c("tvp1", "polsat", "tvn")
rownames(tab) <- c("<20","20-35", "35+")
chisq.test(tab)
```
```{r}
sum(tab)
```

```{r}
result_tv <- ca(tab)
result_tv
```

```{r}
plot(result_tv)
```

## Correspondence analysis in Python

To conduct correspondence analysis we need to install module `prince` (https://github.com/MaxHalford/prince) or we may implement the correspondence analysis by ourselves

```{python}
import prince
import pandas as pd
```

Take the data

```{python}
X = pd.DataFrame(
    data=[
        [1933, 1175, 1186],
        [646, 579, 671],
        [707, 780, 767],
        [768, 962, 1126]
    ],
    columns=pd.Series(["Realised","Refusals","Errors"]),
    index=pd.Series(["Rural", "city 20k", "city 20k-100k", "city 100k+"])
)
X
```

Define settings for correspondence analysis

```{python}
ca = prince.CA(n_components=2, n_iter=100, copy=True,check_input=True,engine='auto',random_state=42)
ca = ca.fit(X)
ca
```

```{python}
dir(ca)
```

The main important elements

col_masses_
row_masses_
column_coordinates
row_coordinates
eigenvalues_
explained_inertia_
total_inertia_

```{python}
print("---------- row coords ----------")
ca.row_coordinates(X)
print("---------- col coords ----------")
ca.column_coordinates(X)
print("---------- col masses ----------")
ca.col_masses_
print("---------- row masses ----------")
ca.row_masses_
print("---------- % of intertia explained ----------")
ca.explained_inertia_
```


```{python}
ca.plot_coordinates(X = X).get_figure()
```


## Correspondence analysis in Julia

We need to implement correspondence analysis by ourselves.


We start with loading library for Linear Algebra and define our own functions resembling R `rowSums` and `colSums`.

```{julia}
using LinearAlgebra
using Plots
gr()
using DataFrames
rowSums(x) = sum(x, dims = 2);
colSums(x) = sum(x, dims = 1);
```

Nest, we impelement step by step correspondence analysis using SVD algorithm

```{julia}
function ca(m)
    P = m/sum(m);
    r = rowSums(P); ## rowsums
    c = colSums(P); ## colsums
    Dr = Diagonal(r[:]); ## diagm(0 => r[:]); dense matrix
    Dc  = Diagonal(c[:]); ## diagm(0 => c[:]); dense matrix
    A = Dr^(-0.5)*(P - r*c)*Dc^(-0.5)
    U,Γ,V=svd(A);
    X = Dr^(-1/2) * U * Diagonal(Γ) 
    Y = Dc^(-1/2) * V * Diagonal(Γ)
    return(X, Y)
end
```

```{julia}
m = [1933 1175 1186;
     646 579 671;
     707 780 767;
     768 962 1126]
```


```{julia}
X,Y=ca(m);
print("---------- row coords ----------")
X[:,1:2]
print("---------- col coords ----------")
Y[:,1:2]
```

Plot the results

```{julia}
res = DataFrame(vcat(X[:,1:2], Y[:,1:2]));
res.label = ["Rural", "city 20k", "city 20k-100k", "city 100k+", "Realised","Refusals","Errors"];
res.color = ["red", "red", "red", "red", "blue","blue","blue"];
res
```

```{julia}
scatter(res.x1, res.x2, text = res.label, color = res.color)
```