Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are two techniques for dimensionality reduction. PCA can be described as an unsupervised algorithm that ignores class labels and aims to find the directions that maximize the variance in the data. In contrast to PCA, LDA is a supervised algorithm that aims to project a dataset onto a lower-dimensional space with good class separability. In other words, LDA maximizes the ratio of between-class variance to within-class variance in the given data.
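LDA finds directions along which the classes are well separated, i.e. directions that maximize the ratio of between-class variance to within-class variance. Firstly, assume that the dataset contains $n$ classes, where class $c$ consists of the sample set $D_c$ with $N_c$ samples and mean $\mu_c$, and $\mu$ is the mean of all samples.
The between-class scatter matrix $S_B$ is defined (in its common count-weighted form) as:
$$S_B = \sum_{c=1}^{n} N_c (\mu_c - \mu)(\mu_c - \mu)^T$$
The within-class scatter matrix $S_W$ is defined as:
$$S_W = \sum_{c=1}^{n} \sum_{x \in D_c} (x - \mu_c)(x - \mu_c)^T$$
Next, we solve the generalized eigenvalue problem for the matrix $S_W^{-1} S_B$:
$$S_W^{-1} S_B w = \lambda w$$
where $w$ is an eigenvector and $\lambda$ is the corresponding eigenvalue.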
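# Number of classes = n
SolveEigenProblem <- function(withinMatrix, betweenMatrix, prior)
{
  # Eigendecomposition of Sw^-1 * Sb.
  # solve(A, B) computes A^-1 %*% B without forming the inverse explicitly.
  # References: https://www.geeksforgeeks.org/inverse-of-matrix-in-r/?ref=lbp,
  # https://www.geeksforgeeks.org/solve-linear-algebraic-equation-in-r-programming-solve-function/
  # Note: the prior argument is not used here; it is kept so existing calls still work.
  eivectors <- eigen(solve(withinMatrix, betweenMatrix))
  return(eivectors)
}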
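SolveEigenProblem expects the two scatter matrices as inputs. The following is a minimal sketch of how they might be computed, assuming the common count-weighted definition of the between-class scatter. The helper names `ComputeWithinScatterSketch` and `ComputeBetweenScatterSketch`, their signatures, and the use of the built-in `iris` data as a stand-in for the wine dataset are assumptions made for illustration; they are not the functions used in this report.

```r
# Hypothetical helpers (names and signatures are assumptions).
# data:   numeric matrix, one sample per row
# labels: class label for each row of data

ComputeWithinScatterSketch <- function(data, labels) {
  labels <- as.character(labels)
  d <- ncol(data)
  Sw <- matrix(0, d, d)
  for (cl in unique(labels)) {
    Xc <- data[labels == cl, , drop = FALSE]
    centered <- sweep(Xc, 2, colMeans(Xc))  # subtract the class mean
    Sw <- Sw + crossprod(centered)          # t(centered) %*% centered
  }
  Sw
}

ComputeBetweenScatterSketch <- function(data, labels,
                                        overallMean = colMeans(data)) {
  labels <- as.character(labels)
  d <- ncol(data)
  Sb <- matrix(0, d, d)
  for (cl in unique(labels)) {
    Xc <- data[labels == cl, , drop = FALSE]
    delta <- colMeans(Xc) - overallMean     # class mean minus overall mean
    Sb <- Sb + nrow(Xc) * outer(delta, delta)
  }
  Sb
}

# Stand-in data: iris, NOT the wine dataset from this report
X <- as.matrix(iris[, 1:4])
y <- iris$Species
Sw <- ComputeWithinScatterSketch(X, y)
Sb <- ComputeBetweenScatterSketch(X, y)
```

With these definitions, the sum of the two matrices equals the total scatter of the data around the overall mean, which is a convenient sanity check.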
The data are then projected onto a lower-dimensional subspace. TODO
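As a rough illustration of the projection step, the sketch below keeps the eigenvectors with the k largest eigenvalues and multiplies the data by them. The helper name `ProjectDataSketch`, its arguments, and the toy matrices are assumptions introduced here, not part of the original code. Since the between-class scatter matrix of n classes has rank at most n - 1, k would normally not exceed n - 1.

```r
# Hypothetical helper (name and signature are assumptions).
# data:        numeric matrix, one sample per row
# eigenResult: list returned by eigen(), e.g. from SolveEigenProblem
# k:           number of discriminant directions to keep
ProjectDataSketch <- function(data, eigenResult, k) {
  # Sw^-1 %*% Sb is generally not symmetric, so eigen() may return complex
  # values ordered by modulus; order by the real part explicitly.
  ord <- order(Re(eigenResult$values), decreasing = TRUE)
  W <- Re(eigenResult$vectors[, ord[seq_len(k)], drop = FALSE])
  data %*% W
}

# Toy inputs, made up purely to show the call
set.seed(1)
X <- matrix(rnorm(20 * 3), ncol = 3)
Sw <- diag(3)
Sb <- diag(c(2, 1, 0))
ev <- eigen(solve(Sw, Sb))
Z <- ProjectDataSketch(X, ev, k = 2)
dim(Z)
```

In this toy case the leading eigenvectors are the first two coordinate axes, so the projection simply keeps the first two columns of the data (up to sign).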
The ComputeWithinScatter and ComputeBetweenScatter functions were modified to take a label parameter, and ComputeBetweenScatter was additionally modified to take a mean parameter, because the code repeatedly threw errors without these changes.
We experimented with two techniques for dimensionality reduction, PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis), using a provided dataset of wines. Each wine in the dataset is characterized by the following thirteen attributes:
- Alcohol
- Malic Acid
- Ash
- Alkalinity of Ash
- Magnesium
- Total Phenols
- Flavanoids
- Nonflavanoid Phenols
- Proanthocyanins
- Color Intensity
- Hue
- OD280/OD315 of Diluted Wines
- Proline
After inspecting the dataset (e.g., by printing its values), we can observe significant differences between the values of attributes V2-V14, with higher values noted for attributes such as V14 and V11. The training accuracy matches the LDA classification accuracy reported in the wine_info.txt document. However, since the accLDA value is not equal to 1, the wines are not perfectly linearly separated into classes, so it would be appropriate to consider a more suitable method for this dataset. According to the dataset description: "The data was used with many others for comparing various classifiers. The classes are separable, though only RDA has achieved 100% correct classification. (RDA: 100%, QDA: 99.4%, LDA: 98.9%, 1NN: 96.1% (z-transformed data)) (All results using the leave-one-out technique)" A more suitable method might therefore be, for example, QDA (99.4% accuracy) or RDA (100%).
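The leave-one-out comparison of LDA and QDA mentioned in the quote can be sketched with the `lda` and `qda` functions from the MASS package, which return leave-one-out predictions when called with `CV = TRUE`. This is only an illustration: it runs on the built-in `iris` data as a stand-in, not on the wine dataset, and the helper name `LooAccuracySketch` is an assumption. The resulting numbers therefore do not correspond to the accuracies quoted above.

```r
library(MASS)  # provides lda() and qda()

# Hypothetical helper (name is an assumption): leave-one-out accuracy
# of a MASS classifier such as lda or qda.
LooAccuracySketch <- function(fitFun, formula, data, truth) {
  cv <- fitFun(formula, data = data, CV = TRUE)  # leave-one-out predictions
  mean(cv$class == truth)
}

# Stand-in data: iris, NOT the wine dataset from this report
looLDA <- LooAccuracySketch(lda, Species ~ ., iris, iris$Species)
looQDA <- LooAccuracySketch(qda, Species ~ ., iris, iris$Species)
print(c(LDA = looLDA, QDA = looQDA))
```

To apply the same comparison to the wine data, the formula and data frame would be replaced by the class column and the thirteen attributes V2-V14.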