README: PCA and KPCA Implementation and Analysis Hello! In this project, I've explored dimensionality reduction techniques using Principal Component Analysis (PCA) and Kernel Principal Component Analysis (KPCA) on two datasets: TrainData and TestData. Below, I've detailed the steps I undertook and my findings.
Datasets: I utilized two datasets for this project: TrainData and TestData.
Part 1: Principal Component Analysis (PCA) 1.1 PCA Implementation from Scratch: a. I've written a custom Python code to implement PCA without relying on third-party libraries.
b. My PCA implementation was applied to the TrainData dataset to reduce its dimensionality.
c. To retain a significant amount of variance in the dataset, I carefully selected an appropriate number of principal components.
1.2 PCA using scikit-learn: a. I imported the PCA module from the popular machine learning library, sklearn.
b. Using this module, I then applied PCA to the TrainData dataset.
Part 2: Kernel PCA (KPCA) 2.1 KPCA with RBF Kernel: a. I implemented the Kernel PCA using the Radial Basis Function (RBF) kernel from scratch.
b. Subsequently, this KPCA implementation was applied to the TrainData dataset.
2.2 KPCA with Polynomial Kernel: a. I delved further into KPCA by creating a version that uses the Polynomial kernel, again crafted from scratch.
b. This version was also applied to the TrainData dataset.
2.3 KPCA with Linear Kernel: a. Lastly, I implemented KPCA with a Linear kernel from scratch.
b. And, like the previous implementations, I applied it to the TrainData dataset.
Part 3: Testing and Evaluation 3.1 Applying PCA and KPCA to the Test Dataset: a. I used the trained PCA and KPCA models (both RBF and Polynomial) from the training phase to transform the TestDataset.
b. To ensure consistency, I made certain that the dimensionality reduction performed on the test data matched the training data.
3.2 Classification Experiment: a. For classification purposes, I utilized a minimum distance classifier on the TestDataset.
b. I assessed the classification performance using accuracy metrics to determine the effectiveness of my models.
Conclusion: I summarized my results, highlighting the potency of PCA and KPCA in reducing dimensionality. Moreover, I reflected on their impact on classification performance. I also discussed the pros and cons of creating these algorithms from scratch in comparison to employing scikit-learn's pre-built functions.