This repository implements a complete machine learning pipeline for clustering and classification tasks, fulfilling the requirements of CMPE 544 Assignment 1. It contains:
1. An EM algorithm implementation for clustering synthetic data with a Gaussian Mixture Model (GMM)
2. A full pipeline for classifying Quick Draw sketches into 5 categories, using features extracted from raw images and classifiers implemented from scratch
Implements the EM algorithm to fit a mixture of Gaussian distributions to a dataset:
- Learns parameters for 3 Gaussian components
- Performs soft clustering of the data
- Optimizes the model using maximum likelihood estimation
Implementation details:
- Random initialization strategy for cluster means
- Robust covariance estimation with regularization
- Deterministic convergence criterion based on log-likelihood
- Complete algorithm implementation from scratch (no use of scikit-learn GMM)
Key aspects of the EM implementation:
- E-step: Calculates responsibilities (posterior probabilities) of each point for each cluster
- M-step: Updates model parameters (weights, means, covariances) based on responsibilities
- Numerical stability enhancements with regularized covariance matrices
- Log-space calculations to prevent underflow issues
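The E-step/M-step loop with log-space responsibilities can be sketched in a few lines of NumPy/SciPy. This is an illustrative simplification, not the repo's `em.py`: the `em_gmm` name is assumed, and a fixed iteration count stands in for the log-likelihood convergence check.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k=3, n_iter=50, reg=1e-6, seed=0):
    """Minimal EM for a Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, k, replace=False)]            # random init from data points
    covs = np.array([np.cov(X.T) + reg * np.eye(d)] * k)  # regularized covariances
    weights = np.full(k, 1.0 / k)
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities computed in log space to avoid underflow
        log_p = np.stack([
            np.log(weights[j]) + multivariate_normal.logpdf(X, means[j], covs[j])
            for j in range(k)
        ], axis=1)                                        # shape (n, k)
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        log_liks.append(log_norm.sum())
        # M-step: update weights, means, covariances from responsibilities
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
    return weights, means, covs, log_liks
```

EM guarantees the log-likelihood is non-decreasing across iterations, which is what the convergence plot visualizes.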
Evaluates clustering quality using:
- Silhouette score for cluster evaluation
- Log-likelihood convergence visualization
- Covariance ellipse visualization
Includes visualizations of:
- Data scatter plot
- Cluster assignments with different colors
- Estimated Gaussian distributions with covariance ellipses
- Log-likelihood convergence over iterations
```bash
python em.py
```

This part focuses on classifying sketches from a subset of the Quick Draw dataset, which includes 5 classes:
- Rabbit (0)
- Yoga (1)
- Hand (2)
- Snowman (3)
- Motorbike (4)
The complete pipeline includes:
- Multiple feature extraction and preprocessing techniques
- Dimensionality reduction methods
- Unsupervised learning (clustering) with Expectation-Maximization
- Supervised learning with three different classifier implementations
The dataset consists of 28×28 grayscale images of hand-drawn sketches:
- Training set: 20,000 images (4,000 per class)
- Test set: 5,000 images (1,000 per class)
Each sketch is represented as a 28×28 pixel grayscale image where pixel values range from 0 (white) to 255 (black).
```python
import numpy as np

train_images = np.load('quickdraw_subset_np/train_images.npy')
train_labels = np.load('quickdraw_subset_np/train_labels.npy')
test_images = np.load('quickdraw_subset_np/test_images.npy')
test_labels = np.load('quickdraw_subset_np/test_labels.npy')

print(train_images.shape)  # (20000, 28, 28)
print(test_images.shape)   # (5000, 28, 28)
```

```python
from PIL import Image
import matplotlib.pyplot as plt

# Display a single image
random_image = np.random.randint(0, train_images.shape[0], size=1)[0]
Image.fromarray(train_images[random_image]).show()

# Or display multiple images with matplotlib
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for i, class_idx in enumerate(range(5)):
    indices = np.where(train_labels == class_idx)[0]
    img_idx = np.random.choice(indices)
    axes[i].imshow(train_images[img_idx], cmap='gray')
    axes[i].set_title(f'Class {class_idx}')
    axes[i].axis('off')
plt.tight_layout()
plt.show()
```

This module extracts multiple types of handcrafted features from the sketch images:
- Basic statistics (mean, standard deviation, skewness, kurtosis)
- Foreground pixel count
- Center of mass (x, y coordinates)
- Divides each image into zones (e.g., 4×4 grid)
- Extracts statistics for each zone (mean, std, foreground pixels)
- Captures spatial distribution of the sketch
- Contour-based features (area, perimeter, compactness)
- Hu moments (7 rotation, scale, and translation invariant moments)
- Aspect ratio of bounding rectangle
- Local Binary Patterns (LBP): Captures local texture patterns
- Gray-Level Co-occurrence Matrix (GLCM): Captures texture based on pixel pair relationships
- Haralick features derived from GLCM (contrast, correlation, energy, homogeneity)
Complete feature vector dimensionality: ~100-150 features
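As a concrete illustration of the zoning idea described above, here is a minimal sketch; the `zone_features` name and the exact per-zone statistics are assumptions for illustration, not the repo's actual code.

```python
import numpy as np

def zone_features(img, grid=4):
    """Split an image into grid x grid zones and compute per-zone
    mean, std, and foreground-pixel count (illustrative sketch)."""
    h, w = img.shape
    zh, zw = h // grid, w // grid
    feats = []
    for i in range(grid):
        for j in range(grid):
            zone = img[i * zh:(i + 1) * zh, j * zw:(j + 1) * zw].astype(float)
            feats.extend([zone.mean(), zone.std(), float((zone > 0).sum())])
    return np.array(feats)
```

For a 28×28 image with a 4×4 grid this yields 16 zones × 3 statistics = 48 features, capturing where on the canvas the strokes fall.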
Histogram of Oriented Gradients (HOG) is a feature descriptor that:
- Captures the distribution of gradient directions in the image
- Is particularly effective for shape detection
- Provides robustness to illumination changes
HOG Implementation Details:
- Orientations: 9 (number of gradient orientation bins)
- Pixels per cell: (4, 4) (cell size for computing histograms)
- Cells per block: (2, 2) (normalization block size)
- PCA applied to reduce dimensions while preserving 90% of variance
The HOG feature extraction process:
- Divide the image into small cells
- Calculate gradient magnitude and orientation for each pixel
- Create histograms of gradient orientations for each cell
- Normalize histograms across blocks of cells
- Flatten into a feature vector
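Using scikit-image (listed in the requirements), the process above with the stated parameters might look like this sketch; the random array stands in for a real sketch image.

```python
import numpy as np
from skimage.feature import hog

# Stand-in for one 28x28 sketch; real code would use a training image
img = np.random.rand(28, 28)
features = hog(img, orientations=9, pixels_per_cell=(4, 4),
               cells_per_block=(2, 2), block_norm='L2-Hys')
# 7x7 cells -> 6x6 overlapping blocks, each 2*2*9 = 36 values -> 1296 features
```

The resulting 1296-dimensional vector is what PCA then reduces while preserving 90% of the variance.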
Applies Principal Component Analysis directly to flattened images:
- Flattens 28×28 images into 784-dimensional vectors
- Applies PCA to reduce dimensions while preserving 95% of variance
- Visualizes principal components as images
- Analyzes eigenfaces/eigensketches (principal components)
- Implements feature reconstruction to visualize how PCA represents the sketches
- Analyzes class separability using PCA features
The implementation includes:
- Comprehensive analysis of explained variance
- Visualization of principal components as images
- Image reconstruction with varying numbers of components
- t-SNE visualization for class separability analysis
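A minimal sketch of this PCA-on-raw-pixels step using scikit-learn, with random data standing in for the flattened sketches:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 784)          # stand-in for 200 flattened 28x28 images
pca = PCA(n_components=0.95)          # keep enough components for 95% variance
X_red = pca.fit_transform(X)          # reduced representation
X_rec = pca.inverse_transform(X_red)  # approximate reconstruction
```

`inverse_transform` is what makes the reconstruction visualizations possible: projecting back to 784 dimensions and reshaping to 28×28 shows how much detail the retained components preserve.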
Implements a KNN classifier from scratch with:
- Four distance metrics:
  - Euclidean: standard straight-line distance, `sqrt(sum((x - y)^2))`
  - Manhattan: sum of absolute differences, `sum(|x - y|)`
  - Cosine: one minus the cosine of the angle between vectors, `1 - (x·y)/(||x||·||y||)`
  - Chebyshev: maximum coordinate difference, `max(|x - y|)`
- Vectorized implementation for efficiency:
  - Uses SciPy's `cdist` for fast distance calculations
  - Handles large datasets efficiently
- Hyperparameter tuning:
  - Cross-validation for optimal k value selection
  - Distance metric selection
  - Detailed performance analysis across parameters
- In-depth evaluation:
  - Confusion matrices for classification errors
  - Performance visualization across different k values
  - Trade-off analysis between accuracy and computational efficiency
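The vectorized distance-plus-vote structure can be sketched as follows; this is an illustrative simplification, not the repo's `knn.py`, and note that `bincount` + `argmax` resolves ties toward the smallest label rather than by first occurrence.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_predict(X_train, y_train, X_test, k=5, metric='euclidean'):
    """Vectorized KNN sketch: all test-train distances in one cdist call."""
    dists = cdist(X_test, X_train, metric=metric)   # shape (n_test, n_train)
    nearest = np.argsort(dists, axis=1)[:, :k]      # indices of k nearest neighbors
    votes = y_train[nearest]                        # (n_test, k) neighbor labels
    # majority vote per test point
    return np.array([np.bincount(v).argmax() for v in votes])
```

`cdist` accepts `'euclidean'`, `'cityblock'` (Manhattan), `'cosine'`, and `'chebyshev'`, covering all four metrics above.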
Implements multi-class logistic regression from scratch with:
- One-vs-rest strategy for multi-class classification
- Mini-batch gradient descent optimization
- Comprehensive regularization options:
- L1 regularization (Lasso): Encourages sparsity
- L2 regularization (Ridge): Prevents overfitting
- Configurable regularization strength (λ)
Implementation details:
- Numerical stability enhancements:
  - Log-sum-exp trick for stable softmax
  - Gradient clipping
  - Proper weight initialization
- Loss function:
  - Binary cross-entropy for binary classification
  - Categorical cross-entropy for multi-class
  - Regularization penalty terms
- Convergence criteria:
  - Tolerance-based early stopping
  - Maximum iteration limit
- Analysis of feature weights:
  - Visualization of the most important features
  - Class-specific feature importance
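The stable softmax and the regularized cross-entropy gradient can be sketched as below; the function names are illustrative, not the repo's `lrc.py` API.

```python
import numpy as np

def stable_softmax(z):
    """Log-sum-exp trick: subtract the row max before exponentiating."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_grad(X, y_onehot, W, reg_type=None, lam=0.0):
    """Gradient of categorical cross-entropy with optional L1/L2 penalty (sketch)."""
    p = stable_softmax(X @ W)
    grad = X.T @ (p - y_onehot) / X.shape[0]
    if reg_type == 'l2':
        grad += lam * W                  # Ridge: shrink weights toward zero
    elif reg_type == 'l1':
        grad += lam * np.sign(W)         # Lasso: constant pull, encourages sparsity
    return grad
```

Subtracting the row maximum leaves the softmax output unchanged but keeps `np.exp` from overflowing on large logits, which is exactly the underflow/overflow issue the log-sum-exp trick addresses.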
Implements Gaussian Naive Bayes classifier from scratch:
- Assumes features follow normal distributions within each class
- Efficient computation of class posteriors
- Vectorized implementation for speed
Advanced features:
- Feature selection variant that selects the most informative features:
  - Uses mutual information to rank features
  - Configurable number of features to select
  - Analysis of accuracy vs. feature count
- Log-space computation:
  - Prevents numerical underflow with log-probabilities
  - Improves numerical stability
- Thorough evaluation:
  - Comparison between standard and feature-selected variants
  - Analysis of optimal feature count
  - Visualization of decision boundaries
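The Gaussian NB core with log-space posteriors can be sketched as follows; this is an illustrative class, not the repo's `nbc.py`, and the variance floor of `1e-9` is an assumed regularizer.

```python
import numpy as np

class GaussianNB:
    """Gaussian Naive Bayes sketch: per-class feature means/variances,
    posteriors computed entirely in log space (illustrative)."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        self.log_prior = np.log(np.array([(y == c).mean() for c in self.classes]))
        return self

    def predict(self, X):
        # sum of per-feature log N(x; mu, var), plus log prior, per class
        ll = -0.5 * (np.log(2 * np.pi * self.var)[None]
                     + (X[:, None, :] - self.mu[None]) ** 2 / self.var[None]).sum(-1)
        return self.classes[np.argmax(ll + self.log_prior, axis=1)]
```

Summing log-probabilities instead of multiplying probabilities is what keeps hundreds of per-feature terms from underflowing to zero.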
The system follows a modular design with three main stages:
1. Preprocessing and Feature Extraction
   - Image normalization
   - Feature extraction (HOG, basic features, PCA)
   - Feature standardization
   - Dimensionality reduction
2. Classifier Training
   - Cross-validation for hyperparameter tuning
   - Model training with optimal parameters
   - Model evaluation on the test set
3. Evaluation and Analysis
   - Accuracy, precision, recall, F1-score
   - Confusion matrix visualization
   - Class-specific performance analysis
   - Model comparison
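The evaluation metrics in the analysis stage can be sketched from the confusion matrix alone; the helper names here are illustrative, not the repo's API.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=5):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_metrics(cm):
    """Per-class precision, recall, and F1 derived from the confusion matrix."""
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # column sums: predicted counts
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # row sums: true counts
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f1
```

Accuracy is simply `np.diag(cm).sum() / cm.sum()`, so one confusion matrix per classifier is enough to fill the whole comparison table.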
```bash
# For basic features
python c_pp_features.py

# For HOG features
python c_pp_hog.py

# For PCA on raw images
python c_pp_pca.py
```

```bash
python em.py
```

```bash
# For K-Nearest Neighbors
python knn.py

# For Logistic Regression
python lrc.py

# For Naive Bayes
python nbc.py
```

Each classifier has a configuration section at the top of its file that allows customization:
```python
CONFIG = {
    'feature_type': 'hog',   # 'hog', 'pca', 'features'
    'preprocessing': 'pca',  # 'raw', 'std', 'pca'
    'cv_folds': 5,
    # ...
}
```

```python
CONFIG = {
    'feature_type': 'hog',
    'preprocessing': 'pca',
    'learning_rate': 0.01,
    'max_iter': 500,
    'batch_size': 200,
    'tol': 1e-4,
    'regularization_configs': [
        {'reg_type': None, 'reg_strength': 0.0, 'name': 'No Regularization'},
        {'reg_type': 'l2', 'reg_strength': 0.001, 'name': 'L2 (λ=0.001)'},
        # ...
    ]
}
```

```python
CONFIG = {
    'feature_type': 'pca',
    'preprocessing': 'pca',
    'feature_selection': True,
    'feature_counts_to_test': [1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 25, 30, 40, 50, 60],
    # ...
}
```

The system generates comprehensive evaluation results:
- Feature type distribution
- Class separability with different feature types
- Intra-class variations and inter-class distances
- Accuracy, training time, inference time
- Optimal hyperparameters
- Regularization effects (for logistic regression)
- Feature selection impact (for naive Bayes)
All results are saved to the model_evaluation_results directory with sub-directories for each feature type and classifier:
```
model_evaluation_results/
├── hog_pca/
│   ├── accelerated_knn/
│   ├── comparison/
│   ├── lrc/
│   └── nbc/
├── pca_pca/
│   ├── accelerated_knn/
│   ├── comparison/
│   └── nbc/
```
The system evaluation generated detailed performance metrics for all classifiers using HOG features reduced with PCA:
- KNN:
  - Best distance metric: Euclidean
  - Optimal k value: 5
  - Test accuracy: 0.9046
  - Fast inference using KD-Tree acceleration
- Logistic Regression:
  - Best regularization: L2 with λ=0.001
  - Test accuracy: 0.9084
  - Feature importance visualizations show distinctive patterns for each class
- Naive Bayes:
  - Standard Gaussian NB accuracy: 0.8872
  - Feature selection improved accuracy to 0.9026
  - Optimal feature count: 12

Performance metrics for classifiers using raw image features reduced with PCA:
- KNN:
  - Best distance metric: Cosine
  - Test accuracy: 0.8594
  - Different optimal metric than with HOG features
- Naive Bayes:
  - Feature selection significantly improved performance
  - Optimal feature count: 30
  - Test accuracy: 0.8406
The project requires the following Python libraries:
- NumPy (1.19+)
- SciPy (1.5+)
- Matplotlib (3.3+)
- scikit-learn (0.23+)
- scikit-image (0.17+)
- PIL/Pillow (8.0+)
- OpenCV (cv2) (4.0+)
- mahotas (1.4+)
- pandas (1.1+)
It is recommended to use a virtual environment to avoid conflicts with other Python projects. Here's how to set up and install all requirements:
```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install all dependencies from requirements.txt
pip install -r requirements.txt
```

After setting up the environment, you can run any of the scripts using Python:

```bash
# To run the Expectation-Maximization clustering algorithm
python em.py

# For feature extraction
python c_pp_features.py  # Basic features
python c_pp_hog.py       # HOG features
python c_pp_pca.py       # PCA on raw images

# For classification algorithms
python knn.py  # K-Nearest Neighbors
python lrc.py  # Logistic Regression
python nbc.py  # Naive Bayes
```

Each script will generate output files in their respective directories, including visualizations and processed data.
```
.
├── c_features/                # Basic features directory
│   ├── pp/                    # Preprocessing visualizations
│   └── processed_data/        # Extracted features
├── c_hog/                     # HOG features directory
│   ├── pp/                    # Preprocessing visualizations
│   └── processed_data/        # Extracted features
├── c_pca/                     # PCA features directory
│   ├── pp/                    # Preprocessing visualizations
│   └── processed_data/        # Extracted features
├── c_pp_features.py           # Basic feature extraction methods
├── c_pp_hog.py                # HOG feature extraction
├── c_pp_pca.py                # PCA on raw images
├── dataset.npy                # Processed dataset for EM
├── em.py                      # EM algorithm implementation
├── em/                        # EM visualization outputs
│   ├── cluster_assignments.png
│   ├── data_scatter_plot.png
│   ├── estimated_gaussians.png
│   └── log_likelihood_convergence.png
├── knn.py                     # KNN classifier implementation
├── lrc.py                     # Logistic Regression implementation
├── nbc.py                     # Naive Bayes implementation
├── model_evaluation_results/  # Generated evaluation results
│   ├── hog_pca/               # Results for HOG features with PCA
│   └── pca_pca/               # Results for PCA on raw images
└── quickdraw_subset_np/       # Original dataset
    ├── README.md
    ├── sample_image.png
    ├── test_images.npy
    ├── test_labels.npy
    ├── train_images.npy
    └── train_labels.npy
```
- All feature extraction methods include data standardization
- PCA applied for dimensionality reduction
- Feature outputs saved in raw, standardized, and PCA-reduced formats
- Distance calculations use vectorized operations for efficiency
- Cross-validation implemented for hyperparameter tuning
- Handles ties in voting through first-occurrence preference
- Accelerated implementation using KD-Tree and Ball Tree
- Includes L1 and L2 regularization with configurable strength
- Mini-batch gradient descent for better convergence
- Numerically stable implementations of softmax and cross-entropy
- Visualizations of class-specific feature weights
- Gaussian probability distribution estimation for each feature
- Feature selection using mutual information
- Log-space computations for numerical stability
- Analysis of optimal feature count
The classifiers show different strengths depending on the feature type:
- KNN:
  - Performs well with HOG features
  - Best distance metric depends on feature type
  - Accuracy decreases with very low or very high k values
- Logistic Regression:
  - Strong performance with regularization
  - L2 regularization generally outperforms L1
  - More sensitive to feature quality than KNN
- Naive Bayes:
  - Feature selection significantly improves performance
  - Works well with independent features
  - Fastest inference time of all classifiers
Overall, the combination of HOG features with KNN or logistic regression tends to provide the best accuracy, while Naive Bayes with feature selection offers the best accuracy/speed tradeoff.
