Assignment 1: Expectation-Maximization and Sketch Classification

This repository implements a complete machine learning pipeline for clustering and classification tasks, fulfilling the requirements of CMPE 544 Assignment 1. It contains:

  1. An EM algorithm implementation for clustering synthetic data with a Gaussian Mixture Model (GMM)
  2. A full pipeline for classifying Quick Draw sketches into 5 categories, using features extracted from raw images and classifiers implemented from scratch

Part 1: Expectation-Maximization for Gaussian Mixtures (em.py)

Implements the EM algorithm to fit a mixture of Gaussian distributions to a dataset:

  • Learns parameters for 3 Gaussian components
  • Performs soft clustering of the data
  • Optimizes the model using maximum likelihood estimation

Implementation details:

  • Random initialization strategy for cluster means
  • Robust covariance estimation with regularization
  • Deterministic convergence criterion based on log-likelihood
  • Complete algorithm implementation from scratch (no use of scikit-learn GMM)

Key aspects of the EM implementation:

  • E-step: Calculates responsibilities (posterior probabilities) of each point for each cluster
  • M-step: Updates model parameters (weights, means, covariances) based on responsibilities
  • Numerical stability enhancements with regularized covariance matrices
  • Log-space calculations to prevent underflow issues
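The E- and M-steps above can be condensed into a short NumPy/SciPy sketch. This is an illustrative simplification under stated assumptions, not the repository's em.py; the `em_gmm` name and its defaults are hypothetical:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k=3, n_iter=100, tol=1e-6, reg=1e-6, seed=0):
    """Minimal EM for a Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(k, 1.0 / k)
    means = X[rng.choice(n, k, replace=False)]           # random init from data points
    covs = np.array([np.cov(X.T) + reg * np.eye(d)] * k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities from log-space densities (avoids underflow)
        log_p = np.column_stack([
            np.log(weights[j]) + multivariate_normal.logpdf(X, means[j], covs[j])
            for j in range(k)
        ])
        ll = np.logaddexp.reduce(log_p, axis=1)          # per-point log-likelihood
        resp = np.exp(log_p - ll[:, None])
        # M-step: update weights, means, and regularized covariances
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j][:, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
        if abs(ll.sum() - prev_ll) < tol:                # log-likelihood convergence
            break
        prev_ll = ll.sum()
    return weights, means, covs, resp
```

The `reg * np.eye(d)` term is the covariance regularization mentioned above; it keeps each covariance matrix invertible even when a component collapses onto few points.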

Evaluates clustering quality using:

  • Silhouette score for cluster evaluation
  • Log-likelihood convergence visualization
  • Covariance ellipse visualization

Includes visualizations of:

  • Data scatter plot
  • Cluster assignments with different colors
  • Estimated Gaussian distributions with covariance ellipses
  • Log-likelihood convergence over iterations

Usage

python em.py

Part 2: Quick Draw Sketch Classification

This part focuses on classifying sketches from a subset of the Quick Draw dataset, which includes 5 classes:

  • Rabbit (0)
  • Yoga (1)
  • Hand (2)
  • Snowman (3)
  • Motorbike (4)

The complete pipeline includes:

  1. Multiple feature extraction and preprocessing techniques
  2. Dimensionality reduction methods
  3. Unsupervised learning (clustering) with Expectation-Maximization
  4. Supervised learning with three different classifier implementations

Dataset Details

The dataset consists of 28×28 grayscale images of hand-drawn sketches:

  • Training set: 20,000 images (4,000 per class)
  • Test set: 5,000 images (1,000 per class)

Each sketch is represented as a 28×28 pixel grayscale image where pixel values range from 0 (white) to 255 (black).

Loading Data

import numpy as np

train_images = np.load('quickdraw_subset_np/train_images.npy')
train_labels = np.load('quickdraw_subset_np/train_labels.npy')
test_images = np.load('quickdraw_subset_np/test_images.npy')
test_labels = np.load('quickdraw_subset_np/test_labels.npy')

print(train_images.shape)  # (20000, 28, 28)
print(test_images.shape)   # (5000, 28, 28)

Visualizing Images

from PIL import Image
import matplotlib.pyplot as plt

# Display a single image
random_image = np.random.randint(0, train_images.shape[0], size=1)[0]
Image.fromarray(train_images[random_image]).show()

# Or display multiple images with matplotlib
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for i, class_idx in enumerate(range(5)):
    indices = np.where(train_labels == class_idx)[0]
    img_idx = np.random.choice(indices)
    axes[i].imshow(train_images[img_idx], cmap='gray')
    axes[i].set_title(f'Class {class_idx}')
    axes[i].axis('off')
plt.tight_layout()
plt.show()

Sample Image

A sample sketch is provided at quickdraw_subset_np/sample_image.png.

Feature Extraction Methods

1. Basic Features (c_pp_features.py)

This module extracts multiple types of handcrafted features from the sketch images:

Statistical Features

  • Basic statistics (mean, standard deviation, skewness, kurtosis)
  • Foreground pixel count
  • Center of mass (x, y coordinates)

Zoning Features

  • Divides each image into zones (e.g., 4×4 grid)
  • Extracts statistics for each zone (mean, std, foreground pixels)
  • Captures spatial distribution of the sketch
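A zoning extractor along these lines can be written in a few lines of NumPy. This is a sketch: the `zoning_features` helper and its exact per-zone statistics are illustrative, not the precise c_pp_features.py code:

```python
import numpy as np

def zoning_features(img, grid=4):
    """Split an image into a grid x grid set of zones and compute
    per-zone statistics (mean, std, foreground pixel count)."""
    h, w = img.shape
    zh, zw = h // grid, w // grid
    feats = []
    for i in range(grid):
        for j in range(grid):
            zone = img[i*zh:(i+1)*zh, j*zw:(j+1)*zw]
            feats += [zone.mean(), zone.std(), np.count_nonzero(zone)]
    return np.array(feats)  # grid*grid*3 values, e.g. 48 for a 4x4 grid
```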

Shape Features

  • Contour-based features (area, perimeter, compactness)
  • Hu moments (7 rotation, scale, and translation invariant moments)
  • Aspect ratio of bounding rectangle

Texture Features

  • Local Binary Patterns (LBP): Captures local texture patterns
  • Gray-Level Co-occurrence Matrix (GLCM): Captures texture based on pixel pair relationships
  • Haralick features derived from GLCM (contrast, correlation, energy, homogeneity)

Complete feature vector dimensionality: ~100-150 features
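As a concrete illustration of the GLCM idea, here is a toy pure-NumPy co-occurrence computation yielding the Haralick contrast. The repository presumably relies on library routines for this; the `glcm_contrast` helper and its quantization scheme are assumptions for demonstration:

```python
import numpy as np

def glcm_contrast(img, levels=8, dx=1, dy=0):
    """Toy gray-level co-occurrence matrix and its Haralick contrast
    (pure-NumPy sketch; offset (dx, dy) defines the pixel-pair relation)."""
    q = (img.astype(float) / 256 * levels).astype(int)   # quantize to `levels` bins
    glcm = np.zeros((levels, levels))
    h, w = q.shape
    for y in range(h - dy):
        for x in range(w - dx):
            glcm[q[y, x], q[y + dy, x + dx]] += 1        # count co-occurring pairs
    glcm /= glcm.sum()                                   # normalize to probabilities
    i, j = np.indices(glcm.shape)
    return ((i - j) ** 2 * glcm).sum()                   # contrast: weight by (i-j)^2
```

A flat image has zero contrast (all pixel pairs are identical), while alternating stripes maximize it.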

2. HOG Features (c_pp_hog.py)

Histogram of Oriented Gradients (HOG) is a feature descriptor that:

  • Captures the distribution of gradient directions in the image
  • Is particularly effective for shape detection
  • Provides robustness to illumination changes

HOG Implementation Details:

  • Orientations: 9 (number of gradient orientation bins)
  • Pixels per cell: (4, 4) (cell size for computing histograms)
  • Cells per block: (2, 2) (normalization block size)
  • PCA applied to reduce dimensions while preserving 90% of variance

The HOG feature extraction process:

  1. Divide the image into small cells
  2. Calculate gradient magnitude and orientation for each pixel
  3. Create histograms of gradient orientations for each cell
  4. Normalize histograms across blocks of cells
  5. Flatten into a feature vector
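Steps 1-3 of this process can be sketched in plain NumPy (a simplified illustration: block normalization from step 4 is omitted for brevity, and the `hog_cells` helper is hypothetical, not the scikit-image routine the pipeline likely uses):

```python
import numpy as np

def hog_cells(img, cell=4, bins=9):
    """Simplified HOG: per-cell unsigned-orientation histograms,
    weighted by gradient magnitude (no block normalization)."""
    gy, gx = np.gradient(img.astype(float))              # step 2: gradients
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180           # unsigned orientation [0, 180)
    h, w = img.shape
    hist = np.zeros((h // cell, w // cell, bins))
    bin_idx = (ang / (180 / bins)).astype(int) % bins    # orientation bin per pixel
    for i in range(h // cell):                           # step 1: cells
        for j in range(w // cell):
            sl = np.s_[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            hist[i, j] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=bins)     # step 3: histograms
    return hist.ravel()                                  # step 5: flatten
```

For a 28×28 image with 4×4 cells and 9 bins this yields a 7×7×9 = 441-dimensional vector before block normalization.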

3. PCA on Raw Images (c_pp_pca.py)

Applies Principal Component Analysis directly to flattened images:

  • Flattens 28×28 images into 784-dimensional vectors
  • Applies PCA to reduce dimensions while preserving 95% of variance
  • Visualizes principal components as images
  • Analyzes eigenfaces/eigensketches (principal components)
  • Implements feature reconstruction to visualize how PCA represents the sketches
  • Analyzes class separability using PCA features

The implementation includes:

  • Comprehensive analysis of explained variance
  • Visualization of principal components as images
  • Image reconstruction with varying numbers of components
  • t-SNE visualization for class separability analysis
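The core PCA step, keeping enough components to explain a target fraction of variance, can be sketched with an SVD on the centered data (an illustrative sketch; `pca_fit` is a hypothetical helper, not the repository's implementation):

```python
import numpy as np

def pca_fit(X, var_ratio=0.95):
    """PCA via SVD on centered data; keep the smallest number of
    components whose cumulative explained variance reaches var_ratio."""
    mean = X.mean(axis=0)
    Xc = X - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / (S**2).sum()                      # variance ratio per component
    k = int(np.searchsorted(np.cumsum(explained), var_ratio)) + 1
    return mean, Vt[:k], Xc @ Vt[:k].T                   # mean, components, projections
```

For the raw sketches, each row of `Vt[:k]` can be reshaped back to 28×28 and displayed as an "eigensketch", and reconstruction is `Z @ Vt[:k] + mean`.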

Classification Algorithms

1. K-Nearest Neighbors (knn.py)

Implements KNN classifier from scratch with:

  • Four distance metrics:

    • Euclidean: Standard straight-line distance (sqrt(sum((x - y)^2)))
    • Manhattan: Sum of absolute differences (sum(|x - y|))
    • Cosine: Angle between vectors (1 - (x·y)/(||x||·||y||))
    • Chebyshev: Maximum coordinate difference (max(|x - y|))
  • Vectorized implementation for efficiency:

    • Uses SciPy's cdist for fast distance calculations
    • Handles large datasets efficiently
  • Hyperparameter tuning:

    • Cross-validation for optimal k value selection
    • Distance metric selection
    • Detailed performance analysis across parameters
  • In-depth evaluation:

    • Confusion matrices for classification errors
    • Performance visualization across different k values
    • Trade-off analysis between accuracy and computational efficiency
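The vectorized prediction path can be sketched with `scipy.spatial.distance.cdist` (a simplified sketch: `knn_predict` is hypothetical, and ties here fall to `np.bincount(...).argmax()`'s lowest-label preference rather than the repository's exact tie-breaking):

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_predict(X_train, y_train, X_test, k=5, metric='euclidean'):
    """Vectorized KNN: one cdist call computes all test-train distances,
    then each test point takes a majority vote over its k nearest labels."""
    D = cdist(X_test, X_train, metric=metric)            # (n_test, n_train) distances
    nn = np.argsort(D, axis=1)[:, :k]                    # indices of the k nearest
    votes = y_train[nn]
    return np.array([np.bincount(v).argmax() for v in votes])
```

Swapping `metric` between 'euclidean', 'cityblock' (Manhattan), 'cosine', and 'chebyshev' covers the four metrics listed above.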

2. Logistic Regression (lrc.py)

Implements multi-class logistic regression from scratch with:

  • One-vs-rest strategy for multi-class classification
  • Mini-batch gradient descent optimization
  • Comprehensive regularization options:
    • L1 regularization (Lasso): Encourages sparsity
    • L2 regularization (Ridge): Prevents overfitting
    • Configurable regularization strength (λ)

Implementation details:

  • Numerical stability enhancements:

    • Log-sum-exp trick for stable softmax
    • Gradient clipping
    • Proper initialization
  • Loss function:

    • Binary cross-entropy for binary classification
    • Categorical cross-entropy for multi-class
    • Regularization penalty terms
  • Convergence criteria:

    • Tolerance-based early stopping
    • Maximum iteration limit
  • Analysis of feature weights:

    • Visualization of most important features
    • Class-specific feature importance
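The numerically stable softmax and the regularized gradient update can be sketched as follows (a simplified full-batch, direct-softmax sketch under stated assumptions; the repository's version uses mini-batches and a one-vs-rest strategy, and `train_softmax` is a hypothetical name):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting the row max is the
    log-sum-exp shift that prevents overflow in exp."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, y, n_classes, lr=0.1, n_iter=200, lam=1e-3):
    """Softmax regression with an L2 penalty, full-batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]                             # one-hot targets
    for _ in range(n_iter):
        P = softmax(X @ W)
        grad = X.T @ (P - Y) / n + lam * W               # cross-entropy + L2 gradient
        W -= lr * grad
    return W
```

Setting `lam=0` removes regularization; an L1 variant would add `lam * np.sign(W)` instead.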

3. Gaussian Naive Bayes (nbc.py)

Implements Gaussian Naive Bayes classifier from scratch:

  • Assumes features follow normal distributions within each class
  • Efficient computation of class posteriors
  • Vectorized implementation for speed

Advanced features:

  • Feature selection variant that selects most informative features:

    • Uses mutual information to rank features
    • Configurable number of features to select
    • Analysis of accuracy vs. feature count
  • Log-space computation:

    • Prevents numerical underflow with log-probabilities
    • Improves numerical stability
  • Thorough evaluation:

    • Comparison between standard and feature-selected variants
    • Analysis of optimal feature count
    • Visualization of decision boundaries
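The per-class Gaussian fit and the log-space posterior can be sketched in a compact class (an illustrative sketch, not the repository's nbc.py; the small variance floor `1e-9` is an assumption for numerical safety):

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian Naive Bayes with log-space posteriors."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.means = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.vars = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        self.log_priors = np.log([np.mean(y == c) for c in self.classes])
        return self

    def predict(self, X):
        # log P(c|x) ∝ log P(c) + sum over features of log N(x_f; mu_cf, var_cf)
        ll = -0.5 * (np.log(2 * np.pi * self.vars)[None]
                     + (X[:, None, :] - self.means[None])**2 / self.vars[None]).sum(-1)
        return self.classes[np.argmax(ll + self.log_priors, axis=1)]
```

Summing log-probabilities over features (rather than multiplying densities) is what prevents underflow when the feature count is large.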

Detailed System Architecture

The system follows a modular design with three main stages:

  1. Preprocessing and Feature Extraction

    • Image normalization
    • Feature extraction (HOG, basic features, PCA)
    • Feature standardization
    • Dimensionality reduction
  2. Classifier Training

    • Cross-validation for hyperparameter tuning
    • Model training with optimal parameters
    • Model evaluation on test set
  3. Evaluation and Analysis

    • Accuracy, precision, recall, F1-score
    • Confusion matrix visualization
    • Class-specific performance analysis
    • Model comparison

Usage

Feature Extraction

# For basic features
python c_pp_features.py

# For HOG features
python c_pp_hog.py

# For PCA on raw images
python c_pp_pca.py

EM Clustering

python em.py

Classification

# For K-Nearest Neighbors
python knn.py

# For Logistic Regression
python lrc.py

# For Naive Bayes
python nbc.py

Configuration

Each classifier has a configuration section at the top of its file that allows customization:

KNN Configuration

CONFIG = {
    'feature_type': 'hog',  # 'hog', 'pca', 'features'
    'preprocessing': 'pca',  # 'raw', 'std', 'pca'
    'cv_folds': 5,
    # ...
}

Logistic Regression Configuration

CONFIG = {
    'feature_type': 'hog',
    'preprocessing': 'pca',
    'learning_rate': 0.01,
    'max_iter': 500,
    'batch_size': 200,
    'tol': 1e-4,
    'regularization_configs': [
        {'reg_type': None, 'reg_strength': 0.0, 'name': 'No Regularization'},
        {'reg_type': 'l2', 'reg_strength': 0.001, 'name': 'L2 (λ=0.001)'},
        # ...
    ]
}

Naive Bayes Configuration

CONFIG = {
    'feature_type': 'pca',
    'preprocessing': 'pca',
    'feature_selection': True,
    'feature_counts_to_test': [1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 25, 30, 40, 50, 60],
    # ...
}

Results and Analysis

The system generates comprehensive evaluation results:

Feature Extraction Analysis

  • Feature type distribution
  • Class separability with different feature types
  • Intra-class variations and inter-class distances

Classifier Performance

  • Accuracy, training time, inference time
  • Optimal hyperparameters
  • Regularization effects (for logistic regression)
  • Feature selection impact (for naive Bayes)

Results Structure

All results are saved to the model_evaluation_results directory with sub-directories for each feature type and classifier:

model_evaluation_results/
├── hog_pca/
│   ├── accelerated_knn/
│   ├── comparison/
│   ├── lrc/
│   └── nbc/
├── pca_pca/
│   ├── accelerated_knn/
│   ├── comparison/
│   └── nbc/

Performance Results

HOG+PCA Features

The system evaluation generated detailed performance metrics for all classifiers using HOG features reduced with PCA:

K-Nearest Neighbors

  • Best distance metric: Euclidean
  • Optimal k value: 5
  • Test accuracy: 0.9046
  • Fast inference using KD-Tree acceleration

Logistic Regression

  • Best regularization: L2 with λ=0.001
  • Test accuracy: 0.9084
  • Feature importance visualizations show distinctive patterns for each class

Naive Bayes

  • Standard Gaussian NB accuracy: 0.8872
  • Feature selection improved accuracy to 0.9026
  • Optimal feature count: 12

PCA Features

Performance metrics for classifiers using raw image features reduced with PCA:

K-Nearest Neighbors

  • Best distance metric: Cosine
  • Test accuracy: 0.8594
  • Different optimal metric than with HOG features

Naive Bayes

  • Feature selection significantly improved performance
  • Optimal feature count: 30
  • Test accuracy: 0.8406

Requirements and Installation

The project requires the following Python libraries:

  • NumPy (1.19+)
  • SciPy (1.5+)
  • Matplotlib (3.3+)
  • scikit-learn (0.23+)
  • scikit-image (0.17+)
  • PIL/Pillow (8.0+)
  • OpenCV (cv2) (4.0+)
  • mahotas (1.4+)
  • pandas (1.1+)

Setting up a Virtual Environment

It is recommended to use a virtual environment to avoid conflicts with other Python projects. Here's how to set up and install all requirements:

# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install all dependencies from requirements.txt
pip install -r requirements.txt

Running the Project

After setting up the environment, you can run any of the scripts using Python:

# To run the Expectation-Maximization clustering algorithm
python em.py

# For feature extraction
python c_pp_features.py  # Basic features
python c_pp_hog.py       # HOG features
python c_pp_pca.py       # PCA on raw images

# For classification algorithms
python knn.py            # K-Nearest Neighbors
python lrc.py            # Logistic Regression
python nbc.py            # Naive Bayes

Each script will generate output files in their respective directories, including visualizations and processed data.

Project Structure

.
├── c_features/               # Basic features directory
│   ├── pp/                   # Preprocessing visualizations
│   └── processed_data/       # Extracted features
├── c_hog/                    # HOG features directory
│   ├── pp/                   # Preprocessing visualizations
│   └── processed_data/       # Extracted features
├── c_pca/                    # PCA features directory
│   ├── pp/                   # Preprocessing visualizations
│   └── processed_data/       # Extracted features
├── c_pp_features.py          # Basic feature extraction methods
├── c_pp_hog.py               # HOG feature extraction
├── c_pp_pca.py               # PCA on raw images
├── dataset.npy               # Processed dataset for EM
├── em.py                     # EM algorithm implementation
├── em/                       # EM visualization outputs
│   ├── cluster_assignments.png
│   ├── data_scatter_plot.png
│   ├── estimated_gaussians.png
│   └── log_likelihood_convergence.png
├── knn.py                    # KNN classifier implementation
├── lrc.py                    # Logistic Regression implementation
├── nbc.py                    # Naive Bayes implementation
├── model_evaluation_results/ # Generated evaluation results
│   ├── hog_pca/              # Results for HOG features with PCA
│   └── pca_pca/              # Results for PCA on raw images
└── quickdraw_subset_np/      # Original dataset
    ├── README.md
    ├── sample_image.png
    ├── test_images.npy
    ├── test_labels.npy
    ├── train_images.npy
    └── train_labels.npy

Implementation Details

Feature Extraction

  • All feature extraction methods include data standardization
  • PCA applied for dimensionality reduction
  • Feature outputs saved in raw, standardized, and PCA-reduced formats

KNN Classifier

  • Distance calculations use vectorized operations for efficiency
  • Cross-validation implemented for hyperparameter tuning
  • Handles ties in voting through first-occurrence preference
  • Accelerated implementation using KD-Tree and Ball Tree

Logistic Regression

  • Includes L1 and L2 regularization with configurable strength
  • Mini-batch gradient descent for better convergence
  • Numerically stable implementations of softmax and cross-entropy
  • Visualizations of class-specific feature weights

Naive Bayes

  • Gaussian probability distribution estimation for each feature
  • Feature selection using mutual information
  • Log-space computations for numerical stability
  • Analysis of optimal feature count

Comparative Performance

The classifiers show different strengths depending on the feature type:

  1. KNN:

    • Performs well with HOG features
    • Best distance metric depends on feature type
    • Accuracy decreases with very low or very high k values
  2. Logistic Regression:

    • Strong performance with regularization
    • L2 regularization generally outperforms L1
    • More sensitive to feature quality than KNN
  3. Naive Bayes:

    • Feature selection significantly improves performance
    • Works well with independent features
    • Fastest inference time of all classifiers

Overall, the combination of HOG features with KNN or logistic regression tends to provide the best accuracy, while Naive Bayes with feature selection offers the best accuracy/speed tradeoff.
