This repository implements a complete machine learning pipeline for clustering and classification tasks, fulfilling the requirements of CMPE 544 Assignment 1. It contains:
1. An EM algorithm implementation for clustering synthetic data with a Gaussian Mixture Model (GMM)
2. A full pipeline for classifying Quick Draw sketches into 5 categories, using features extracted from raw images and classifiers implemented from scratch
Implements the EM algorithm to fit a mixture of Gaussian distributions to a dataset:
- Learns parameters for 3 Gaussian components
- Performs soft clustering of the data
- Optimizes the model using maximum likelihood estimation
Implementation details:
- Random initialization strategy for cluster means
- Robust covariance estimation with regularization
- Deterministic convergence criterion based on log-likelihood
- Complete algorithm implementation from scratch (no use of scikit-learn GMM)
Key aspects of the EM implementation:
- E-step: Calculates responsibilities (posterior probabilities) of each point for each cluster
- M-step: Updates model parameters (weights, means, covariances) based on responsibilities
- Numerical stability enhancements with regularized covariance matrices
- Log-space calculations to prevent underflow issues
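The E-step/M-step loop with log-space responsibilities can be sketched in a few lines of NumPy/SciPy. This is an illustrative simplification, not the repo's `em.py`: the `em_gmm` name is assumed, and a fixed iteration count stands in for the log-likelihood convergence check.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k=3, n_iter=50, reg=1e-6, seed=0):
    """Minimal EM for a Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, k, replace=False)]            # random init from data points
    covs = np.array([np.cov(X.T) + reg * np.eye(d)] * k)  # regularized covariances
    weights = np.full(k, 1.0 / k)
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities computed in log space to avoid underflow
        log_p = np.stack([
            np.log(weights[j]) + multivariate_normal.logpdf(X, means[j], covs[j])
            for j in range(k)
        ], axis=1)                                        # shape (n, k)
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        log_liks.append(log_norm.sum())
        # M-step: update weights, means, covariances from responsibilities
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
    return weights, means, covs, log_liks
```

EM guarantees the log-likelihood is non-decreasing across iterations, which is what the convergence plot visualizes.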
Evaluates clustering quality using:
- Silhouette score for cluster evaluation
- Log-likelihood convergence visualization
- Covariance ellipse visualization
Includes visualizations of:
- Data scatter plot
- Cluster assignments with different colors
- Estimated Gaussian distributions with covariance ellipses
- Log-likelihood convergence over iterations
```bash
python em.py
```

This part focuses on classifying sketches from a subset of the Quick Draw dataset, which includes 5 classes:
- Rabbit (0)
- Yoga (1)
- Hand (2)
- Snowman (3)
- Motorbike (4)
The complete pipeline includes:
- Multiple feature extraction and preprocessing techniques
- Dimensionality reduction methods
- Unsupervised learning (clustering) with Expectation-Maximization
- Supervised learning with three different classifier implementations
The dataset consists of 28×28 grayscale images of hand-drawn sketches:
- Training set: 20,000 images (4,000 per class)
- Test set: 5,000 images (1,000 per class)
Each sketch is represented as a 28×28 pixel grayscale image where pixel values range from 0 (white) to 255 (black).
```python
import numpy as np

train_images = np.load('quickdraw_subset_np/train_images.npy')
train_labels = np.load('quickdraw_subset_np/train_labels.npy')
test_images = np.load('quickdraw_subset_np/test_images.npy')
test_labels = np.load('quickdraw_subset_np/test_labels.npy')

print(train_images.shape)  # (20000, 28, 28)
print(test_images.shape)   # (5000, 28, 28)
```

```python
from PIL import Image
import matplotlib.pyplot as plt

# Display a single image
random_image = np.random.randint(0, train_images.shape[0], size=1)[0]
Image.fromarray(train_images[random_image]).show()

# Or display multiple images with matplotlib
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for i, class_idx in enumerate(range(5)):
    indices = np.where(train_labels == class_idx)[0]
    img_idx = np.random.choice(indices)
    axes[i].imshow(train_images[img_idx], cmap='gray')
    axes[i].set_title(f'Class {class_idx}')
    axes[i].axis('off')
plt.tight_layout()
plt.show()
```

This module extracts multiple types of handcrafted features from the sketch images:
- Basic statistics (mean, standard deviation, skewness, kurtosis)
- Foreground pixel count
- Center of mass (x, y coordinates)
- Divides each image into zones (e.g., 4×4 grid)
- Extracts statistics for each zone (mean, std, foreground pixels)
- Captures spatial distribution of the sketch
- Contour-based features (area, perimeter, compactness)
- Hu moments (7 rotation, scale, and translation invariant moments)
- Aspect ratio of bounding rectangle
- Local Binary Patterns (LBP): Captures local texture patterns
- Gray-Level Co-occurrence Matrix (GLCM): Captures texture based on pixel pair relationships
- Haralick features derived from GLCM (contrast, correlation, energy, homogeneity)
Complete feature vector dimensionality: ~100-150 features
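As a concrete illustration of the zoning idea described above, here is a minimal sketch; the `zone_features` name and the exact per-zone statistics are assumptions for illustration, not the repo's actual code.

```python
import numpy as np

def zone_features(img, grid=4):
    """Split an image into grid x grid zones and compute per-zone
    mean, std, and foreground-pixel count (illustrative sketch)."""
    h, w = img.shape
    zh, zw = h // grid, w // grid
    feats = []
    for i in range(grid):
        for j in range(grid):
            zone = img[i * zh:(i + 1) * zh, j * zw:(j + 1) * zw].astype(float)
            feats.extend([zone.mean(), zone.std(), float((zone > 0).sum())])
    return np.array(feats)
```

For a 28×28 image with a 4×4 grid this yields 16 zones × 3 statistics = 48 features, capturing where on the canvas the strokes fall.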
Histogram of Oriented Gradients (HOG) is a feature descriptor that:
- Captures the distribution of gradient directions in the image
- Is particularly effective for shape detection
- Provides robustness to illumination changes
HOG Implementation Details:
- Orientations: 9 (number of gradient orientation bins)
- Pixels per cell: (4, 4) (cell size for computing histograms)
- Cells per block: (2, 2) (normalization block size)
- PCA applied to reduce dimensions while preserving 90% of variance
The HOG feature extraction process:
- Divide the image into small cells
- Calculate gradient magnitude and orientation for each pixel
- Create histograms of gradient orientations for each cell
- Normalize histograms across blocks of cells
- Flatten into a feature vector
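Using scikit-image (listed in the requirements), the process above with the stated parameters might look like this sketch; the random array stands in for a real sketch image.

```python
import numpy as np
from skimage.feature import hog

# Stand-in for one 28x28 sketch; real code would use a training image
img = np.random.rand(28, 28)
features = hog(img, orientations=9, pixels_per_cell=(4, 4),
               cells_per_block=(2, 2), block_norm='L2-Hys')
# 7x7 cells -> 6x6 overlapping blocks, each 2*2*9 = 36 values -> 1296 features
```

The resulting 1296-dimensional vector is what PCA then reduces while preserving 90% of the variance.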
Applies Principal Component Analysis directly to flattened images:
- Flattens 28×28 images into 784-dimensional vectors
- Applies PCA to reduce dimensions while preserving 95% of variance
- Visualizes principal components as images
- Analyzes eigenfaces/eigensketches (principal components)
- Implements feature reconstruction to visualize how PCA represents the sketches
- Analyzes class separability using PCA features
The implementation includes:
- Comprehensive analysis of explained variance
- Visualization of principal components as images
- Image reconstruction with varying numbers of components
- t-SNE visualization for class separability analysis
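A minimal sketch of this PCA-on-raw-pixels step using scikit-learn, with random data standing in for the flattened sketches:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 784)          # stand-in for 200 flattened 28x28 images
pca = PCA(n_components=0.95)          # keep enough components for 95% variance
X_red = pca.fit_transform(X)          # reduced representation
X_rec = pca.inverse_transform(X_red)  # approximate reconstruction
```

`inverse_transform` is what makes the reconstruction visualizations possible: projecting back to 784 dimensions and reshaping to 28×28 shows how much detail the retained components preserve.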
Implements a KNN classifier from scratch with:
- Four distance metrics:
  - Euclidean: standard straight-line distance, `sqrt(sum((x - y)^2))`
  - Manhattan: sum of absolute differences, `sum(|x - y|)`
  - Cosine: one minus the cosine of the angle between vectors, `1 - (x·y)/(||x||·||y||)`
  - Chebyshev: maximum coordinate difference, `max(|x - y|)`
- Vectorized implementation for efficiency:
  - Uses SciPy's `cdist` for fast distance calculations
  - Handles large datasets efficiently
- Hyperparameter tuning:
  - Cross-validation for optimal k value selection
  - Distance metric selection
  - Detailed performance analysis across parameters
- In-depth evaluation:
  - Confusion matrices for classification errors
  - Performance visualization across different k values
  - Trade-off analysis between accuracy and computational efficiency
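The vectorized distance-plus-vote structure can be sketched as follows; this is an illustrative simplification, not the repo's `knn.py`, and note that `bincount` + `argmax` resolves ties toward the smallest label rather than by first occurrence.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_predict(X_train, y_train, X_test, k=5, metric='euclidean'):
    """Vectorized KNN sketch: all test-train distances in one cdist call."""
    dists = cdist(X_test, X_train, metric=metric)   # shape (n_test, n_train)
    nearest = np.argsort(dists, axis=1)[:, :k]      # indices of k nearest neighbors
    votes = y_train[nearest]                        # (n_test, k) neighbor labels
    # majority vote per test point
    return np.array([np.bincount(v).argmax() for v in votes])
```

`cdist` accepts `'euclidean'`, `'cityblock'` (Manhattan), `'cosine'`, and `'chebyshev'`, covering all four metrics above.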
Implements multi-class logistic regression from scratch with:
- One-vs-rest strategy for multi-class classification
- Mini-batch gradient descent optimization
- Comprehensive regularization options:
- L1 regularization (Lasso): Encourages sparsity
- L2 regularization (Ridge): Prevents overfitting
- Configurable regularization strength (λ)
Implementation details:
- Numerical stability enhancements:
  - Log-sum-exp trick for stable softmax
  - Gradient clipping
  - Proper weight initialization
- Loss function:
  - Binary cross-entropy for binary classification
  - Categorical cross-entropy for multi-class
  - Regularization penalty terms
- Convergence criteria:
  - Tolerance-based early stopping
  - Maximum iteration limit
- Analysis of feature weights:
  - Visualization of the most important features
  - Class-specific feature importance
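The stable softmax and the regularized cross-entropy gradient can be sketched as below; the function names are illustrative, not the repo's `lrc.py` API.

```python
import numpy as np

def stable_softmax(z):
    """Log-sum-exp trick: subtract the row max before exponentiating."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_grad(X, y_onehot, W, reg_type=None, lam=0.0):
    """Gradient of categorical cross-entropy with optional L1/L2 penalty (sketch)."""
    p = stable_softmax(X @ W)
    grad = X.T @ (p - y_onehot) / X.shape[0]
    if reg_type == 'l2':
        grad += lam * W                  # Ridge: shrink weights toward zero
    elif reg_type == 'l1':
        grad += lam * np.sign(W)         # Lasso: constant pull, encourages sparsity
    return grad
```

Subtracting the row maximum leaves the softmax output unchanged but keeps `np.exp` from overflowing on large logits, which is exactly the underflow/overflow issue the log-sum-exp trick addresses.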
Implements Gaussian Naive Bayes classifier from scratch:
- Assumes features follow normal distributions within each class
- Efficient computation of class posteriors
- Vectorized implementation for speed
Advanced features:
- Feature selection variant that selects the most informative features:
  - Uses mutual information to rank features
  - Configurable number of features to select
  - Analysis of accuracy vs. feature count
- Log-space computation:
  - Prevents numerical underflow with log-probabilities
  - Improves numerical stability
- Thorough evaluation:
  - Comparison between standard and feature-selected variants
  - Analysis of optimal feature count
  - Visualization of decision boundaries
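The Gaussian NB core with log-space posteriors can be sketched as follows; this is an illustrative class, not the repo's `nbc.py`, and the variance floor of `1e-9` is an assumed regularizer.

```python
import numpy as np

class GaussianNB:
    """Gaussian Naive Bayes sketch: per-class feature means/variances,
    posteriors computed entirely in log space (illustrative)."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        self.log_prior = np.log(np.array([(y == c).mean() for c in self.classes]))
        return self

    def predict(self, X):
        # sum of per-feature log N(x; mu, var), plus log prior, per class
        ll = -0.5 * (np.log(2 * np.pi * self.var)[None]
                     + (X[:, None, :] - self.mu[None]) ** 2 / self.var[None]).sum(-1)
        return self.classes[np.argmax(ll + self.log_prior, axis=1)]
```

Summing log-probabilities instead of multiplying probabilities is what keeps hundreds of per-feature terms from underflowing to zero.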
The system follows a modular design with three main stages:
1. Preprocessing and Feature Extraction
   - Image normalization
   - Feature extraction (HOG, basic features, PCA)
   - Feature standardization
   - Dimensionality reduction
2. Classifier Training
   - Cross-validation for hyperparameter tuning
   - Model training with optimal parameters
   - Model evaluation on the test set
3. Evaluation and Analysis
   - Accuracy, precision, recall, F1-score
   - Confusion matrix visualization
   - Class-specific performance analysis
   - Model comparison
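The evaluation metrics in the analysis stage can be sketched from the confusion matrix alone; the helper names here are illustrative, not the repo's API.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=5):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_metrics(cm):
    """Per-class precision, recall, and F1 derived from the confusion matrix."""
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # column sums: predicted counts
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # row sums: true counts
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f1
```

Accuracy is simply `np.diag(cm).sum() / cm.sum()`, so one confusion matrix per classifier is enough to fill the whole comparison table.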
```bash
# For basic features
python c_pp_features.py

# For HOG features
python c_pp_hog.py

# For PCA on raw images
python c_pp_pca.py
```

```bash
python em.py
```

```bash
# For K-Nearest Neighbors
python knn.py

# For Logistic Regression
python lrc.py

# For Naive Bayes
python nbc.py
```

Each classifier has a configuration section at the top of its file that allows customization:
```python
CONFIG = {
    'feature_type': 'hog',   # 'hog', 'pca', 'features'
    'preprocessing': 'pca',  # 'raw', 'std', 'pca'
    'cv_folds': 5,
    # ...
}
```

```python
CONFIG = {
    'feature_type': 'hog',
    'preprocessing': 'pca',
    'learning_rate': 0.01,
    'max_iter': 500,
    'batch_size': 200,
    'tol': 1e-4,
    'regularization_configs': [
        {'reg_type': None, 'reg_strength': 0.0, 'name': 'No Regularization'},
        {'reg_type': 'l2', 'reg_strength': 0.001, 'name': 'L2 (λ=0.001)'},
        # ...
    ]
}
```

```python
CONFIG = {
    'feature_type': 'pca',
    'preprocessing': 'pca',
    'feature_selection': True,
    'feature_counts_to_test': [1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 25, 30, 40, 50, 60],
    # ...
}
```

The system generates comprehensive evaluation results:
- Feature type distribution
- Class separability with different feature types
- Intra-class variations and inter-class distances
- Accuracy, training time, inference time
- Optimal hyperparameters
- Regularization effects (for logistic regression)
- Feature selection impact (for naive Bayes)
All results are saved to the model_evaluation_results directory with sub-directories for each feature type and classifier:
```
model_evaluation_results/
├── hog_pca/
│   ├── accelerated_knn/
│   ├── comparison/
│   ├── lrc/
│   └── nbc/
├── pca_pca/
│   ├── accelerated_knn/
│   ├── comparison/
│   └── nbc/
```
The system evaluation generated detailed performance metrics for all classifiers using HOG features reduced with PCA:
- KNN:
  - Best distance metric: Euclidean
  - Optimal k value: 5
  - Test accuracy: 0.9046
  - Fast inference using KD-Tree acceleration
- Logistic Regression:
  - Best regularization: L2 with λ=0.001
  - Test accuracy: 0.9084
  - Feature importance visualizations show distinctive patterns for each class
- Naive Bayes:
  - Standard Gaussian NB accuracy: 0.8872
  - Feature selection improved accuracy to 0.9026
  - Optimal feature count: 12

Performance metrics for classifiers using raw image features reduced with PCA:
- KNN:
  - Best distance metric: Cosine
  - Test accuracy: 0.8594
  - Different optimal metric than with HOG features
- Naive Bayes:
  - Feature selection significantly improved performance
  - Optimal feature count: 30
  - Test accuracy: 0.8406
The project requires the following Python libraries:
- NumPy (1.19+)
- SciPy (1.5+)
- Matplotlib (3.3+)
- scikit-learn (0.23+)
- scikit-image (0.17+)
- PIL/Pillow (8.0+)
- OpenCV (cv2) (4.0+)
- mahotas (1.4+)
- pandas (1.1+)
It is recommended to use a virtual environment to avoid conflicts with other Python projects. Here's how to set up and install all requirements:
```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install all dependencies from requirements.txt
pip install -r requirements.txt
```

After setting up the environment, you can run any of the scripts using Python:

```bash
# To run the Expectation-Maximization clustering algorithm
python em.py

# For feature extraction
python c_pp_features.py  # Basic features
python c_pp_hog.py       # HOG features
python c_pp_pca.py       # PCA on raw images

# For classification algorithms
python knn.py  # K-Nearest Neighbors
python lrc.py  # Logistic Regression
python nbc.py  # Naive Bayes
```

Each script will generate output files in their respective directories, including visualizations and processed data.
```
.
├── c_features/                # Basic features directory
│   ├── pp/                    # Preprocessing visualizations
│   └── processed_data/        # Extracted features
├── c_hog/                     # HOG features directory
│   ├── pp/                    # Preprocessing visualizations
│   └── processed_data/        # Extracted features
├── c_pca/                     # PCA features directory
│   ├── pp/                    # Preprocessing visualizations
│   └── processed_data/        # Extracted features
├── c_pp_features.py           # Basic feature extraction methods
├── c_pp_hog.py                # HOG feature extraction
├── c_pp_pca.py                # PCA on raw images
├── dataset.npy                # Processed dataset for EM
├── em.py                      # EM algorithm implementation
├── em/                        # EM visualization outputs
│   ├── cluster_assignments.png
│   ├── data_scatter_plot.png
│   ├── estimated_gaussians.png
│   └── log_likelihood_convergence.png
├── knn.py                     # KNN classifier implementation
├── lrc.py                     # Logistic Regression implementation
├── nbc.py                     # Naive Bayes implementation
├── model_evaluation_results/  # Generated evaluation results
│   ├── hog_pca/               # Results for HOG features with PCA
│   └── pca_pca/               # Results for PCA on raw images
└── quickdraw_subset_np/       # Original dataset
    ├── README.md
    ├── sample_image.png
    ├── test_images.npy
    ├── test_labels.npy
    ├── train_images.npy
    └── train_labels.npy
```
- All feature extraction methods include data standardization
- PCA applied for dimensionality reduction
- Feature outputs saved in raw, standardized, and PCA-reduced formats
- Distance calculations use vectorized operations for efficiency
- Cross-validation implemented for hyperparameter tuning
- Handles ties in voting through first-occurrence preference
- Accelerated implementation using KD-Tree and Ball Tree
- Includes L1 and L2 regularization with configurable strength
- Mini-batch gradient descent for better convergence
- Numerically stable implementations of softmax and cross-entropy
- Visualizations of class-specific feature weights
- Gaussian probability distribution estimation for each feature
- Feature selection using mutual information
- Log-space computations for numerical stability
- Analysis of optimal feature count
The classifiers show different strengths depending on the feature type:
- KNN:
  - Performs well with HOG features
  - Best distance metric depends on feature type
  - Accuracy decreases with very low or very high k values
- Logistic Regression:
  - Strong performance with regularization
  - L2 regularization generally outperforms L1
  - More sensitive to feature quality than KNN
- Naive Bayes:
  - Feature selection significantly improves performance
  - Works well with independent features
  - Fastest inference time of all classifiers
Overall, the combination of HOG features with KNN or logistic regression tends to provide the best accuracy, while Naive Bayes with feature selection offers the best accuracy/speed tradeoff.
