Lancekeepsforward/Unsupervised-Clustering_of_Literary_Authors

Unsupervised Clustering of Literary Authors

Project Overview

This project implements various unsupervised learning techniques to cluster and analyze literary authors based on their writing styles and characteristics. The analysis includes dimensionality reduction methods, clustering algorithms, and comprehensive evaluation metrics.

Project Structure

Unsupervised-Clustering_of_Literary_Authors/
├── Brief_Introduction_And_Equation_Derivation.pdf  # Theoretical background and equations
├── code.ipynb                                      # Main Jupyter notebook with implementation
├── code_colab.pdf                                  # Code documentation
├── targets.pdf                                     # Project objectives and targets
└── pics/                                          # Generated visualizations
    ├── a/                                         # Dimensionality reduction and analysis plots
    ├── b/                                         # Clustering algorithm results
    ├── c/                                         # Feature analysis and contributions
    ├── d/                                         # Evaluation metrics and stability analysis
    └── second_problem_b/                          # Additional problem analysis

Methodology

1. Dimensionality Reduction Techniques

The project implements several dimensionality reduction methods to analyze author characteristics:

Principal Component Analysis (PCA)

  • PCA with Covariance Matrix: Standard PCA computed from the eigendecomposition of the feature covariance matrix
  • PCA with Linear Regression: PCA combined with linear regression analysis
  • Kernel PCA:
    • Polynomial Kernel: Non-linear dimensionality reduction using polynomial kernels
    • RBF Kernel: Radial Basis Function kernel for non-linear transformations
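
As an illustration of the covariance-matrix variant listed above, here is a minimal NumPy sketch. It is not taken from code.ipynb: the function name `pca_cov` and the random author-by-feature matrix are assumptions made for the example.

```python
import numpy as np

def pca_cov(X, n_components=2):
    """Project the rows of X onto the top principal components.

    Hypothetical helper: PCA via eigendecomposition of the covariance matrix.
    """
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(Xc, rowvar=False)             # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]             # columns are principal directions
    explained = eigvals[order] / eigvals.sum() # fraction of variance per component
    return Xc @ components, components, explained

# Synthetic stand-in for an author-by-feature count matrix (assumed data).
rng = np.random.default_rng(0)
X = rng.poisson(lam=5.0, size=(40, 10)).astype(float)

scores, components, explained = pca_cov(X, n_components=2)
print(scores.shape, explained.round(3))
```

The `scores` array holds the 2-D coordinates that a plot such as PCA_COV.png would typically show, one row per author.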

Non-negative Matrix Factorization (NMF)

  • Topic modeling approach to extract author writing patterns
  • Analysis of topic vector weights and contributions
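
A rough sketch of how topic weights could be extracted with scikit-learn's `NMF` follows. The matrix `X`, the number of topics, and the solver settings are assumptions for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic non-negative author-by-word count matrix (assumed data).
rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(30, 12)).astype(float)

# Factorize X ~ W @ H with 3 hypothetical topics.
model = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)   # author-by-topic weights
H = model.components_        # topic-by-word weights

# For each topic, indices of the four words with the largest weight.
top_features = np.argsort(H, axis=1)[:, ::-1][:, :4]
print(W.shape, H.shape)
print(top_features)
```

Rows of `W` describe how strongly each author loads on each topic, and rows of `H` show which words define a topic.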

UMAP (Uniform Manifold Approximation and Projection)

  • Non-linear manifold learning technique that aims to preserve local neighborhoods while retaining much of the global structure

Biclustering

  • Simultaneous clustering of authors and features
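
One way to bicluster rows and columns together is scikit-learn's `SpectralCoclustering`. The README does not say which biclustering algorithm the notebook uses, so this choice, and the synthetic block-structured matrix, are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Synthetic author-by-feature matrix with two planted blocks (assumed data).
rng = np.random.default_rng(0)
X = rng.random((20, 8)) * 0.1
X[:10, :4] += 1.0   # first group of authors uses the first four features heavily
X[10:, 4:] += 1.0   # second group uses the remaining features

model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(X)

row_labels = model.row_labels_        # bicluster index per author
column_labels = model.column_labels_  # bicluster index per feature
print(row_labels)
print(column_labels)
```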

2. Clustering Algorithms

Traditional Clustering Methods

  • K-means: Standard centroid-based clustering
  • K-medoids: Uses actual data points (medoids) as cluster centers, which makes it less sensitive to outliers than K-means
  • Hierarchical Clustering:
    • Average Linkage: Merges clusters by the average pairwise distance between their points
    • Complete Linkage: Merges clusters by the maximum pairwise distance between their points
    • Ward Method: Minimizes within-cluster variance
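
The three linkage criteria can be compared with SciPy, which the dependency list names for hierarchical clustering. This is a sketch on synthetic two-group data with an assumed cut at two clusters, not the notebook's actual configuration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated synthetic groups of 10 points each (assumed data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(5, 0.3, (10, 2))])

labels_by_method = {}
for method in ("average", "complete", "ward"):
    Z = linkage(X, method=method)  # Euclidean distance by default
    # Cut the dendrogram so that at most two flat clusters remain.
    labels_by_method[method] = fcluster(Z, t=2, criterion="maxclust")

for method, labels in labels_by_method.items():
    print(method, labels)
```

The linkage matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` to draw a tree like the one in Hierarchical_Clustering.png.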

Advanced Clustering Methods

  • Gaussian Mixture Models (GMM): Probabilistic clustering approach
  • Laplacian Graph K-means: Spectral clustering using graph Laplacian
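
A minimal sketch of the Laplacian-based approach: build an affinity graph, embed the points with the leading eigenvectors of the normalized graph Laplacian, then run K-means in that embedding. The Gaussian affinity, the `sigma` value, and the helper name `laplacian_kmeans` are assumptions. The notebook may construct its graph differently.

```python
import numpy as np
from sklearn.cluster import KMeans

def laplacian_kmeans(X, n_clusters=2, sigma=1.0, seed=0):
    """Hypothetical helper: spectral clustering via the normalized Laplacian."""
    # Gaussian (RBF) affinity between every pair of points, no self-loops.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Eigenvectors for the smallest eigenvalues form the spectral embedding.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :n_clusters]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize

    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(U)

# Two well-separated synthetic groups (assumed data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(5, 0.3, (10, 2))])
labels = laplacian_kmeans(X, n_clusters=2)
print(labels)
```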

3. Evaluation and Analysis

Clustering Evaluation Metrics

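  • Silhouette Score: Compares each sample's distance to its own cluster with its distance to the nearest other cluster (values range from -1 to 1, higher is better)
  • Stability Analysis: Assesses how consistent the cluster assignments are across different runs
  • Consensus Matrix: Shows how often each pair of samples is assigned to the same cluster across runs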
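
The metrics above can be sketched as follows. The project's plots evaluate K-medoids, but this example substitutes scikit-learn's `KMeans` to stay within the listed dependencies. The subsampling fraction, number of runs, and helper name `consensus_matrix` are likewise assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def consensus_matrix(X, n_clusters, n_runs=20, frac=0.8, seed=0):
    """Hypothetical helper: fraction of runs in which each pair co-clusters."""
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))  # times a pair landed in the same cluster
    sampled = np.zeros((n, n))   # times a pair appeared in the same subsample
    for run in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=run).fit_predict(X[idx])
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same
        sampled[np.ix_(idx, idx)] += 1.0
    # Pairs that were never sampled together get a consensus of 0.
    return np.divide(together, sampled,
                     out=np.zeros_like(together), where=sampled > 0)

# Two well-separated synthetic groups of 15 points each (assumed data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(5, 0.3, (15, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)
C = consensus_matrix(X, n_clusters=2)
print(round(score, 3), C.shape)
```

A consensus matrix whose entries sit close to 0 or 1 indicates stable assignments, while many intermediate values suggest the chosen number of clusters is not well supported.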

Feature Analysis

  • Letter Distribution Analysis: Character frequency patterns
  • Word Contribution Analysis: Most discriminative words per cluster
  • Principal Component Contribution: Feature importance in reduced dimensions
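
A letter-frequency feature vector of the kind described above could be computed as in this sketch. The restriction to the 26 ASCII letters, the sample sentence, and the function name `letter_distribution` are assumptions. The notebook's preprocessing is not documented here.

```python
from collections import Counter
import string

def letter_distribution(text):
    """Hypothetical helper: relative frequency of each letter a-z in text."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    return {
        letter: (counts[letter] / total if total else 0.0)
        for letter in string.ascii_lowercase
    }

sample = "It was the best of times, it was the worst of times."
dist = letter_distribution(sample)

# Show the five most frequent letters in the sample.
for letter, freq in sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(letter, round(freq, 3))
```

Stacking one such 26-value vector per author yields a matrix that can be fed to the dimensionality reduction and clustering methods listed earlier.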

Key Visualizations

Dimensionality Reduction Results

  • PCA_COV.png - PCA with covariance matrix results
  • PCA_LR.png - PCA with linear regression analysis
  • KPCA_POLY.png - Kernel PCA with polynomial kernel
  • KPCA_BRF.png - Kernel PCA with RBF kernel
  • NMF.png - Non-negative Matrix Factorization results
  • UMAP.png - UMAP dimensionality reduction

Clustering Results

  • Kmeans.png - K-means clustering results
  • Kmediods.png - K-medoids clustering results
  • Hierarchical_Clustering.png - Hierarchical clustering dendrogram
  • GMM.png - Gaussian Mixture Model results
  • Laplacian_graph_kmeans.png - Spectral clustering results

Evaluation Metrics

  • K_mediods_silhouette.png - Silhouette analysis for K-medoids
  • k_mediods_consensus_matrix.png - Consensus matrix visualization
  • k_mediods_stability.png - Stability analysis results

Feature Analysis

  • 各个分类的词贡献.png - Word contributions by cluster
  • 排序2PC贡献分析.png - Ranked feature contributions to the first two principal components
  • 全部letter的分布1.png - Overall letter distribution
  • 最简单的分析1.png - Basic analysis results

Technical Implementation

Dependencies

The project likely uses the following Python libraries:

  • scikit-learn: For clustering algorithms and dimensionality reduction
  • numpy: For numerical computations
  • pandas: For data manipulation
  • matplotlib/seaborn: For visualization
  • umap-learn: For UMAP implementation
  • scipy: For hierarchical clustering

Key Features

  1. Comprehensive Analysis: Multiple approaches to understand author clustering
  2. Robust Evaluation: Multiple metrics to assess clustering quality
  3. Visualization: Extensive plotting for result interpretation
  4. Comparative Study: Comparison between different algorithms and methods

Results and Insights

The project provides insights into:

  • Author writing style similarities and differences
  • Optimal number of clusters for author classification
  • Most discriminative features for author identification
  • Stability and reliability of different clustering approaches
  • Effectiveness of various dimensionality reduction techniques

Usage

  1. Open code.ipynb in Jupyter Notebook or Google Colab
  2. Run the cells sequentially to reproduce the analysis
  3. Refer to Brief_Introduction_And_Equation_Derivation.pdf for theoretical background
  4. Check targets.pdf for specific project objectives

Academic Context

This project appears to be part of a Statistical Machine Learning course (5241) at Columbia University, focusing on unsupervised learning techniques and their application to literary analysis.

Future Work

Potential extensions could include:

  • Analysis of temporal writing style evolution
  • Cross-lingual author clustering
  • Integration of semantic features
  • Deep learning approaches for author clustering
  • Real-time author identification systems

This project demonstrates the application of various unsupervised learning techniques to literary analysis, providing valuable insights into author clustering and writing style analysis.
