This project implements various unsupervised learning techniques to cluster and analyze literary authors based on their writing styles and characteristics. The analysis includes dimensionality reduction methods, clustering algorithms, and comprehensive evaluation metrics.
Unsupervised_Clustering_of_Literary_Authors/
├── Brief_Introduction_And_Equation_Derivation.pdf # Theoretical background and equations
├── code.ipynb # Main Jupyter notebook with implementation
├── code_colab.pdf # Code documentation
├── targets.pdf # Project objectives and targets
└── pics/ # Generated visualizations
    ├── a/ # Dimensionality reduction and analysis plots
    ├── b/ # Clustering algorithm results
    ├── c/ # Feature analysis and contributions
    ├── d/ # Evaluation metrics and stability analysis
    └── second_problem_b/ # Additional problem analysis
The project implements several dimensionality reduction methods to analyze author characteristics:
- PCA with Covariance Matrix: Standard PCA implementation using the covariance matrix
- PCA with Linear Regression: PCA combined with linear regression analysis
- Kernel PCA:
  - Polynomial Kernel: Non-linear dimensionality reduction using polynomial kernels
  - RBF Kernel: Radial basis function kernel for non-linear transformations
- Non-negative Matrix Factorization (NMF):
  - Topic modeling approach to extract author writing patterns
  - Analysis of topic vector weights and contributions
- UMAP: Modern dimensionality reduction technique that preserves both local and global structure
- Co-clustering: Simultaneous clustering of authors and features
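The dimensionality reduction methods above can be sketched with scikit-learn. This is an illustrative example, not the project's actual code: the data matrix, shapes, and kernel parameters are assumptions, and UMAP is omitted because it requires the separate umap-learn package.

```python
# Illustrative sketch: the listed dimensionality reduction methods applied
# to a synthetic, non-negative author-by-feature matrix.
import numpy as np
from sklearn.decomposition import NMF, PCA, KernelPCA

rng = np.random.default_rng(0)
X = rng.random((50, 100))  # 50 authors x 100 stylistic features (assumed shape)

# Standard PCA (scikit-learn centers the data and works from the covariance structure)
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA with polynomial and RBF kernels for non-linear structure
X_poly = KernelPCA(n_components=2, kernel="poly", degree=3).fit_transform(X)
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=0.1).fit_transform(X)

# NMF as topic modeling: W holds per-author topic weights,
# nmf.components_ holds per-topic feature contributions
nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)

print(X_pca.shape, X_poly.shape, X_rbf.shape, W.shape)
```

Each method projects the 100-dimensional feature space down to two components suitable for the 2-D scatter plots in `pics/a/`.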
The following clustering algorithms are applied to group the authors:
- K-means: Standard centroid-based clustering
- K-medoids: Medoid-based clustering for robustness
- Hierarchical Clustering:
  - Average Linkage: Average distance between clusters
  - Complete Linkage: Maximum distance between clusters
  - Ward Method: Minimizes within-cluster variance
- Gaussian Mixture Models (GMM): Probabilistic clustering approach
- Laplacian Graph K-means: Spectral clustering using graph Laplacian
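The clustering algorithms above can be sketched as follows. The data and cluster count (k = 3) are illustrative assumptions; K-medoids is omitted because it lives in the separate scikit-learn-extra package.

```python
# Illustrative sketch of the listed clustering algorithms on synthetic
# 2-D author embeddings (e.g. the output of a dimensionality reduction step).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three well-separated synthetic author groups, 20 points each
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0.0, 2.0, 4.0)])

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# Hierarchical clustering with average, complete, and Ward linkage
for method in ("average", "complete", "ward"):
    Z = linkage(X, method=method)                      # build the dendrogram
    hier_labels = fcluster(Z, t=3, criterion="maxclust")  # cut into <= 3 clusters

# Spectral clustering: K-means on eigenvectors of the graph Laplacian
# of a nearest-neighbour similarity graph
spectral_labels = SpectralClustering(
    n_clusters=3, affinity="nearest_neighbors", random_state=0
).fit_predict(X)
```

On well-separated data like this, all five methods should recover essentially the same three groups; the interesting comparisons in the project arise on the real, noisier author features.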
Clustering results are evaluated with the following metrics:
- Silhouette Score: Measures clustering quality and cluster separation
- Stability Analysis: Assesses clustering consistency across different runs
- Consensus Matrix: Visualizes clustering agreement
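These evaluation ideas can be sketched as follows, using synthetic data; the names and the choice of repeated K-means runs as the stability mechanism are assumptions, not necessarily what the notebook does.

```python
# Illustrative sketch: silhouette score, plus a consensus matrix built
# from repeated K-means runs as a simple stability check.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(15, 2)) for c in (0.0, 3.0)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)  # near 1 for compact, well-separated clusters

# Consensus matrix: fraction of runs in which each pair of points falls
# in the same cluster; entries near 0 or 1 indicate a stable clustering.
n, runs = len(X), 20
consensus = np.zeros((n, n))
for seed in range(runs):
    run_labels = KMeans(n_clusters=2, n_init=5, random_state=seed).fit_predict(X)
    consensus += run_labels[:, None] == run_labels[None, :]
consensus /= runs
print(score, consensus.mean())
```

Plotting `consensus` as a heatmap with the points ordered by cluster gives the kind of block-diagonal visualization shown in `k_mediods_consensus_matrix.png`.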
Feature-level analyses include:
- Letter Distribution Analysis: Character frequency patterns
- Word Contribution Analysis: Most discriminative words per cluster
- Principal Component Contribution: Feature importance in reduced dimensions
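The word-contribution and component-contribution analyses above can be sketched like this. The synthetic matrix and `word_i` feature names are invented for illustration; one plausible approach (assumed here) reads top features off K-means centroids and PCA loadings.

```python
# Illustrative sketch: top contributing features per cluster (K-means
# centroids) and per principal component (absolute PCA loadings).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((30, 8))                    # 30 authors x 8 word features (assumed)
features = [f"word_{i}" for i in range(8)]

# Most discriminative words per cluster: largest centroid weights
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
for c, centroid in enumerate(km.cluster_centers_):
    top = [features[i] for i in np.argsort(centroid)[::-1][:3]]
    print(f"cluster {c}: {top}")

# Feature importance in the reduced space: absolute PCA loadings
pca = PCA(n_components=2).fit(X)
for c, comp in enumerate(pca.components_):
    top = [features[i] for i in np.argsort(np.abs(comp))[::-1][:3]]
    print(f"PC{c + 1}: {top}")
```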
- `PCA_COV.png`: PCA with covariance matrix results
- `PCA_LR.png`: PCA with linear regression analysis
- `KPCA_POLY.png`: Kernel PCA with polynomial kernel
- `KPCA_BRF.png`: Kernel PCA with RBF kernel
- `NMF.png`: Non-negative Matrix Factorization results
- `UMAP.png`: UMAP dimensionality reduction
- `Kmeans.png`: K-means clustering results
- `Kmediods.png`: K-medoids clustering results
- `Hierarchical_Clustering.png`: Hierarchical clustering dendrogram
- `GMM.png`: Gaussian Mixture Model results
- `Laplacian_graph_kmeans.png`: Spectral clustering results
- `K_mediods_silhouette.png`: Silhouette analysis for K-medoids
- `k_mediods_consensus_matrix.png`: Consensus matrix visualization
- `k_mediods_stability.png`: Stability analysis results
- `各个分类的词贡献.png`: Word contributions by cluster
- `排序2PC贡献分析.png`: Principal component contribution analysis
- `全部letter的分布1.png`: Overall letter distribution
- `最简单的分析1.png`: Basic analysis results
The project likely uses the following Python libraries:
- scikit-learn: For clustering algorithms and dimensionality reduction
- numpy: For numerical computations
- pandas: For data manipulation
- matplotlib/seaborn: For visualization
- umap-learn: For UMAP implementation
- scipy: For hierarchical clustering
Key features of the project:
- Comprehensive Analysis: Multiple approaches to understanding author clustering
- Robust Evaluation: Multiple metrics to assess clustering quality
- Visualization: Extensive plotting for result interpretation
- Comparative Study: Comparison between different algorithms and methods
The project provides insights into:
- Author writing style similarities and differences
- Optimal number of clusters for author classification
- Most discriminative features for author identification
- Stability and reliability of different clustering approaches
- Effectiveness of various dimensionality reduction techniques
- Open `code.ipynb` in Jupyter Notebook or Google Colab
- Run the cells sequentially to reproduce the analysis
- Refer to `Brief_Introduction_And_Equation_Derivation.pdf` for theoretical background
- Check `targets.pdf` for specific project objectives
This project appears to be part of a Statistical Machine Learning course (5241) at Columbia University, focusing on unsupervised learning techniques and their application to literary analysis.
Potential extensions could include:
- Analysis of temporal writing style evolution
- Cross-lingual author clustering
- Integration of semantic features
- Deep learning approaches for author clustering
- Real-time author identification systems
This project demonstrates the application of various unsupervised learning techniques to literary analysis, providing valuable insights into author clustering and writing style analysis.