GitHub - anubane/clustering20newsgroup: Applying K-Means algorithm to the 20 newsgroup data (link in readme)

Clustering 20 newsgroup data

Applying the K-Means algorithm to the 20 Newsgroups data (link: http://qwone.com/~jason/20Newsgroups/)

Step 1: From the already classified data, I selected 30 documents from each group and mixed them together under the directory workdata.
Step 2: Using the tf-idf model, features are extracted from the documents to form the document vectors.
Step 3: The K-Means algorithm is applied to these n-dimensional document vectors.
Step 4: The cluster-membership labels produced by K-Means, together with the document vectors, are passed to the LDA function.
Step 5: Linear discriminant analysis reduces the n-dimensional vectors to 2 dimensions, and the clusters are plotted as a scatter plot (Cluster_Visualization.png).
Step 6: The list of documents in each cluster is also output (sample available in Clusters_of_Docs.txt).
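
Steps 2 to 6 can be sketched roughly as follows. This is a minimal illustration that assumes scikit-learn (and matplotlib for the plot); the function names and parameters are hypothetical and are not taken from the repository's feature.py or model.py, which may work differently.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_documents(docs, n_clusters, seed=0):
    """Steps 2-3: build tf-idf vectors, then cluster them with K-Means."""
    # Assumption: scikit-learn's built-in English stop-word list stands in
    # for the repo's own stopwords file.
    vectorizer = TfidfVectorizer(stop_words="english")
    vectors = vectorizer.fit_transform(docs)  # sparse, shape (n_docs, n_terms)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(vectors)
    return vectors, labels


def project_to_2d(vectors, labels):
    """Steps 4-5: fit LDA on the cluster labels and reduce to 2 dimensions.

    LDA yields at most (number of clusters - 1) components, so at least
    3 clusters are needed for a 2-D projection.
    """
    lda = LinearDiscriminantAnalysis(n_components=2)
    return lda.fit_transform(vectors.toarray(), labels)


def save_scatter(points, labels, path):
    """Step 5: scatter plot of the 2-D points, coloured by cluster."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], c=labels)
    fig.savefig(path)
    plt.close(fig)


def docs_by_cluster(names, labels):
    """Step 6: group document names by their cluster label."""
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[int(label)].append(name)
    return dict(groups)
```

With `docs` as a list of document strings and `names` as the matching file names, a call such as `vectors, labels = cluster_documents(docs, n_clusters=20)` followed by `project_to_2d(vectors, labels)` would give the 2-D coordinates to plot. The choice of 20 clusters here is only an assumption that mirrors the number of newsgroups.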

.gitignore
Cluster_Visualization.png
Clusters_of_Docs.txt
LICENSE
README.md
assign6.py
feature.py
model.py
stopwords