vignette-clustering-methods

Vignette on implementing clustering methods (hierarchical, model-based, density-based) using unlabeled country data; created as a class project for PSTAT197A in Fall 2022.

Contributors

KunXiao Gao, Justin Liu, Ruoxin Wang, Kassandra Trejo

Vignette Abstract

Clustering partitions the observations in a data set into distinct groups without being given labels beforehand. As an unsupervised learning technique, its goal is not to generate predictions but to draw inferences from the data. This vignette covers three types of clustering methods. In hierarchical clustering, the distance between observations determines which cluster each observation falls into; we use Euclidean distance as our metric. In model-based clustering, clusters are formed by fitting a probability distribution to the data; we demonstrate this using Gaussian mixture models. In density-based clustering, observations are grouped in regions where many points lie close together; we use DBSCAN to illustrate this. Unlike model-based clustering, density-based clustering is non-parametric because it does not assume that the points come from a predetermined probability distribution. We implemented these three methods to perform unsupervised classification on an unlabeled country data set. Overall, we found that model-based clustering gave us the most detailed clusters while still maintaining a good level of interpretability.
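To give a feel for the three approaches, here is a minimal sketch on simulated two-dimensional data. It is not the code from the vignette and does not use the country data set. The linkage method, the number of clusters, and the `eps` and `minPts` values are arbitrary choices made for illustration. The model-based and density-based steps assume the `mclust` and `dbscan` packages are installed and are skipped otherwise.

```r
# Illustrative sketch on simulated data (not the country data set in this repository).
set.seed(1)
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),  # 30 points around (0, 0)
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2)   # 30 points around (4, 4)
)
x <- scale(x)  # standardize so no variable dominates the Euclidean distance

# Hierarchical clustering: Euclidean distances, complete linkage, tree cut into 2 groups
hc <- hclust(dist(x, method = "euclidean"), method = "complete")
hc_labels <- cutree(hc, k = 2)

# Model-based clustering: Gaussian mixture model, number of components chosen by BIC
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)  # attached so that Mclust() can find its internal helpers
  gmm <- Mclust(x, G = 1:4, verbose = FALSE)
  gmm_labels <- gmm$classification
}

# Density-based clustering: DBSCAN; a label of 0 marks a point treated as noise
if (requireNamespace("dbscan", quietly = TRUE)) {
  db <- dbscan::dbscan(x, eps = 0.5, minPts = 5)
  db_labels <- db$cluster
}

table(hc_labels)  # cluster sizes from the hierarchical fit
```

Note that only the hierarchical step requires the number of clusters to be fixed up front here; the mixture model selects it from the candidate range `G`, and DBSCAN derives it from the density parameters.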

Repository Contents

The vignette files (vignette.Rmd and vignette.html) can be found in the root directory of this repository. The vignette-clustering-methods.Rproj file opens the R project and sets the working directory.

The data folder includes the raw country data set used in the vignette (country-data.csv) and its corresponding codebook (data-dictionary.csv), both of which were downloaded from Kaggle.
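As a quick check that the data files are in place, the sketch below reads both CSV files from the R console. It assumes the working directory is the repository root; the object names `country` and `codebook` are hypothetical and chosen for illustration only.

```r
# Assumes the working directory is the repository root.
data_path     <- file.path("data", "country-data.csv")
codebook_path <- file.path("data", "data-dictionary.csv")

if (file.exists(data_path) && file.exists(codebook_path)) {
  country  <- read.csv(data_path)      # raw country data
  codebook <- read.csv(codebook_path)  # variable descriptions
  str(country)                         # inspect column names and types
} else {
  message("Data files not found; check the working directory.")
}
```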

The scripts folder includes a script containing all of the code from the vignette (vignette-script.R) as well as a drafts subfolder containing any drafts of our code.

The img folder includes the images used in the vignette.

Instructions

Clone the repository or download it as a ZIP file. Once it is on your local machine, open vignette.html to view the vignette in your web browser. To run the code in vignette.Rmd or scripts/vignette-script.R, first open vignette-clustering-methods.Rproj so that the working directory is set to the repository root.
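If you prefer the R console to clicking through files, the following sketch shows one way to re-knit the vignette. It assumes the working directory is the repository root and that the `rmarkdown` package, along with the packages the vignette itself loads, is installed. Rendering overwrites the existing vignette.html.

```r
# Assumes the working directory is the repository root.
rmd <- "vignette.Rmd"

if (file.exists(rmd) && requireNamespace("rmarkdown", quietly = TRUE)) {
  rmarkdown::render(rmd)  # re-knit the vignette; writes vignette.html next to it
} else {
  message("vignette.Rmd or the rmarkdown package was not found.")
}
```

The standalone script can be run in a similar way with `source("scripts/vignette-script.R")`.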

References

Many websites and textbooks cover the clustering methods in this vignette in depth. Here are a few that we consulted:

Data set
- [Data] "Unsupervised Learning on Country Data" | Kaggle
Hierarchical clustering
- [Textbook] An Introduction to Statistical Learning, Chapter 10.3.2
- [Article] "How Many Clusters?" | Satoru Hayasaka
Model-based clustering
Density-based clustering
Analysis
- [Article] "The Demographic Transition Model" | Prateek Agarwal
