GitHub - elmurod1202/survey-clustering: K-means clustering of texts (survey answers) using word-embeddings, finding optimal elbow-point, and averaging multiple-word expressions.

K-means Clustering of Survey Answers

Survey answers in Spanish are converted to vectors using word embeddings and grouped using k-means clustering.
This project also handles multi-word expressions by removing stopwords and averaging the vectors of the remaining words.
The example targets Spanish, but it can easily be adapted to other languages.
The number of clusters is chosen with an optimal-curvature-finding algorithm (the elbow method).

Table of Contents

About The Project
- Built With
How it works
Usage
License
Acknowledgments

About The Project

This project was created for people who need to group or categorize words, or even multi-word expressions, by their meaning. Many tools and services compute statistics or draw diagrams from data, but they mostly work with numbers. When it comes to words or short texts, those tools are less useful because they offer no way to visualize the data in 2D/3D space based on usage or meaning. This repository helps to perform the following operations on texts:

Visualizing texts:
- Visualizing single words using the word-embedding vectors of a language;
- Visualizing multi-word texts by averaging the vectors of the words they contain (stopwords are removed for better output quality);
Finding the optimal number of groups/clusters/categories to split words/texts into based on their meaning, using the Within-Cluster Sum of Squares (WCSS) to find where the curve levels off (the elbow method);
Grouping/clustering texts using the k-means clustering algorithm;
Visualizing grouped texts in different colors, using matplotlib.
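
The averaging step for multi-word texts can be sketched as follows. This is a minimal, dependency-free illustration and not the repository's actual code: the toy 3-dimensional vectors, the stopword set, and the helper name `average_vector` are all assumptions made for the example.

```python
# Toy 3-dimensional embeddings (made up); a real run would load them
# from a word-vector file.
TOY_VECTORS = {
    "precio": [1.0, 0.0, 0.0],
    "alto": [0.0, 1.0, 0.0],
    "servicio": [0.0, 0.0, 1.0],
}
# Tiny illustrative stopword set; the project reads its list from a file.
STOPWORDS = {"el", "la", "es", "muy", "de"}


def average_vector(text, vectors, stopwords):
    """Return the mean vector of the known, non-stopword tokens, or None."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return None  # every token was a stopword or missing from the vocabulary
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


# Only "precio" and "alto" survive stopword removal, so the result is their mean.
print(average_vector("el precio es muy alto", TOY_VECTORS, STOPWORDS))
```

Returning `None` for texts with no usable tokens lets the caller decide whether to skip such answers or report them.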


Built With

Programming language used:

Python

These are the major libraries used inside Python:


How it works

First, the code loads the list of words/texts from the given file, input/answers.txt (it is called answers here because the texts were answers to a particular survey), and obtains the vectors of those texts. An example diagram would look like this:

Then, the code obtains the optimal number of clusters to split the given texts into, using the elbow method. For our example it would look like this:
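
One simple way to locate an elbow is to pick the point on the WCSS curve that lies farthest from the straight line joining its first and last points. The sketch below shows that heuristic only; it is not necessarily the algorithm the repository uses, the function name `elbow_point` is hypothetical, and the WCSS values are made up.

```python
def elbow_point(ks, wcss):
    """Pick the k whose (k, WCSS) point is farthest from the first-to-last chord."""
    x1, y1 = ks[0], wcss[0]
    x2, y2 = ks[-1], wcss[-1]
    best_k, best_dist = ks[0], -1.0
    for k, w in zip(ks, wcss):
        # Unnormalised point-to-line distance; the shared denominator is
        # dropped because it does not change which point is farthest.
        dist = abs((y2 - y1) * k - (x2 - x1) * w + x2 * y1 - y2 * x1)
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k


ks = [1, 2, 3, 4, 5, 6]
wcss = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]  # made-up values for illustration
print(elbow_point(ks, wcss))  # the curve flattens after k = 3
```

Because the two axes have different scales, this raw-distance version can be sensitive to the magnitude of the WCSS values; rescaling both axes to the range 0 to 1 first is a common refinement.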

Lastly, the code categorizes the list of texts into groups by their meaning. The final result would look like this:
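
For readers unfamiliar with k-means, the following dependency-free sketch shows the assign-and-update loop and how the WCSS is computed from the result. It is an illustration under stated assumptions, not the repository's implementation: the function names, the seeded random initialisation, and the four toy 2D points are invented for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, wcss)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                dim = len(members[0])
                new_centroids.append(
                    [sum(m[i] for m in members) / len(members) for i in range(dim)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    # WCSS: total squared distance from each point to its cluster centroid.
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss


# Two obvious groups of toy 2D points.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids, wcss = kmeans(points, k=2)
print(labels, wcss)
```

Running `kmeans` for a range of `k` values and collecting the returned `wcss` gives the curve on which the elbow is searched.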


Usage

To use this code you need at least a basic understanding of how to run Python code, and a machine with Python installed. You should also install the libraries mentioned above. There are two ways to run this code:

Either clone the repo by running the command below, and run survey-clustering.py:
```
git clone https://github.com/elmurod1202/survey-clustering.git
```
Or download only the survey-clustering.py file (or survey-clustering-minimum.py if you want minimal working code without graphical visualizations) and make some small changes, such as where to read the files from and where to store the results. That's it.

IMPORTANT: This code uses a Spanish word-embedding vector file that is not included here due to its large size. Please download the file into the src/ folder from the link: Spanish word vectors (3.4 GB)

Adapting for other languages:

This code is intended for Spanish, but it can be adapted to many other languages just by changing two files in the src/ folder:

src/embeddings-l-model.vec : replace the Spanish word-vector file with a word-vector file for the target language;
src/spanish-stopwords.txt : replace the Spanish stopwords file with a stopwords file for the target language.
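
When swapping in files for another language, it can help to check that they parse as expected. The sketch below assumes the vector file follows the common word2vec text layout (an optional `count dim` header line, then one `word v1 v2 ...` line per word) and that the stopwords file has one word per line; neither layout is confirmed by this README, and the helper names `load_vec` and `load_stopwords` are hypothetical.

```python
import os
import tempfile


def load_vec(path, limit=None):
    """Parse a text-format vector file into a {word: [floats]} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            parts = line.rstrip().split(" ")
            if line_number == 0 and len(parts) == 2:
                continue  # header line: vocabulary size and dimensionality
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(value) for value in parts[1:]]
            if limit is not None and len(vectors) >= limit:
                break  # stop early; useful for multi-gigabyte files
    return vectors


def load_stopwords(path):
    """Read one stopword per line into a lowercase set."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


# Demonstration on tiny temporary files instead of the real src/ files.
with tempfile.TemporaryDirectory() as folder:
    vec_path = os.path.join(folder, "toy.vec")
    stop_path = os.path.join(folder, "toy-stopwords.txt")
    with open(vec_path, "w", encoding="utf-8") as handle:
        handle.write("2 3\nhola 0.1 0.2 0.3\nmundo 0.4 0.5 0.6\n")
    with open(stop_path, "w", encoding="utf-8") as handle:
        handle.write("el\nLa\n\nde\n")
    toy_vectors = load_vec(vec_path)
    toy_stopwords = load_stopwords(stop_path)

print(len(toy_vectors), sorted(toy_stopwords))
```

The `limit` parameter is there because full vector files can be several gigabytes; loading only the first entries is often enough for a quick format check.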


License

Distributed under the GNU GENERAL PUBLIC LICENSE. See LICENSE.txt for more information.


Acknowledgments

Big shoutouts to Luis for bringing this problem to the table.

We are grateful for these resources and tutorials for making this repository possible:

