An NLP project to cluster YouTube comments on the basis of their similarity of words
An NLP project in Python 3 that clusters the comments made on a particular YouTube video into distinct groups based on the similarity of their words, and visualises the results using word clouds and a bar graph. It primarily relies on k-Means clustering and tf-idf.
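To give a rough feel for how tf-idf and k-Means fit together, here is a minimal pure-Python sketch. It is not the project's actual code (which may well use a library instead); the function names `tfidf_vectors` and `kmeans`, the smoothed idf formula, and the deterministic farthest-point initialisation are all assumptions made to keep the example small and reproducible.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic words only (a deliberately simple tokenizer).
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Turn each document into an L2-normalised tf-idf vector."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed idf (an assumed variant; other formulas exist).
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-Means with a deterministic farthest-point start (assumption)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Pick the vector that is farthest from its nearest existing centroid.
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


comments = [
    "great song love it",
    "love this song so much",
    "great song love the music",
    "terrible audio quality",
    "audio quality is terrible",
    "bad audio terrible quality",
]
vocab, vectors = tfidf_vectors(comments)
labels = kmeans(vectors, k=2)
print(labels)
```

On this toy input the praise comments and the audio complaints share no words, so they should end up in separate clusters. Real comments overlap far more, which is where pre-processing and the choice of `k` start to matter.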
Sample word clouds and bar graph for analysing the clustered comments (comments taken from this video)
This video inspired me to create something like this sometime in the future. And who knew this would be the best time to begin fulfilling that long-held wish!
After a few days of pondering, I hit upon the idea of clustering YouTube comments.
Asked Why? 🤔
- Firstly, it could help one identify the genre of comments made most often on a particular video, and
- Secondly, how many people resonated with them (i.e. which kinds of comments were liked the most)
A simple yet effective way to analyse people's reviews and opinions on a particular video. Sounds fair and square?
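Both questions above boil down to a small aggregation once every comment carries a cluster label and a like count. The sketch below assumes the data is available as `(cluster, likes)` pairs; the function name `summarise_clusters` and the sample numbers are made up for illustration.

```python
from collections import defaultdict


def summarise_clusters(labelled_comments):
    """Count comments and sum likes per cluster label."""
    summary = defaultdict(lambda: {"comments": 0, "likes": 0})
    for cluster, likes in labelled_comments:
        summary[cluster]["comments"] += 1
        summary[cluster]["likes"] += likes
    return dict(summary)


# Hypothetical (cluster label, like count) pairs, one per comment.
sample = [(0, 12), (1, 3), (0, 40), (2, 0), (1, 90), (0, 1)]
summary = summarise_clusters(sample)

# Which kind of comment was made the most, and which was liked the most?
most_common = max(summary, key=lambda c: summary[c]["comments"])
most_liked = max(summary, key=lambda c: summary[c]["likes"])
print(most_common, most_liked)
```

The two answers need not agree: a cluster with only a few comments can still collect the most likes.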
Click here to navigate to the USAGE.md file and go through the steps to use this project yourself!
This was my first NLP project, that too in Python!
It was a nice experience learning the basics of what NLP is, the NLP pipeline, and text pre-processing and representation, and then using these concepts in actual code.
One of the resources (in Hindi) I found really helpful was this YouTube playlist. The videos were insightful and helped me understand my requirements and plan of action while making this project.
Not only did I get familiar with the basics of pandas, but a part of this project also focused heavily on fetching YouTube comments using the Google API. Coding that with the help of documentation, references and other resources available online turned out to be a profound adventure of its own.
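For anyone curious what fetching comments can look like, here is a minimal sketch using the YouTube Data API v3 `commentThreads.list` endpoint through the `google-api-python-client` package. It is not the project's actual fetching code; the helper names `build_client` and `fetch_comments`, the `max_pages` limit, and the returned dictionary shape are assumptions for illustration.

```python
def build_client(api_key):
    # Imported lazily so the rest of this sketch works without the package.
    # Requires: pip install google-api-python-client
    from googleapiclient.discovery import build

    return build("youtube", "v3", developerKey=api_key)


def fetch_comments(youtube, video_id, max_pages=5):
    """Collect text and like count of top-level comments, page by page."""
    comments = []
    page_token = None
    for _ in range(max_pages):
        params = {
            "part": "snippet",
            "videoId": video_id,
            "maxResults": 100,  # the API's per-page maximum
            "textFormat": "plainText",
        }
        if page_token:
            params["pageToken"] = page_token
        response = youtube.commentThreads().list(**params).execute()
        for item in response.get("items", []):
            snippet = item["snippet"]["topLevelComment"]["snippet"]
            comments.append(
                {"text": snippet["textDisplay"], "likes": snippet["likeCount"]}
            )
        # A missing nextPageToken means there are no further pages.
        page_token = response.get("nextPageToken")
        if not page_token:
            break
    return comments


# Usage (needs a real API key and video ID, so it is left commented out):
# youtube = build_client("YOUR_API_KEY")
# comments = fetch_comments(youtube, "VIDEO_ID")
```

Passing the client in as a parameter keeps the pagination logic easy to test with a stand-in object, without spending API quota.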
I have tried my best to structure the code nicely, and have also spent considerable time speeding up the text pre-processing. However, if anyone could help with better code or overall project organisation, or with more optimised methods in various parts of the project, that would be highly appreciated!
Even README contributions would be of profound help!
I hope you found this project and its explanation valuable. Let me know about anything that could be made better. Thanks for your time!