This project contains a Python script that parses an HTML file of the Bible and creates a treemap visualization of the most common words and phrases.
Here’s a high-level idea of how similar sentences or phrases could be grouped and counted together, rather than counting only exact matches:
- Preprocess the text: This could involve cleaning the text, removing stop words, and possibly lemmatizing words.
- Convert sentences into vectors: Use an NLP model to convert each sentence into a vector. This could be a simple Bag-of-Words model, TF-IDF, or more complex models like Word2Vec, GloVe, BERT, etc.
- Calculate similarity: For each sentence, calculate its similarity to all other sentences. This could be done using cosine similarity, which is a common measure for the similarity between vectors.
- Group sentences: Based on their similarities, group sentences together. This could be done using a clustering algorithm like K-means.
- Count groups: Instead of counting identical sentences, count the number of sentences in each group.
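The steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the project's implementation: the stop-word list, the function names, and the `threshold` value are all hypothetical, vectors are plain Bag-of-Words counts, and a simple greedy threshold grouping stands in for K-means to keep the example dependency-free.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was"}


def preprocess(sentence):
    """Lowercase, keep alphabetic tokens, drop stop words (no lemmatization)."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def vectorize(tokens):
    """Bag-of-Words vector stored as a sparse word -> count mapping."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_sentences(sentences, threshold=0.6):
    """Greedy grouping (a stand-in for K-means): each sentence joins the
    first group whose representative vector is similar enough."""
    groups = []  # list of (representative_vector, member_sentences)
    for sentence in sentences:
        vec = vectorize(preprocess(sentence))
        for representative, members in groups:
            if cosine_similarity(vec, representative) >= threshold:
                members.append(sentence)
                break
        else:
            groups.append((vec, [sentence]))
    return [members for _, members in groups]


def count_groups(groups):
    """Return (first sentence of group, group size), largest group first."""
    pairs = [(members[0], len(members)) for members in groups]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

Calling `count_groups(group_sentences(sentences))` on a list of sentences gives sizes that could feed the treemap in place of exact-match counts. The greedy approach depends on input order and on the chosen threshold; a clustering library would be a reasonable replacement for larger inputs.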
These instructions will get you a copy of the project up and running on your local machine.
You need to have Python installed on your machine. You also need the following Python libraries:
- BeautifulSoup
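- collections (part of the Python standard library, no installation needed)
- re (part of the Python standard library, no installation needed)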
- matplotlib
- squarify
You can install the third-party libraries (BeautifulSoup, matplotlib, and squarify) using pip:
pip install beautifulsoup4 matplotlib squarify
To run the script, navigate to the directory containing the script and run the following command:
python words.py
or
python phrases.py
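The scripts themselves are not reproduced here. As an illustration only, the word-counting and treemap steps might look roughly like the sketch below. The function names `top_words` and `draw_treemap` are hypothetical, and a crude regex is used to strip tags so the counting step runs without BeautifulSoup; the actual scripts may differ.

```python
import re
from collections import Counter


def top_words(html_text, n=10):
    """Return the n most common words in an HTML string."""
    # Crude tag stripping with a regex, standing in for an HTML parser
    # such as BeautifulSoup.
    text = re.sub(r"<[^>]+>", " ", html_text)
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)


def draw_treemap(counts):
    """Draw a treemap from (word, count) pairs."""
    # Imported lazily so the counting step works without the plotting libraries.
    import matplotlib.pyplot as plt
    import squarify

    labels = [f"{word}\n{count}" for word, count in counts]
    sizes = [count for _, count in counts]
    squarify.plot(sizes=sizes, label=labels)
    plt.axis("off")
    plt.show()
```

For example, `draw_treemap(top_words(html_text, 20))` would plot the twenty most frequent words, with each rectangle's area proportional to its count.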
Charlie
This project is licensed under the MIT License.
Thanks to OpenAI for providing the initial guidance for this project.