This repository contains the normalize.py script, which normalizes text from a given .txt file. Normalization converts text into a consistent, standardized format, which makes subsequent analysis, comparison, and other text-processing tasks easier.
- Convert all characters to lowercase.
- Remove punctuation.
- Remove extra whitespace (e.g., multiple spaces, tabs, newlines).
- Remove numeric characters (digits).
- Remove common stopwords (like "and", "the", "is").
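
The steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the actual contents of normalize.py: the function name `normalize_text`, the order of the steps, and the tiny `STOPWORDS` set are assumptions made so the example runs without any downloads. The script itself relies on the NLTK stopword list, which is considerably longer.

```python
import re
import string

# Stand-in stopword set for illustration only (an assumption). With NLTK
# installed and the corpus downloaded, a fuller list is available via
# nltk.corpus.stopwords.words("english").
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to"}


def normalize_text(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation and digits, collapse whitespace, drop stopwords."""
    # Convert all characters to lowercase.
    text = text.lower()
    # Remove ASCII punctuation (string.punctuation does not cover Unicode marks).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove digits.
    text = re.sub(r"\d+", "", text)
    # Splitting on whitespace discards extra spaces, tabs, and newlines.
    tokens = text.split()
    # Drop stopwords, then rejoin with single spaces.
    return " ".join(t for t in tokens if t not in stopwords)
```

For example, `normalize_text("The  quick, brown fox\tis 3 years old!")` returns `"quick brown fox years old"`.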
- Python 3
- NLTK library (Natural Language Toolkit)
- Clone this repository:
  git clone https://github.com/btrinos/text-normalization.git
  cd text-normalization
- (Optional) It's a good practice to create a virtual environment:
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- Install the required Python packages:
  pip install nltk
- Download the NLTK stopwords (you only need to do this once):
  import nltk
  nltk.download('stopwords')
- Place the .txt file you want to normalize in the same directory as the normalize.py script.
- Rename the file to document.txt or modify the script to read from your filename.
- Run the script:
  python normalize.py
- The normalized content will be saved in normalized_document.txt.
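
If renaming the input file is inconvenient, one possible way to modify the script is to accept the filename as a command-line argument. The sketch below is hypothetical and is not taken from normalize.py: `normalize_file`, `main`, and the `-o/--output` option are illustrative names, and `normalize_text` is a placeholder that only lowercases and collapses whitespace. The defaults mirror the filenames used above.

```python
import argparse
from pathlib import Path


def normalize_text(text):
    # Placeholder for the real normalization logic (an assumption):
    # lowercases the text and collapses runs of whitespace.
    return " ".join(text.lower().split())


def normalize_file(input_path="document.txt", output_path="normalized_document.txt"):
    """Read input_path, normalize its contents, and write the result to output_path."""
    text = Path(input_path).read_text(encoding="utf-8")
    Path(output_path).write_text(normalize_text(text), encoding="utf-8")
    return output_path


def main(argv=None):
    # Both arguments are optional, so running with no arguments keeps
    # the document.txt -> normalized_document.txt behavior.
    parser = argparse.ArgumentParser(description="Normalize a text file.")
    parser.add_argument("input", nargs="?", default="document.txt")
    parser.add_argument("-o", "--output", default="normalized_document.txt")
    args = parser.parse_args(argv)
    normalize_file(args.input, args.output)
```

Calling `main(["notes.txt", "-o", "notes_clean.txt"])` would read notes.txt and write notes_clean.txt. To use this from the command line, call `main()` at the end of the script.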
Feel free to fork this repository, make changes, and submit pull requests. All contributions are highly appreciated, whether they refine the algorithm, improve the documentation, or add features!
This project is licensed under the MIT License. See LICENSE for more information.