Text Normalization Project

This repository contains normalize.py, a script that normalizes the text of a given .txt file. Normalization converts text into a consistent, standardized format, which makes subsequent analysis, comparison, and other text-processing tasks easier.

Features

Convert all characters to lowercase.
Remove punctuation.
Remove extra whitespace (e.g., multiple spaces, tabs, newlines).
Remove numerals (digit characters).
Remove common stopwords (like "and", "the", "is").
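
The steps above can be sketched in a few lines of standard-library Python. This is only an illustration, not the contents of normalize.py: the function name `normalize`, the order of the steps, and the tiny `STOPWORDS` set (a stand-in for the NLTK stopword list) are all assumptions made for the example.

```python
import re
import string

# Small illustrative stopword set (assumption); the project uses NLTK's list.
STOPWORDS = {"and", "the", "is", "a", "an", "of", "to", "in"}


def normalize(text, stopwords=STOPWORDS):
    """Apply the listed normalization steps to a string."""
    # 1. Lowercase everything.
    text = text.lower()
    # 2. Strip ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Drop digit characters.
    text = re.sub(r"\d+", "", text)
    # 4. Split on any whitespace (collapses spaces, tabs, newlines)
    #    and discard stopwords.
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


print(normalize("The  quick, brown fox\tis 3 years old!"))
# quick brown fox years old
```

Note that stripping punctuation before tokenizing joins contractions ("don't" becomes "dont"); whether that is desirable depends on the downstream task.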

Dependencies

Python 3
NLTK library (Natural Language Toolkit)

Installation

Clone this repository:

```
git clone https://github.com/btrinos/text-normalization.git
cd text-normalization
```

(Optional) It's a good practice to create a virtual environment:

```
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

Install the required Python packages:
```
pip install nltk
```
Download the NLTK stopwords corpus by running the following in a Python interpreter (you only need to do this once):
```
import nltk
nltk.download('stopwords')
```
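
If you would rather not depend on the corpus already being present, a loader along the following lines is one option. It is a hedged sketch, not part of normalize.py: `load_stopwords`, its `download` flag, and `FALLBACK_STOPWORDS` are hypothetical names, and the fallback set is deliberately tiny.

```python
# Minimal fallback used when NLTK or its corpus is unavailable (assumption).
FALLBACK_STOPWORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})


def load_stopwords(language="english", download=False):
    """Return a set of stopwords, preferring NLTK's corpus when available."""
    try:
        import nltk
        from nltk.corpus import stopwords
    except ImportError:
        # NLTK is not installed at all.
        return set(FALLBACK_STOPWORDS)
    try:
        return set(stopwords.words(language))
    except LookupError:
        # The corpus has not been downloaded yet.
        if download and nltk.download("stopwords", quiet=True):
            return set(stopwords.words(language))
        return set(FALLBACK_STOPWORDS)
```

Calling `load_stopwords(download=True)` attempts the one-time download automatically; the default never touches the network.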

Usage

Place the .txt file you want to normalize in the same directory as the normalize.py script.
Rename the file to document.txt or modify the script to read from your filename.
Run the script:
```
python normalize.py
```
The normalized content will be saved in normalized_document.txt.
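
If you prefer passing the filename on the command line to renaming the file, the script could be adapted roughly as follows. This is a sketch under assumptions: `normalize_file`, `output_name`, and `main` are hypothetical helpers, and `str.lower` stands in for the real normalization function.

```python
from pathlib import Path


def output_name(src):
    """Derive the output path, e.g. document.txt -> normalized_document.txt."""
    path = Path(src)
    return path.with_name(f"normalized_{path.name}")


def normalize_file(src, dst, normalize=str.lower):
    """Read src, apply the normalize callable, and write the result to dst."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(normalize(text), encoding="utf-8")
    return Path(dst)


def main(argv):
    """Use the first argument as the input file, defaulting to document.txt."""
    src = argv[0] if argv else "document.txt"
    return normalize_file(src, output_name(src))
```

To use it as a script, call `main(sys.argv[1:])` at the bottom of the file (after `import sys`) and run `python normalize.py notes.txt`, which would write normalized_notes.txt next to the input.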

Contributing

Feel free to fork this repository, make changes, and submit pull requests. Any contribution, whether it refines the algorithm, improves the documentation, or adds a feature, is highly appreciated!

License

This project is licensed under the MIT License. See LICENSE for more information.
