Latest commit: He Hao, add nltk.download('omw-1.4'), 59351e5, Sep 27, 2022 (58 commits)

Name	Last commit message	Last commit date
.github	Create FUNDING.yml	Jan 1, 2021
test_results	added unit test folders	May 12, 2020
tests	fixed unit test to handle pyspellchecker bug	Jan 1, 2021
text_preprocessing	add nltk.download('omw-1.4')	Sep 27, 2022
.gitignore	init commit	May 12, 2020
DESCRIPTION.rst	minor tweat on README and DESCRIPTION	May 15, 2020
LICENSE	Update LICENSE	May 15, 2020
MANIFEST.in	updated MANIFEST.in	May 12, 2020
Makefile	added Makefile	May 12, 2020
README.md	added udf function to preprocess text in PySpark	Jun 25, 2020
__init__.py	Bump version: 0.1.0 → 0.1.1	Sep 27, 2022
requirements.txt	specify names-dataset version to 2.1	Sep 27, 2022
setup.cfg	Bump version: 0.1.0 → 0.1.1	Sep 27, 2022
setup.py	Bump version: 0.1.0 → 0.1.1	Sep 27, 2022

Text preprocessing for Natural Language Processing

A Python package for text preprocessing tasks in natural language processing.

Usage

To use this text preprocessing package, first install it using pip:

pip install text-preprocessing

Then import the package in your Python script and call the appropriate functions:

from text_preprocessing import preprocess_text
from text_preprocessing import to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word

# Preprocess text using default preprocess functions in the pipeline 
text_to_process = 'Helllo, I am John Doe!!! My email is john.doe@email.com. Visit our website www.johndoe.com'
preprocessed_text = preprocess_text(text_to_process)
print(preprocessed_text)
# output: hello email visit website

# Preprocess text using custom preprocess functions in the pipeline 
preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word]
preprocessed_text = preprocess_text(text_to_process, preprocess_functions)
print(preprocessed_text)
# output: helllo i am john doe my email is visit our website
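
The custom pipeline above applies each function to the output of the previous one, so the order of the list matters. The following is a minimal, standard-library sketch of that chaining idea. It is not the package's implementation: `run_pipeline` and every `*_sketch` function are hypothetical stand-ins written for illustration, and the regular expressions are deliberately simple.

```python
import re
from functools import reduce
from typing import Callable, Sequence


# Hypothetical stand-ins for illustration only; NOT the package's own functions.
def to_lower_sketch(text: str) -> str:
    return text.lower()


def remove_email_sketch(text: str) -> str:
    # Simplistic pattern: any run of non-space characters containing '@'.
    return re.sub(r'\S+@\S+', '', text)


def remove_punctuation_sketch(text: str) -> str:
    # Drop everything that is neither a word character nor whitespace.
    return re.sub(r'[^\w\s]', '', text)


def collapse_whitespace_sketch(text: str) -> str:
    return ' '.join(text.split())


def run_pipeline(text: str, functions: Sequence[Callable[[str], str]]) -> str:
    """Feed the text through each function in order (assumed chaining behavior)."""
    return reduce(lambda acc, fn: fn(acc), functions, text)


steps = [
    to_lower_sketch,
    remove_email_sketch,        # must run before punctuation removal strips '@' and '.'
    remove_punctuation_sketch,
    collapse_whitespace_sketch,
]
result = run_pipeline('Hello!!! My email is john.doe@email.com.', steps)
print(result)
# output: hello my email is
```

In this sketch, swapping the email and punctuation steps would leave `johndoeemailcom` in the text, because the `@` is gone before the email pattern runs. The same ordering concern is worth keeping in mind when building a custom list for `preprocess_text`.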

If you have a lot of data to preprocess and would like to run text preprocessing in parallel in PySpark on Databricks, use the following UDF wrapper:

from text_preprocessing import preprocess_text
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.sql import DataFrame as SparkDataFrame


def preprocess_text_spark(df: SparkDataFrame, 
                          target_column: str, 
                          preprocessed_column_name: str = 'preprocessed_text'
                         ) -> SparkDataFrame:
    """ Preprocess text in a column of a PySpark DataFrame by leveraging PySpark UDF to preprocess text in parallel """
    _preprocess_text = udf(preprocess_text, StringType())
    new_df = df.withColumn(preprocessed_column_name, _preprocess_text(df[target_column]))
    return new_df

Features

Feature	Function
convert to lower case	to_lower
convert to upper case	to_upper
keep only alphabetic and numerical characters	keep_alpha_numeric
check and correct spellings	check_spelling
expand contractions	expand_contraction
remove URLs	remove_url
remove names	remove_name
remove emails	remove_email
remove phone numbers	remove_phone_number
remove SSNs	remove_ssn
remove credit card numbers	remove_credit_card_number
remove numbers	remove_number
remove bullets and numbering	remove_itemized_bullet_and_numbering
remove special characters	remove_special_character
remove punctuation	remove_punctuation
remove extra whitespace	remove_whitespace
normalize unicode (e.g., café -> cafe)	normalize_unicode
remove stop words	remove_stopword
tokenize words	tokenize_word
tokenize sentences	tokenize_sentence
substitute custom words (e.g., vs -> versus)	substitute_token
stem words	stem_word
lemmatize words	lemmatize_word
preprocess text through a sequence of preprocessing functions	preprocess_text
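
To make a few rows of the table concrete, here is a standard-library sketch of what unicode normalization, token substitution, and whitespace removal can look like. These `*_sketch` functions are hypothetical illustrations, not the package's code, and their signatures are assumptions; the package's own functions may behave differently.

```python
import unicodedata
from typing import Dict, List


def normalize_unicode_sketch(text: str) -> str:
    # Decompose accented characters (NFKD), then drop anything outside ASCII,
    # which discards the combining accent marks.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def substitute_token_sketch(tokens: List[str], mapping: Dict[str, str]) -> List[str]:
    # Replace tokens found in the mapping; leave all others unchanged.
    return [mapping.get(token, token) for token in tokens]


def remove_whitespace_sketch(text: str) -> str:
    # Trim the ends and collapse internal runs of whitespace to one space.
    return ' '.join(text.split())


print(normalize_unicode_sketch('café'))
# output: cafe
print(substitute_token_sketch(['cats', 'vs', 'dogs'], {'vs': 'versus'}))
# output: ['cats', 'versus', 'dogs']
print(remove_whitespace_sketch('  too   many\tspaces '))
# output: too many spaces
```

Note that the ASCII-only approach in `normalize_unicode_sketch` also drops characters that have no ASCII decomposition, which may or may not be acceptable for a given corpus.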
