berknology/text-preprocessing

Text preprocessing for Natural Language Processing

A Python package for text preprocessing tasks in natural language processing.

Usage

To use this text preprocessing package, first install it using pip:

pip install text-preprocessing

Then import the package in your Python script and call the appropriate functions:

from text_preprocessing import preprocess_text
from text_preprocessing import to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word

# Preprocess text with the default pipeline of preprocessing functions
text_to_process = 'Helllo, I am John Doe!!! My email is john.doe@email.com. Visit our website www.johndoe.com'
preprocessed_text = preprocess_text(text_to_process)
print(preprocessed_text)
# output: hello email visit website

# Preprocess text with a custom pipeline of preprocessing functions
preprocess_functions = [to_lower, remove_email, remove_url, remove_punctuation, lemmatize_word]
preprocessed_text = preprocess_text(text_to_process, preprocess_functions)
print(preprocessed_text)
# output: helllo i am john doe my email is visit our website
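The custom pipeline above passes the text through each function in list order, so the order of the list affects the result. The following is a minimal, standard-library sketch of that idea. All names in it (`lower`, `drop_email`, `drop_punctuation`, `squeeze_whitespace`, `run_pipeline`) are hypothetical stand-ins written for this illustration, not the package's own functions or its actual implementation.

```python
import re
import string
from typing import Callable, List


# Hypothetical stand-ins for illustration only; NOT the package's functions.
def lower(text: str) -> str:
    return text.lower()


def drop_email(text: str) -> str:
    # Deliberately crude pattern: removes any whitespace-delimited chunk containing "@".
    return re.sub(r"\S+@\S+", "", text)


def drop_punctuation(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))


def squeeze_whitespace(text: str) -> str:
    return " ".join(text.split())


def run_pipeline(text: str, functions: List[Callable[[str], str]]) -> str:
    """Apply each function to the output of the previous one, in list order."""
    for function in functions:
        text = function(text)
    return text


sample = "Helllo, I am John Doe!!! My email is john.doe@email.com."

result = run_pipeline(sample, [lower, drop_email, drop_punctuation, squeeze_whitespace])
print(result)
# helllo i am john doe my email is

# Stripping punctuation first removes "@" and ".", so the email is no longer detected.
reordered = run_pipeline(sample, [lower, drop_punctuation, drop_email, squeeze_whitespace])
print(reordered)
# helllo i am john doe my email is johndoeemailcom
```

In this sketch, pattern-based removals such as emails need to run before punctuation is stripped. The same consideration likely applies when ordering a custom list for `preprocess_text`.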

If you have a lot of data to preprocess and would like to run text preprocessing in parallel with PySpark on Databricks, use the following UDF-based function:

from text_preprocessing import preprocess_text
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.sql import DataFrame as SparkDataFrame


def preprocess_text_spark(df: SparkDataFrame,
                          target_column: str,
                          preprocessed_column_name: str = 'preprocessed_text'
                          ) -> SparkDataFrame:
    """Preprocess a text column of a PySpark DataFrame in parallel using a PySpark UDF."""
    _preprocess_text = udf(preprocess_text, StringType())
    new_df = df.withColumn(preprocessed_column_name, _preprocess_text(df[target_column]))
    return new_df
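One practical point when wrapping a function in a PySpark UDF: a SQL NULL in the target column reaches the Python function as `None`. Whether `preprocess_text` accepts `None` is not stated here, so the sketch below assumes it might not and shows one way to guard against it. The `null_safe` helper is a hypothetical name introduced for this example, and `str.upper` stands in for a real preprocessing function so that the snippet runs without Spark.

```python
from typing import Callable, Optional


def null_safe(function: Callable[[str], str]) -> Callable[[Optional[str]], Optional[str]]:
    """Wrap a str -> str function so that None is passed through untouched."""
    def wrapper(text: Optional[str]) -> Optional[str]:
        if text is None:
            return None
        return function(text)
    return wrapper


# str.upper is only a stand-in for a preprocessing function (assumption for the demo).
safe_upper = null_safe(str.upper)
print(safe_upper("hello"))
# HELLO
print(safe_upper(None))
# None
```

Under that assumption, the wrapped function could be registered in place of the bare one, for example `udf(null_safe(preprocess_text), StringType())`, so that rows with missing text yield a null result rather than an error.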

Features

| Feature | Function |
| --- | --- |
| convert to lower case | to_lower |
| convert to upper case | to_upper |
| keep only alphabetic and numerical characters | keep_alpha_numeric |
| check and correct spellings | check_spelling |
| expand contractions | expand_contraction |
| remove URLs | remove_url |
| remove names | remove_name |
| remove emails | remove_email |
| remove phone numbers | remove_phone_number |
| remove SSNs | remove_ssn |
| remove credit card numbers | remove_credit_card_number |
| remove numbers | remove_number |
| remove bullets and numbering | remove_itemized_bullet_and_numbering |
| remove special characters | remove_special_character |
| remove punctuation | remove_punctuation |
| remove extra whitespace | remove_whitespace |
| normalize unicode (e.g., café -> cafe) | normalize_unicode |
| remove stop words | remove_stopword |
| tokenize words | tokenize_word |
| tokenize sentences | tokenize_sentence |
| substitute custom words (e.g., vs -> versus) | substitute_token |
| stem words | stem_word |
| lemmatize words | lemmatize_word |
| preprocess text through a sequence of preprocessing functions | preprocess_text |
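To make two of the listed transformations concrete, here is a small standard-library sketch of what "normalize unicode (café -> cafe)" and "substitute custom words (vs -> versus)" can look like. The functions `strip_accents` and `substitute` are hypothetical names written for this illustration. They are not the package's `normalize_unicode` or `substitute_token`, whose signatures and behavior may differ.

```python
import unicodedata
from typing import Dict, List


def strip_accents(text: str) -> str:
    """Decompose accented characters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def substitute(tokens: List[str], mapping: Dict[str, str]) -> List[str]:
    """Replace each token found in the mapping; leave all other tokens unchanged."""
    return [mapping.get(token, token) for token in tokens]


print(strip_accents("café"))
# cafe
print(substitute(["cats", "vs", "dogs"], {"vs": "versus"}))
# ['cats', 'versus', 'dogs']
```

This approach silently discards characters that have no ASCII decomposition, so it suits accent folding in Latin-script text more than multilingual input.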