
Natural Language Processing

The core component of Natural Language Processing (NLP) is extracting information from human language.

The need to analyze and understand text data is growing every day. More and more of the data generated today is free text:

  • Web: blogs, comments, reviews, notes
  • Social media: messages, hashtags, references
  • Operations: logs, trails
  • Emails
  • Voice transcriptions

Volume and lack of structure pose additional challenges for acquiring, processing, and analyzing text.

Document: A collection of sentences that represent a specific fact or entity.

Corpus: A collection of similar documents.

Main Application Topics

  • Sentiment analysis
  • Topic modeling
  • Text classification
  • Sentence segmentation or part-of-speech tagging
  • ...

General Pipeline

  1. Raw text
  2. Tokenize - tell the model what to look at
  3. Clean text - remove stop words/punctuation, stemming, etc.
  4. Vectorize - convert to numeric form
  5. Machine Learning algorithm - fit/train model
  6. Model Selection
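
A rough end-to-end sketch of steps 1-5 using NLTK and scikit-learn; the toy messages, labels, and the clean_text helper are assumptions for illustration, not the repository's own pipeline.

```python
# A rough end-to-end sketch of the pipeline (toy data; clean_text is a hypothetical helper)
import re
import string

import nltk
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)
stop_words = set(nltk.corpus.stopwords.words("english"))
stemmer = nltk.PorterStemmer()

def clean_text(text):
    """Steps 2-3: tokenize, drop punctuation and stop words, stem."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    tokens = re.split(r"\W+", text)
    return [stemmer.stem(tok) for tok in tokens if tok and tok not in stop_words]

# 1. Raw text (hypothetical labeled messages)
data = pd.DataFrame({"body_text": ["FREE entry!! Win a prize now",
                                   "are we still meeting for lunch today?"],
                     "label": ["spam", "ham"]})

# 4. Vectorize (the analyzer runs steps 2-3 on every document)
vectorizer = TfidfVectorizer(analyzer=clean_text)
X = vectorizer.fit_transform(data["body_text"])

# 5. Fit a model and predict on new text
model = RandomForestClassifier().fit(X, data["label"])
print(model.predict(vectorizer.transform(["win a free prize today"])))
```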

Unstructured Data

Binary data, no delimiters, no indication of rows

Cleansing Text

  • Formatting and standardization (e.g., dates)
  • Remove punctuation
  • Remove abbreviations
  • Case conversion
  • Remove elements like hashtags
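
A rough sketch of a few of these steps with the standard library and regular expressions; the sample string and regexes are illustrative, not a complete cleaning recipe.

```python
import re
import string

text = "Check out https://example.com!!! #NLP is AWESOME :)"

cleaned = text.lower()                                  # case conversion
cleaned = re.sub(r"http\S+", "", cleaned)               # drop URLs
cleaned = re.sub(r"#\w+", "", cleaned)                  # drop hashtags and similar elements
cleaned = "".join(ch for ch in cleaned                  # remove punctuation
                  if ch not in string.punctuation)
cleaned = re.sub(r"\s+", " ", cleaned).strip()          # collapse whitespace
print(cleaned)  # -> "check out is awesome"
```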

Stop-word Removal

A group of words that carry no meaning by themselves (in, and, the, which).

  • Not required for analytics
  • A standard or custom stop-words dictionary can be used.
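
For example, filtering against NLTK's standard English stop-word list (the token list here is made up):

```python
import nltk

nltk.download("stopwords", quiet=True)
stop_words = set(nltk.corpus.stopwords.words("english"))

tokens = ["the", "model", "learned", "the", "structure", "in", "the", "corpus"]
filtered = [tok for tok in tokens if tok not in stop_words]
print(filtered)  # ['model', 'learned', 'structure', 'corpus']
```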

Stemming

A stem is the base part of a word, to which affixes can be attached to form derivatives.

Stemming keeps only the base word, thus reducing the total words in the corpus.

Though they may carry different affixes, words that share the same stem have similar semantic meaning. Stemming can determine that 'learned' and 'learning', despite their different affixes, both contain the same root word 'learn'.

  • Reduces the corpus of words the model is exposed to
  • Explicitly correlates words with similar meanings
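
A quick illustration with NLTK's PorterStemmer (one common stemmer; others, like the Snowball stemmer, behave slightly differently):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["learned", "learning", "learns"]
print([stemmer.stem(w) for w in words])  # ['learn', 'learn', 'learn']

# The stem is not always a dictionary word
print(stemmer.stem("studies"))  # 'studi'
```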

Lemmatizing

Similar to stemming, but produces a proper root word that belongs to the language

Uses a dictionary to match words to their root word

The process of grouping together the inflected forms of a word so they can be analyzed as a single term, identified by the word's lemma. It uses vocabulary analysis to remove inflectional endings and return the dictionary (base) form of a word.
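
A quick illustration with NLTK's WordNetLemmatizer; supplying the part of speech usually improves the result:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("geese"))              # 'goose' - a real dictionary word
print(lemmatizer.lemmatize("meanness"))           # 'meanness' - a stemmer would chop this to 'mean'
print(lemmatizer.lemmatize("learning", pos="v"))  # 'learn' - treated as a verb
```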

Stemming vs. Lemmatizing

  • The goal of both is to condense derived words into their base forms.
    • Stemming is typically faster, as it simply chops off the end of a word using heuristics, without any understanding of the context in which the word is used.
    • Lemmatizing is typically more accurate, as it uses more informed analysis to create groups of words with similar meaning based on the context around the word.

Parts-of-Speech (POS) Tagging

  • POS tagging involves identifying the part of speech for each word in a corpus
  • Used for entity recognition, filtering, and sentiment analysis
  • Part-of-speech tagging is used by chatbots to understand natural language and sentiment.

    Word     POS   Description
    Man      NN    Noun
    Engage   VBP   Verb, present tense
    Top      JJ    Adjective
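
A small example with NLTK's default tagger (resource names vary slightly across NLTK versions; the sentence is made up):

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The man engaged the top engineer")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('man', 'NN'), ('engaged', 'VBD'), ('the', 'DT'),
#       ('top', 'JJ'), ('engineer', 'NN')]
```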

Vectorization

Raw text needs to be converted to numbers so that Python and the machine learning algorithms can understand it.

Vectorizing: Process of encoding text as integers to create feature vectors.

Feature vector: An n-dimensional vector of numerical features that represent some object.

Vectorizers should be fit on the training set and then used only to transform the test set.

Types

  • Count vectorization

  • N-grams

    Creates a document-term matrix where counts still occupy the cells, but instead of the columns representing single terms, they represent all combinations of adjacent words of length n in the text.

    Ex: "NLP is an interesting topic"

    n   Name        Tokens
    2   bigram      ['NLP is', 'is an', 'an interesting', 'interesting topic']
    3   trigram     ['NLP is an', 'is an interesting', 'an interesting topic']
    4   four-gram   ['NLP is an interesting', 'is an interesting topic']
  • TF-IDF: Term frequency-inverse document frequency

    $$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$$

    $tf_{i,j}$ = number of times term $i$ occurs in document $j$ divided by the total number of terms in $j$; $df_i$ = number of documents containing term $i$; $N$ = total number of documents
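
A sketch of all three vectorizers with scikit-learn on a made-up corpus; note that scikit-learn's TF-IDF uses a slightly smoothed variant of the formula above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["NLP is an interesting topic",
          "NLP is fun",
          "this topic is about machine learning"]

# Count vectorization: document-term matrix of raw token counts
counts = CountVectorizer().fit_transform(corpus)

# N-grams: columns are adjacent word pairs (bigrams) instead of single terms
bigram_vect = CountVectorizer(ngram_range=(2, 2))
bigram_vect.fit(corpus)
print(bigram_vect.get_feature_names_out())

# TF-IDF: counts down-weighted for terms that appear in many documents
tfidf_vect = TfidfVectorizer()
X_train = tfidf_vect.fit_transform(corpus)      # fit on training text ...
X_test = tfidf_vect.transform(["NLP is fun"])   # ... only transform the test text
print(X_train.shape, X_test.shape)
```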

Feature Engineering

Creating new features or transforming your existing features to get the most out of your data.

Creating New Features

  • Length of text field
  • Percentage of characters that are punctuation in the text
  • Percentage of characters that are capitalized
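
For example, computing these three features on a hypothetical DataFrame with a body_text column:

```python
import string

import pandas as pd

data = pd.DataFrame({"body_text": ["FREE entry!! Win a PRIZE now!!",
                                   "are we still meeting for lunch today?"]})

def punct_pct(text):
    """Percentage of non-space characters that are punctuation."""
    count = sum(1 for ch in text if ch in string.punctuation)
    return round(100 * count / (len(text) - text.count(" ")), 1)

def capital_pct(text):
    """Percentage of non-space characters that are capitalized."""
    count = sum(1 for ch in text if ch.isupper())
    return round(100 * count / (len(text) - text.count(" ")), 1)

data["body_len"] = data["body_text"].apply(lambda t: len(t) - t.count(" "))
data["punct_pct"] = data["body_text"].apply(punct_pct)
data["capital_pct"] = data["body_text"].apply(capital_pct)
print(data)
```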

Transformations

Process that alters each data point in a certain column in a systematic way

  • Power transformations (e.g., $x^2$, $\sqrt{x}$) - see the sketch after this list

    • Transformation Process
      1. Determine what range of exponents to test.
      2. Apply each transformation to each value of your chosen feature.
      3. Use some criteria to determine which of the transformations yield the best distribution.
    • Box-Cox Power Transformations (Base Form: $y^x$)

      X      Base Form    Transformation
      -2     $y^{-2}$     $\frac{1}{y^2}$
      -1     $y^{-1}$     $\frac{1}{y}$
      -0.5   $y^{-1/2}$   $\frac{1}{\sqrt{y}}$
      0      $y^0$        $\log(y)$
      0.5    $y^{1/2}$    $\sqrt{y}$
      1      $y^1$        $y$
      2      $y^2$        $y^2$
  • Standardizing data
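
A small sketch of the transformation process above: apply a grid of Box-Cox style exponents to a hypothetical skewed feature and use skewness as the selection criterion (other criteria, such as normality tests, also work).

```python
import numpy as np
from scipy.stats import skew

# Hypothetical right-skewed feature (e.g. message lengths)
x = np.array([5, 7, 8, 12, 15, 22, 40, 95, 180], dtype=float)

# 1-2. Choose a range of exponents and apply each transformation
# 3.   Compare distributions - here, lower absolute skew is better
for exponent in [-2, -1, -0.5, 0.5, 1, 2]:
    print(exponent, round(float(skew(np.power(x, exponent))), 3))
print("log", round(float(skew(np.log(x))), 3))  # the x = 0 member of the Box-Cox family
```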

Machine Learning

"The field of study that gives computers the ability to learn without being explicitly programmed." (Arthur Samuel, 1959)

"A computer program is said to learn from experience E with respect to some task T and som performance measure P, if its performance on T, as measured by P, improves with experience E." (Tom Mitchell, 1998)

"Algorithms that 'can figure out how to perform important tasks by generalizing from examples'" (University of Washington, 2012)

"Practice of using algorithms to parse data, learn from it, and then make a determination or prediction about something in the world" (NVIDIA, 2016)

Two Broad Types of Machine Learning:

Supervised Learning: Inferring a function from labeled training data to make predictions on unseen data

Unsupervised Learning: Deriving structure from data where we don't know the effect of any of the variables

Holdout Test Set

A sample of data not used in fitting a model, for the purpose of evaluating the model's ability to generalize to unseen data.

K-Fold Cross-Validation: The full data set is divided into k-subsets and the holdout method is repeated k times. Each time, one of the k-subsets is used as the test set and the other k-1 subsets are put together to be used to train the model.
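
A brief sketch of both ideas with scikit-learn; a synthetic feature matrix stands in for vectorized text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for a vectorized text dataset
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Holdout: keep 20% of the data out of training for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# K-fold cross-validation: 5 rotations of train on 4 folds, test on the 5th
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())
```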

Evaluation Metrics

$$\text{Accuracy} = \frac{\#\ \text{predicted correctly}}{\text{total}\ \#\ \text{of observations}}$$

$$\text{Precision} = \frac{\#\ \text{predicted as spam that are actually spam}}{\text{total}\ \#\ \text{predicted as spam}}$$

$$\text{Recall} = \frac{\#\ \text{predicted as spam that are actually spam}}{\text{total}\ \#\ \text{that are actually spam}}$$
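
For example, with a handful of hypothetical spam-filter predictions (1 = spam, 0 = ham):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))    # 6 of 8 correct = 0.75
print("precision:", precision_score(y_true, y_pred))   # 3 of 4 predicted spam are spam = 0.75
print("recall   :", recall_score(y_true, y_pred))      # 3 of 4 actual spam were caught = 0.75
```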

Ensemble Method

A technique that creates multiple models and then combines them to produce better results than any of the single models individually.

Random Forest

An ensemble learning method that constructs a collection of decision trees and then aggregates the predictions of each tree to determine the final prediction.

  • Can be used for classification or regression
  • Easily handles outliers, missing values, etc.
  • Accepts various types of inputs (continuous, ordinal,etc.)
  • Less likely to overfit
  • Outputs feature importance

Grid-search: Exhaustively search all parameter combinations in a given grid to determine the best model.

Cross-validation: Divide a dataset into k subsets and repeat the holdout method k times where a different subset is used as the holdout set in each iteration.
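
A compact sketch combining both ideas with scikit-learn; the feature matrix is synthetic, the parameter grid is an arbitrary example, and the same pattern applies to the GradientBoostingClassifier covered below.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a vectorized text dataset
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {"n_estimators": [10, 50, 150],
              "max_depth": [10, 30, None]}

# Exhaustively try every combination, each scored with 5-fold cross-validation
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```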

Gradient Boosting

An ensemble learning method that takes an iterative approach to combining weak learners into a strong learner by focusing on the mistakes of prior iterations.

Trade-offs of Gradient Boosting

  • Pros
    • Extremely powerful
    • Accepts various types of inputs
    • Can be used for classification or regression
    • Outputs feature importance
  • Cons
    • Longer to train (can't parallelize)
    • More likely to overfit
    • More difficult to properly tune

Random Forest vs. Gradient Boosting

Both are ensemble methods based on decision trees.

Random Forest                             Gradient Boosting
Bagging                                   Boosting
Training done in parallel                 Training done iteratively
Unweighted voting for final prediction    Weighted voting for final prediction
Easier to tune, harder to overfit         Harder to tune, easier to overfit

Model Selection

Process

  1. Split the data into training and test set.
  2. Fit vectorizers on the training set and use them to transform the test set.
  3. Fit best random forest model and best gradient boosting model on training set and predict on test set.
  4. Thoroughly evaluate the results of these two models to select the best one.

Further evaluation:

- Slice test set
- Examine text messages the model is getting wrong

Results trade-off: consider business context

- Is predict time of 0.213 vs. 0.135 going to create a bottleneck?
- Precision/recall
    + Spam filter - optimize for precision
    + Antivirus software - optimize for recall
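
A rough sketch of this comparison: fit both models (default parameters here for brevity, rather than the tuned ones), time fit and predict, and look at precision and recall; the synthetic matrix stands in for the vectorized messages.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized text features
X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for model in [RandomForestClassifier(random_state=0), GradientBoostingClassifier(random_state=0)]:
    start = time.time()
    model.fit(X_train, y_train)
    fit_time = time.time() - start

    start = time.time()
    y_pred = model.predict(X_test)
    pred_time = time.time() - start

    print(type(model).__name__,
          "fit:", round(fit_time, 3), "predict:", round(pred_time, 3),
          "precision:", round(precision_score(y_test, y_pred), 3),
          "recall:", round(recall_score(y_test, y_pred), 3))
```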

Embeddings

word2vec

word2vec is a shallow, two-layer neural network that accepts a text corpus as input and returns a set of vectors (also known as embeddings); each vector is a numeric representation of a given word.

"You shall know a word by the company it keeps."

  • gensim package pre-trained Embeddings:
    • glove-twitter-{25/50/100/200}
    • glove-wiki-gigaword-{50/200/300}
    • word2vec-google-news-300
    • word2vec-ruscorpora-news-300
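
A minimal sketch of loading one of these pre-trained embeddings with gensim's downloader (the model file is fetched on first use):

```python
import gensim.downloader as api

wv = api.load("glove-twitter-25")        # downloads the vectors on first call

print(wv["king"][:5])                    # first 5 of the 25 dimensions for 'king'
print(wv.most_similar("king", topn=3))   # words that keep similar company
```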

doc2vec

doc2vec is a shallow, two-layer neural network that accepts a text corpus as input and returns a set of vectors (aka embeddings); each vector is a numeric representation of a given sentence, paragraph, or document.

Pre-trained Document Vectors

There are not as many options as there are for word vectors. There is also no easy API for reading them in like there is for word2vec, so it is more time-consuming.

Pre-trained vectors from training on Wikipedia and Associated Press News can be found [here](https://github.com/jhlau/doc2vec).
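
Alternatively, a small doc2vec model can be trained directly with gensim; the toy corpus and parameters here are purely illustrative.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["NLP is an interesting topic",
          "machine learning needs numeric features",
          "doc2vec turns whole documents into vectors"]

# Each training document is wrapped in a TaggedDocument with a unique tag
tagged = [TaggedDocument(words=doc.lower().split(), tags=[i])
          for i, doc in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# infer_vector() produces an embedding for a new, unseen document
vector = model.infer_vector("an interesting nlp topic".split())
print(vector[:5])
```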

Recurrent Neural Network

Pattern matching through the connection of many very simple functions to create one very powerful function; this function has an understanding of the data's sequential nature, using feedback loops that form a sense of memory.
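
A minimal sketch of such a network with tf.keras; the vocabulary size, sequence length, random data, and the binary spam/ham framing are all assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

vocab_size, max_len = 5000, 50

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 32),       # word embeddings
    tf.keras.layers.LSTM(32),                        # recurrent layer: the feedback loop acts as memory
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output, e.g. spam vs. ham
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Hypothetical integer-encoded, padded sequences and labels
X = np.random.randint(0, vocab_size, size=(100, max_len))
y = np.random.randint(0, 2, size=(100,))
model.fit(X, y, epochs=1, batch_size=32)
```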

Best Practices

Storing Text Data

  • Use suitable free-format big-data storage for text (HDFS, S3, or Google Cloud Storage)
  • Create indexes on key data elements for easy access (MongoDB, Elasticsearch)
  • Store processed text such as tokens and TF-IDF vectors

Processing text data

  • Filter text as early as possible in the processing cycle
  • Use an exhaustive and context-specific stop-word list
  • Identify domain-specific data for special use
  • Eliminate data with low frequency
  • Build a clean and indexed corpus

Scalable processing of text data

  • Use technologies that allow parallel access and storage (Kafka, HDFS, MongoDB, and so on)
  • Process each document independently with map() functions (in Hadoop or Apache Spark)
  • Use reduce() functions late in the processing cycle
