Natural-Language-Processing

In the recent years, Natural Language Processing (NLP) has experienced fast growth primarily due to the performance of the language models:

GPT-3
BERT

Classical NLP architectures:

Bag-of-words (BoW)
Word embeddings
Recurrent neural networks
Generative networks.

Task in NLP:

Text classification
Intent classification
Sentiment analysis
Named Entity Recognition
Keyword extraction
Text Summarization
Question/Answer

Datasets:

AG_NEWS
IMDB

If we want to solve Natural Language Processing (NLP) tasks with neural networks, we need some way to represent text as tensors. Computers already represent textual characters as numbers that map to fonts on your screen using encodings such as ASCII or UTF-8.

We can use different approaches when representing text:

Character-level representation: Each letter would correspond to a tensor column in one-hot encoding.
Word-level representation when we create a vocabulary of all words in our text sequence or sentence(s), and then represent each word using one-hot encoding.

To unify those approaches, we typically call an atomic piece of text a token. In some cases tokens can be letters, in other cases - words, or parts of words. The process of converting text into a sequence of tokens is called tokenization. Next, we need to assign each token to a number, which we can feed into a neural network. This is called vectorization, and is normally done by building a token vocabulary. For this we can use this function in PyTorch:

torchtext.data.utils.get_tokenizer
torchtext.vocab.vocab
collection.Count
encode and decode with vocab.get_stoi()[string]

Now, to build text classification model, we need to feed the whole sentence (or whole news article) into a neural network. The problem here is that each article/sentence has variable length; and all fully-connected or convolution neural networks deal with fixed input size. There are two ways we can handle this problem:

Find a way to collapse a sentence into fixed-length vector *Bag-of-Words and TF-IDF representations.
Design special neural network architectures that can deal with variable length sequences *Recurrent neural networks (RNN).

Bag of Words and TF-IDF

Bag of Words (BoW)

Vector representation is the most commonly used traditional vector representation. Each word is linked to a vector index, vector element contains the number of occurrences of a word in a given document.

In BoW representation, word occurrences are evenly weighted, regardless of the word itself. However, it is clear that frequent words, such as 'a', 'in', 'the' etc. are much less important for the classification, than specialized terms. In fact, in most NLP tasks some words are more relevant than others.

TF-IDF

Stands for term frequency–inverse document frequency. It is a variation of bag of words, where instead of a binary 0/1 value indicating the appearance of a word in a document, a floating-point value is used, which is related to the frequency of word occurrence in the corpus.

The formula to calculate TF-IDF is: $w_{ij} = tf_{ij}\times\log({N\over df_i})$

Here's the meaning of each parameter in the formula:

$i$ is the word
$j$ is the document
$w_{ij}$ is the weight or the importance of the word in the document
$tf_{ij}$ is the number of occurrences of the word $i$ in the document $j$, i.e. the BoW value we have seen before
$N$ is the number of documents in the collection
$df_i$ is the number of documents containing the word $i$ in the whole collection

TF-IDF value $w_{ij}$ increases proportionally to the number of times a word appears in a document and is offset by the number of documents in the corpus that contains the word, which helps to adjust for the fact that some words appear more frequently than others. For example, if the word appears in every document in the collection, $df_i=N$, and $w_{ij}=0$, and those terms would be completely disregarded.

Even though TF-IDF representation calculates different weights to different words according to their importance, it is unable to correctly capture the meaning, largely because the order of words in the sentence is still not taken into account.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
BoW_TF-IDF_PyTorch.ipynb		BoW_TF-IDF_PyTorch.ipynb
GenerateText_PyTorch.ipynb		GenerateText_PyTorch.ipynb
README.md		README.md
RNN_PyTorch.ipynb		RNN_PyTorch.ipynb
WordEmbeddings_PyTorch.ipynb		WordEmbeddings_PyTorch.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Natural-Language-Processing

Bag of Words and TF-IDF

Bag of Words (BoW)

TF-IDF

Word embeddings

Recurrent Neural Networks

Generate text with RNN

Pre-trained models

About

Releases

Packages

Languages

galois1915/Natural-Language-Processing

Folders and files

Latest commit

History

Repository files navigation

Natural-Language-Processing

Bag of Words and TF-IDF

Bag of Words (BoW)

TF-IDF

Word embeddings

Recurrent Neural Networks

Generate text with RNN

Pre-trained models

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages