After following this lesson, learners will be able to:
- Implement a basic NLP Pipeline
- Build a Document-Term Matrix
- Understand the concept of word embeddings
- Use and Explore Word2Vec models
- Use word vectors as features for a classifier
:::
## Introduction
We have already done some basic data pre-processing in the introduction.
::: callout
- Preprocessing approaches significantly affect the quality of the training when working with word embeddings. For example, [Rahimi & Homayounpour (2022)](https://link.springer.com/article/10.1007/s10579-022-09620-5) demonstrated that for text classification and sentiment analysis, the removal of punctuation and stopwords leads to higher performance.
- You do not always need to do all the preprocessing steps, and which ones you should apply depends on what you want to do. For example, if you want to segment text into sentences, then characters such as '.', ',' or '?' are the most important; if you want to extract Named Entities from text, you explicitly do not want to lowercase the text, as capitals are a component of the identification process; and if you are interested in gender bias, you definitely want to keep the pronouns.

:::
This was the end of our preprocessing step. Let's look at what tokens we have extracted and how frequently they occur in the text.
```python
import matplotlib.pyplot as plt
from collections import Counter

# count the frequency of occurrence of each token
token_counts = Counter(tokens_no_punct)

# get the top n most common tokens (otherwise the plot would be too crowded) and their frequencies
most_common = token_counts.most_common(100)
tokens = [item[0] for item in most_common]
frequencies = [item[1] for item in most_common]

plt.figure(figsize=(12, 6))
plt.bar(tokens, frequencies)
plt.xlabel('Tokens')
plt.ylabel('Frequency')
plt.title('Token Frequencies')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
```

As one can see, words in a text follow a very specific [skewed distribution](https://link.springer.com/article/10.3758/s13423-014-0585-6): a few very high-frequency words (e.g., articles, conjunctions) account for most of the tokens in the text, while many words occur at low frequency.
### 5. Stop word removal
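The code for this step falls outside this excerpt; below is a minimal sketch using NLTK's stop word list, assuming the `tokens_no_punct` list from the previous step and that the text is Dutch (an assumption based on the newspaper corpus used later in the episode; swap the language if yours differs).

```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# build the stop word list (assumed language: Dutch; change as needed)
stop_words = set(stopwords.words('dutch'))

# keep only the tokens that are not stop words
tokens_no_stopwords = [t for t in tokens_no_punct if t.lower() not in stop_words]
```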
This is also sometimes known as a bag-of-words, as it ignores grammar and word sequence.

- Doc 3: "Language processing with computers is NLP"
- Doc 4: "Today it rained a lot"

| Term     | Doc1 | Doc2 | Doc3 | Doc4 |
|----------|------|------|------|------|
| natural  | 1    | 1    | 0    | 0    |
| language | 1    | 1    | 1    | 0    |
We can represent each document by taking its column and treating it as a vector.
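Such a matrix can also be built in code; the sketch below uses scikit-learn's `CountVectorizer`, with placeholder strings standing in for Doc 1 and Doc 2, whose text is not shown in this excerpt.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Doc 1 and Doc 2 are placeholders (their text is not shown in this excerpt);
# Doc 3 and Doc 4 are the example documents above
docs = [
    "placeholder text for doc 1",
    "placeholder text for doc 2",
    "Language processing with computers is NLP",
    "Today it rained a lot",
]

# the vocabulary (the rows of the table above) is built automatically from the documents
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

# rows are documents, columns are terms
print(vectorizer.get_feature_names_out())
print(dtm.toarray())
```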
## What are word embeddings?
A word embedding is a type of word representation that maps words into numerical vectors in a multidimensional space, capturing their meaning based on their characteristics or context. Since similar words occur in similar contexts, or share characteristics, a properly trained model will learn to assign similar vectors to similar words.
Let's illustrate this concept using animals. This example will show us an intuitive way of representing things as vectors.
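Part of the illustration falls outside this excerpt; the sketch below reconstructs the setup that the similarity scores and the challenge further down rely on. The two dimensions and their values are taken from the first two entries of the extended vectors in the challenge solution, and `cosine_similarity` is assumed to come from scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# two illustrative dimensions per animal (values match the first two entries
# of the extended vectors used in the challenge solution below)
cat = np.asarray([[70, 4]])
dog = np.asarray([[56, 4]])
caterpillar = np.asarray([[70, 100]])

print(cosine_similarity(cat, dog))          # high: cat and dog are alike
print(cosine_similarity(cat, caterpillar))  # lower: the caterpillar differs
```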
The higher similarity score between the cat and the dog indicates they are more similar to each other than either is to the caterpillar.

:::: challenge

- Add one or two other dimensions. What characteristics could they map?
- Add another animal and map its dimensions
- Compute the cosine similarity among those animals again and find the least similar and the most similar couple

::: solution

1. Add one or two other dimensions

We could add the dimension of "velocity" or "speed", going from 0 to 100 meters/second.

- Caterpillar: 0.001 m/s
- Cat: 1.5 m/s
- Dog: 2.5 m/s

(just as an example, actual speeds may vary)

```python
cat = np.asarray([[70, 4, 1.5]])
dog = np.asarray([[56, 4, 2.5]])
caterpillar = np.asarray([[70, 100, .001]])
```

Another dimension could be weight in kg:

- Caterpillar: 0.05 kg
- Cat: 4 kg
- Dog: 15 kg

(just as an example, actual weights may vary)

```python
cat = np.asarray([[70, 4, 1.5, 4]])
dog = np.asarray([[56, 4, 2.5, 15]])
caterpillar = np.asarray([[70, 100, .001, .05]])
```

Then the cosine similarity would be:

```python
cosine_similarity(cat, caterpillar)
cosine_similarity(cat, dog)
```

Output:

```python
array([[0.61814254]])
array([[0.97893809]])
```

2. Add another animal and map its dimensions

Another animal that we could add is the tarantula!

```python
cat = np.asarray([[70, 4, 1.5, 4]])
dog = np.asarray([[56, 4, 2.5, 15]])
caterpillar = np.asarray([[70, 100, .001, .05]])
tarantula = np.asarray([[80, 6, .1, .3]])
```

3. Compute the cosine similarity among those animals again and find the most and least similar couple

Given the values above, the least similar couple is the dog and the caterpillar, whose cosine similarity is `array([[0.60855407]])`.

The most similar couple is the cat and the tarantula: `array([[0.99822302]])`.

:::

::::

By representing words as vectors with multiple dimensions, we capture more nuances of their meanings or characteristics.

## Explore the Word2Vec Vector Space

There are two main architectures for training Word2Vec:

- Continuous Bag-of-Words (CBOW): Predicts a target word based on its surrounding context words.
- Continuous Skip-Gram: Predicts surrounding context words given a target word.

::: callout
CBOW is faster to train, while Skip-Gram is more effective for infrequent words. Increasing the context size improves embeddings but also increases training time.
:::

We will be using CBOW. We are interested in having vectors with 300 dimensions and a context size of 5 surrounding words. We include all words present in the corpora, regardless of their frequency of occurrence, and use 4 CPU cores for training. All these specifics are translated into only one line of code, which we will see in the training challenge below.

We can already inspect the output of this training by checking the top 5 most similar words to "maan" (moon):

```python
word_vectors.most_similar('maan', topn=5)
```
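The scores returned by `most_similar` are the same cosine similarity we computed by hand for the animal vectors. A minimal check, assuming `word_vectors` holds the loaded model and that 'zon' (sun) is in its vocabulary:

```python
import numpy as np

# cosine similarity computed by hand from the raw vectors...
v1 = word_vectors['maan']
v2 = word_vectors['zon']
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# ...matches gensim's built-in similarity
print(word_vectors.similarity('maan', 'zon'))
```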
### Load the embeddings and inspect them
When semantic change occurs, the words in a word's context *also* change. We can trace how a word evolves semantically over time by comparing it with other similar words: if a word acquires a new meaning, its most similar words will not stay fixed across the years.
We proceed to load our models. We will load the pre-trained model from the original Word2Vec paper, which was trained on a big corpus of Google News. The library `gensim` provides the `KeyedVectors` class, which allows us to load it.
TODO: add the code from the notebook here, with simple word2vec operations: load existing vectors, test analogies, find neighbours, etc.
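As a sketch of the kind of operations meant here, assuming the Google News vectors have been downloaded locally (the path and filename below are assumptions; adjust them to your setup):

```python
from gensim.models import KeyedVectors

# load existing vectors (assumed local path and filename)
word_vectors = KeyedVectors.load_word2vec_format(
    'data/GoogleNews-vectors-negative300.bin.gz', binary=True
)

# find the nearest neighbours of a word
print(word_vectors.most_similar('moon', topn=5))

# test an analogy: king - man + woman is expected to be close to queen
print(word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))

# similarity between two words
print(word_vectors.similarity('car', 'truck'))
```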
### Use Word2Vec vectors as features for a classifier
TODO: build, step by step, a very simple sentiment classifier (logistic regression) that uses word2vec vectors as input. (Maybe this is too advanced?)
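A minimal sketch of what such a classifier could look like, assuming `word_vectors` holds loaded embeddings; the labelled examples are made up purely for illustration, and each document is represented by averaging the vectors of its words:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# toy labelled data (placeholders): 1 = positive, 0 = negative
texts = [["great", "movie"], ["terrible", "plot"],
         ["wonderful", "acting"], ["boring", "film"]]
labels = [1, 0, 1, 0]

def document_vector(tokens, kv):
    """Average the vectors of the tokens found in the model's vocabulary."""
    vectors = [kv[t] for t in tokens if t in kv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(kv.vector_size)

X = np.array([document_vector(doc, word_vectors) for doc in texts])
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))
```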
::: callout
## Dataset size in training

To obtain your own high-quality embeddings, the size of the training dataset plays a crucial role. Generally, [tens of thousands of documents](https://cs.stanford.edu/~quocle/paragraph_vector.pdf) are considered a reasonable amount of data for decent results.
Is there, however, a strict minimum? Not really. Keep in mind that `vocabulary size`, `document length` and `desired vector size` interact with each other: the higher the dimensionality of the vectors (e.g., 200-300 dimensions), the more data is required, and it must be of high quality, i.e., data that allows words to be learned in a variety of contexts.
While word2vec models typically perform better with large datasets containing millions of words, using a single page is sufficient for demonstration and learning purposes. This smaller dataset allows us to train the model quickly and understand how word2vec works without the need for extensive computational resources.
:::
Now we will train a two-layer neural network to transform our tokens into word embeddings. We will be using the library `gensim`, and the model we will use is called `Word2Vec`, developed by Tomas Mikolov et al. in 2013.
:::: challenge

## Train your own Word2Vec model

1. Load the necessary libraries. See the Gensim [documentation](https://radimrehurek.com/gensim/models/word2vec.html).
2. Prepare the data (a sketch of all three steps follows the code block below).
3. Train your own model:
```python
model = Word2Vec([tokens_no_stopwords], vector_size=300, window=5, min_count=1, workers=4, sg=0)
```
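A minimal sketch of the three steps together, assuming the preprocessed `tokens_no_stopwords` list from earlier in the episode; Gensim expects an iterable of tokenised sentences, so the single page is wrapped as one "sentence":

```python
# step 1: load the library
from gensim.models import Word2Vec

# step 2: prepare the data as a list of tokenised sentences
sentences = [tokens_no_stopwords]

# step 3: train a CBOW model (sg=0) with 300-dimensional vectors
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=4, sg=0)

# the trained vectors live in model.wv
print(model.wv.most_similar(tokens_no_stopwords[0], topn=5))
```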
::::
We have trained our model on only one page of the newspaper, and the training was very quick. However, to approach our problem it's best to train the model on the entire dataset. We don't have the resources to do that on our local laptop, but luckily for us, [Wevers, M (2019)](https://zenodo.org/records/3237380) has already done that and released the models publicly. Let's download them to our laptop and save them in a folder called `w2v`.
```python
folder_path = 'data/w2v/'
```
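A sketch of how the downloaded models could then be opened; the exact filenames and storage format depend on the release, so the filename below is a placeholder:

```python
import os
from gensim.models import KeyedVectors

# list whatever model files were downloaded into the folder
print(os.listdir(folder_path))

# hypothetical filename: replace it with one of the files listed above,
# and set binary=False if the file is in text format
word_vectors = KeyedVectors.load_word2vec_format(
    os.path.join(folder_path, 'example-model.bin'), binary=True
)
```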
::: keypoints
- The first step for working with text is to run a preprocessing pipeline to obtain clean features
- We can represent text as vectors of numbers (which makes it interpretable for machines)
- One of the most efficient and useful ways to do so is to use word embeddings
- We can easily compute how similar words are to each other with the cosine similarity

:::
- Understand how a Transformer works and recognize its different use cases.
- Understand how to use pre-trained transformers (Use Case: BERT).
- Use BERT to classify texts.
- Use BERT as a Named Entity Recognizer.
- Understand assumptions and basic evaluation for NLP outputs.

:::
Static word embeddings such as Word2Vec can be used to represent each word as a unique vector. Vector representations also allow us to apply numerical operations that map to syntactic and semantic properties of words, such as solving analogies or finding synonyms. Once we transform words into vectors, these can also be used as **features** for classifiers that can be trained to predict any supervised NLP task.