README.md: 16 additions & 15 deletions
@@ -1,4 +1,5 @@
# BERT
+
**\*\*\*\*\* New March 11th, 2020: Smaller BERT Models \*\*\*\*\***

This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962).
@@ -78,15 +79,15 @@ the pre-processing code.
In the original pre-processing code, we randomly select WordPiece tokens to
mask. For example:

-`Input Text: the man jumped up, put his basket on phil ##am ##mon ' s head`
-`Original Masked Input: [MASK] man [MASK] up, put his [MASK] on phil
+`Input Text: the man jumped up, put his basket on Phil ##am ##mon ' s head`
+`Original Masked Input: [MASK] man [MASK] up, put his [MASK] on Phil
[MASK] ##mon ' s head`

The new technique is called Whole Word Masking. In this case, we always mask
-*all* of the the tokens corresponding to a word at once. The overall masking
+*all* of the tokens corresponding to a word at once. The overall masking
rate remains the same.

-`Whole Word Masked Input: the man [MASK] up, put his basket on [MASK][MASK]
+`Whole Word Masked Input: the man [MASK] up, put his basket on [MASK][MASK]
[MASK] ' s head`

The training is identical -- we still predict each masked WordPiece token
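
To make the whole-word grouping concrete, here is a minimal Python sketch of the idea. It is illustrative only, not the repository's pre-processing code; the `group_into_words` and `whole_word_mask` helpers and the fixed number of masked words are assumptions for this example.

```python
import random

def group_into_words(tokens):
    """Group WordPiece indices into whole-word spans: a piece starting with
    "##" belongs to the word opened by the previous non-"##" piece."""
    spans = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and spans:
            spans[-1].append(i)
        else:
            spans.append([i])
    return spans

def whole_word_mask(tokens, num_words_to_mask, rng=random.Random(0)):
    """Mask *all* pieces of each selected word at once (whole word masking)."""
    output = list(tokens)
    for span in rng.sample(group_into_words(tokens), num_words_to_mask):
        for i in span:
            output[i] = "[MASK]"
    return output

tokens = "the man jumped up , put his basket on phil ##am ##mon ' s head".split()
print(group_into_words(tokens))              # the span [9, 10, 11] covers phil ##am ##mon
print(" ".join(whole_word_mask(tokens, num_words_to_mask=3)))
```

Because `phil ##am ##mon` forms a single span, its pieces are either all replaced by `[MASK]` or all kept, which is the behavior described above.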
@@ -127,10 +128,10 @@ Mongolian \*\*\*\*\***
We uploaded a new multilingual model which does *not* perform any normalization
on the input (no lower casing, accent stripping, or Unicode normalization), and
-additionally inclues Thai and Mongolian.
+additionally includes Thai and Mongolian.

**It is recommended to use this version for developing multilingual models,
-especially on languages with non-Latin alphabets.**
+especially in languages with non-Latin alphabets.**

This does not require any code changes, and can be downloaded here:

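For reference, here is a rough sketch of the kind of input normalization the uncased models apply and this cased multilingual model skips. It uses plain `unicodedata`, not the repository's tokenizer, and the example string is made up.

```python
import unicodedata

def old_style_normalize(text):
    """Lower casing plus accent stripping, i.e. the kind of normalization the
    uncased models apply and this cased multilingual model does *not*."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(old_style_normalize("Où est la Bibliothèque?"))  # -> ou est la bibliotheque?
```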
@@ -236,7 +237,7 @@ and contextual representations can further be *unidirectional* or
[GloVe](https://nlp.stanford.edu/projects/glove/) generate a single "word
embedding" representation for each word in the vocabulary, so `bank` would have
the same representation in `bank deposit` and `river bank`. Contextual models
-instead generate a representation of each word that is based on the other words
+instead, generate a representation of each word that is based on the other words
in the sentence.

BERT was built upon recent work in pre-training contextual representations —
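
A toy numeric illustration of that distinction, not BERT itself: the four-dimensional random vectors and the averaging "encoder" below are stand-ins chosen only to show that a static table gives `bank` one vector everywhere, while a context-dependent encoder does not.

```python
import numpy as np

rng = np.random.default_rng(0)
# Static (non-contextual) embeddings: one fixed vector per vocabulary word.
static_table = {w: rng.normal(size=4) for w in ["bank", "deposit", "river"]}

def static_embed(sentence):
    return [static_table[w] for w in sentence.split()]

def contextual_embed(sentence):
    # Toy "contextual" encoder: add the sentence mean to each word vector,
    # so the same word gets a different vector when its neighbors change.
    vecs = static_embed(sentence)
    mean = sum(vecs) / len(vecs)
    return [v + mean for v in vecs]

# "bank" has the identical static vector in both phrases ...
print(np.allclose(static_embed("bank deposit")[0], static_embed("river bank")[1]))          # True
# ... but different contextual vectors, because the surrounding words differ.
print(np.allclose(contextual_embed("bank deposit")[0], contextual_embed("river bank")[1]))  # False
```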
@@ -270,14 +271,14 @@ and `B`, is `B` the actual next sentence that comes after `A`, or just a random
sentence from the corpus?

```
-Sentence A: the man went to the store.
-Sentence B: he bought a gallon of milk.
+Sentence A: the man went to the store.
+Sentence B: he bought a gallon of milk.
Label: IsNextSentence
```

```
-Sentence A: the man went to the store.
-Sentence B: penguins are flightless.
+Sentence A: the man went to the store.
+Sentence B: penguins are flightless.
Label: NotNextSentence
```

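A minimal sketch of how such sentence pairs could be generated for pre-training. It is illustrative only; the repository's actual data pipeline also handles tokenization, sequence lengths, and document boundaries, and the `make_nsp_example` helper is an assumption for this example.

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences, rng=random.Random(0)):
    """Build one next-sentence-prediction example: with probability 0.5 use the
    actual next sentence (IsNextSentence), otherwise a random corpus sentence
    (NotNextSentence). A real pipeline would also avoid accidentally sampling
    the true next sentence."""
    i = rng.randrange(len(doc_sentences) - 1)
    sentence_a = doc_sentences[i]
    if rng.random() < 0.5:
        sentence_b, label = doc_sentences[i + 1], "IsNextSentence"
    else:
        sentence_b, label = rng.choice(corpus_sentences), "NotNextSentence"
    return sentence_a, sentence_b, label

doc = ["the man went to the store.", "he bought a gallon of milk."]
corpus = doc + ["penguins are flightless."]
print(make_nsp_example(doc, corpus))
```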
@@ -405,7 +406,7 @@ Please see the
for how to use Cloud TPUs. Alternatively, you can use the Google Colab notebook
"[BERT FineTuning with Cloud TPUs](https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)".

-On Cloud TPUs, the pretrained model and the output directory will need to be on
+On Cloud TPUs, the pre-trained model and the output directory will need to be on
Google Cloud Storage. For example, if you have a bucket named `some_bucket`, you
might use the following flags instead:

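As a rough illustration of why this works, TensorFlow's file utilities treat `gs://` paths much like local paths, so checkpoints and outputs can be read and written directly from a bucket. The bucket layout and file name below are assumptions, and running this requires GCS credentials.

```python
import tensorflow as tf

# Assumed bucket layout, for illustration only; running this needs GCS access.
output_dir = "gs://some_bucket/my_output_dir"
tf.io.gfile.makedirs(output_dir)
with tf.io.gfile.GFile(output_dir + "/notes.txt", "w") as f:
    f.write("outputs for this run are stored in GCS\n")
print(tf.io.gfile.listdir(output_dir))
```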
@@ -477,7 +478,7 @@ that it's running on something other than a Cloud TPU, which includes a GPU.

Once you have trained your classifier you can use it in inference mode by using
the --do_predict=true command. You need to have a file named test.tsv in the
-input folder. Output will be created in file called test_results.tsv in the
+input folder. The output will be created in file called test_results.tsv in the
output folder. Each line will contain output for each sample, columns are the
class probabilities.

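For example, a small script along these lines could turn those per-class probability rows into predicted class indices. The file name comes from the passage above; the tab-separated layout and the label ordering are assumptions to verify against your task's processor.

```python
import csv

# Read the per-example class probabilities written to test_results.tsv and
# report the highest-probability class for each test example.
with open("test_results.tsv") as f:
    for i, row in enumerate(csv.reader(f, delimiter="\t")):
        probs = [float(p) for p in row]
        print(f"example {i}: predicted class {probs.index(max(probs))}  probs={probs}")
```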
@@ -499,7 +500,7 @@ python run_classifier.py \

### SQuAD 1.1

-The Stanford Question Answering Dataset (SQuAD) is a popular questionanswering
+The Stanford Question Answering Dataset (SQuAD) is a popular question-answering
benchmark dataset. BERT (at the time of the release) obtains state-of-the-art
results on SQuAD with almost no task-specific network architecture modifications
or data augmentation. However, it does require semi-complex data pre-processing