
Description
The README.md says this about the pre-training input:
It is important that these be actual sentences for the "next sentence prediction" task
and every line of the example sample_text.txt does end with either . or ;.
The BERT paper, however, says:
... we sample two spans of text from the corpus, which we refer to as "sentences" even though they are typically much longer than single sentences (but can be shorter also)
So it is unclear whether this implementation expects one actual sentence per line, or whether documents can simply be broken into lines at arbitrary points.
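
To make the two readings concrete, here is a minimal Python sketch of both ways a document could be turned into lines. It is not taken from the repository: the function names `one_sentence_per_line` and `arbitrary_line_breaks` are hypothetical, the regex splitter is deliberately naive (it only breaks after ., ;, ! or ?), and the 40-character width is an arbitrary choice for illustration.

```python
import re
import textwrap


def one_sentence_per_line(document: str) -> list[str]:
    """Reading 1: each line is an actual sentence (naive split, an assumption)."""
    # Break at whitespace that follows ., ;, ! or ?; the punctuation stays attached.
    parts = re.split(r"(?<=[.;!?])\s+", document.strip())
    return [part for part in parts if part]


def arbitrary_line_breaks(document: str, width: int = 40) -> list[str]:
    """Reading 2: lines are cut at a fixed width, ignoring sentence boundaries."""
    return textwrap.wrap(document, width=width)


doc = (
    "BERT is trained on unlabeled text. "
    "Each input pair holds two spans; "
    "the second may or may not follow the first."
)

sentences = one_sentence_per_line(doc)
chunks = arbitrary_line_breaks(doc)

print("\n".join(sentences))
print("---")
print("\n".join(chunks))
```

Under the first reading every line ends at a sentence boundary, as in sample_text.txt; under the second, a line can start or stop mid-sentence. The question is which of these two outputs the pre-training script is meant to receive.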