Need clarification for pre-training #13

The README.md says this about the pre-training input:
```
It is important that these be actual sentences 
for the "next sentence prediction" task
```
and in the example `sample_text.txt`, every line does end with either `.` or `;`.

The BERT paper, however, says:
```
... we sample two spans of text from the corpus, which we refer to as "sentences" 
even though they are typically much longer than single sentences 
(but can be shorter also)
```

So it is unclear whether this implementation expects one **actual** sentence per line, or whether documents can simply be broken into multiple lines arbitrarily.
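
To make the two readings concrete, here is a rough sketch of how the same document would be split under each. The helper names `split_into_sentences` and `split_into_chunks` are hypothetical and not part of this repository, the regex sentence splitter is deliberately naive, and the 8-word chunk size is an arbitrary choice for illustration.

```python
import re

def split_into_sentences(document):
    """Reading 1: one actual sentence per line (naive split after ., ! or ?)."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]

def split_into_chunks(document, words_per_line=8):
    """Reading 2: arbitrary spans that ignore sentence boundaries."""
    words = document.split()
    return [" ".join(words[i:i + words_per_line])
            for i in range(0, len(words), words_per_line)]

doc = ("BERT is pre-trained on two tasks. "
       "One of them is next sentence prediction. "
       "It pairs two spans of text.")

sentence_lines = split_into_sentences(doc)  # each line is a full sentence
chunk_lines = split_into_chunks(doc)        # lines may cut sentences in half

for line in sentence_lines:
    print("sentence:", line)
for line in chunk_lines:
    print("chunk:   ", line)
```

Under the first reading every line ends at a sentence boundary, as in `sample_text.txt`. Under the second, a line can start or stop mid-sentence. The question is which of these the pre-training script assumes.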
