This repository builds a small Romanian BERT model on the CoRoLa corpus.
It requires the Romanian-aware WordPiece tokenizer that is available here.
BERT parameters are:
- maximum sequence length: 256
- L = 4 (number of stacked encoder layers)
- H = 256 (size of the hidden layer)
- number of attention heads: 8
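As a rough sanity check on model size, the settings above can be plugged into a standard BERT parameter-count formula. This is an illustrative sketch, not code from this repository; the vocabulary size below is an assumption, since the actual tokenizer's vocabulary size is not stated in this README.

```python
def bert_param_count(vocab_size, hidden=256, layers=4, max_pos=256, intermediate=None):
    """Approximate parameter count of a BERT encoder.

    vocab_size is a placeholder assumption; hidden/layers/max_pos default
    to the values listed above (H=256, L=4, max sequence length 256).
    """
    if intermediate is None:
        intermediate = 4 * hidden  # BERT's usual feed-forward expansion
    # token + position + token-type embeddings, plus embedding LayerNorm
    emb = (vocab_size + max_pos + 2) * hidden + 2 * hidden
    # Q, K, V, and output projections, each hidden x hidden with bias
    attn = 4 * (hidden * hidden + hidden)
    # two feed-forward projections with biases
    ffn = hidden * intermediate + intermediate + intermediate * hidden + hidden
    # two LayerNorms per layer (weight + bias each)
    ln = 2 * (2 * hidden)
    return emb + layers * (attn + ffn + ln)

# e.g. with a hypothetical 30k-token vocabulary:
print(bert_param_count(30_000))
```

With a 30k vocabulary this lands on the order of 11M parameters, i.e. well within "small BERT" territory.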
The model is pretrained with the masked language modeling (MLM) objective, with `mlm_probability=0.15`.
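The MLM corruption scheme can be sketched as follows. This is a toy illustration of the standard BERT recipe (15% of positions selected; 80% of those replaced by `[MASK]`, 10% by a random token, 10% left unchanged), not the repository's actual data-collation code; `mask_id` and `vocab_size` are assumed placeholders.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Toy BERT-style MLM corruption.

    Returns (corrupted inputs, labels); labels are -100 (ignored by the
    loss) everywhere except at the selected positions, where they hold
    the original token id.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_probability:  # select ~15% of positions
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id                 # 80%: [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels
```

In practice this is what e.g. a Hugging Face MLM data collator does under the hood when given the same masking probability.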
Training takes approximately 30 days on an NVIDIA Quadro RTX 8000 with 48 GB of VRAM.
The Romanian BERT model requires the Romanian WordPiece tokenizer `ro-wordpiece-tokenizer` (see above).
Clone it and this repository (`ro-corola-bert-small`) into the same parent folder, then, from inside the `ro-corola-bert-small` folder, run `export PYTHONPATH=../ro-wordpiece-tokenizer`.
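The expected layout and the environment setup look like this (the clone step is shown as comments only, since the repository URLs are not given in this README):

```shell
# Assumed layout after cloning both repositories side by side:
#   parent/
#   ├── ro-wordpiece-tokenizer/
#   └── ro-corola-bert-small/   <- run the training scripts from here
# From inside ro-corola-bert-small, make the tokenizer importable:
export PYTHONPATH=../ro-wordpiece-tokenizer
```

With `PYTHONPATH` set this way, Python resolves the tokenizer modules from the sibling checkout without any installation step.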