# Romanian Transformers

This repo is meant as a space to centralize Romanian transformer models and to provide a uniform evaluation for them. Contributions are welcome.

We're using HuggingFace's Transformers library, an awesome tool for NLP. What's BERT, you ask? Here's a clear and condensed article about what BERT is and what it can do. Also check out this summary of the different transformer models.

What follows is the list of Romanian transformer models, covering both masked and causal language models.

Feel free to open an issue and add your model/eval here!

## Masked Language Models (MLMs)

| Model | Type | Size | Article/Citation/Source | Pre-trained / Fine-tuned | Release Date |
|---|---|---|---|---|---|
| dumitrescustefan/bert-base-romanian-cased-v1 | BERT | 124M | PDF / Cite | Pre-trained | Apr 2020 |
| dumitrescustefan/bert-base-romanian-uncased-v1 | BERT | 124M | PDF / Cite | Pre-trained | Apr 2020 |
| racai/distilbert-base-romanian-cased | DistilBERT | 81M | - | Pre-trained | Apr 2021 |
| readerbench/RoBERT-small | BERT | 19M | PDF | Pre-trained | May 2021 |
| readerbench/RoBERT-base | BERT | 114M | PDF | Pre-trained | May 2021 |
| readerbench/RoBERT-large | BERT | 341M | PDF | Pre-trained | May 2021 |
| dumitrescustefan/bert-base-romanian-ner | BERT | 124M | HF Space | Fine-tuned (NER on RONECv2) | Jan 2022 |
| snisioi/bert-legal-romanian-cased-v1 | BERT | 124M | - | Legal documents (MARCELLv2) | Jan 2022 |
| readerbench/jurBERT-base | BERT | 111M | PDF | Legal documents | Oct 2021 |
| readerbench/jurBERT-large | BERT | 337M | PDF | Legal documents | Oct 2021 |
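
Since these are masked language models, a quick way to try any of them is the fill-mask pipeline. The snippet below is a minimal sketch (the Romanian prompt is just an example); it reads the mask token from the tokenizer, since the token can differ between models:

```python
from transformers import pipeline

# swap in any MLM from the table above
fill_mask = pipeline("fill-mask", model="dumitrescustefan/bert-base-romanian-cased-v1")

# use the tokenizer's own mask token, as it may differ between models
mask = fill_mask.tokenizer.mask_token
for prediction in fill_mask(f"București este {mask} României."):
    print(prediction["token_str"], round(prediction["score"], 3))
```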

## Generative Language Models (CLMs)

| Model | Type | Size | Article/Citation/Source | Pre-trained / Fine-tuned | Release Date |
|---|---|---|---|---|---|
| dumitrescustefan/gpt-neo-romanian-780m | GPT-Neo | 780M | not yet / HF Space | Pre-trained | Sep 2022 |
| readerbench/RoGPT2-base | GPT2 | 124M | PDF | Pre-trained | Jul 2021 |
| readerbench/RoGPT2-medium | GPT2 | 354M | PDF | Pre-trained | Jul 2021 |
| readerbench/RoGPT2-large | GPT2 | 774M | PDF | Pre-trained | Jul 2021 |

NEW: Check out this HF Space to play with Romanian generative models: https://huggingface.co/spaces/dumitrescustefan/romanian-text-generation

## Model evaluation

Models are evaluated using the public Colab script available here. All reported results are the average of 5 runs with the same parameters. For larger models, where possible, a larger batch size was simulated by accumulating gradients, so that all models have the same effective batch size. Only standard models (not fine-tuned for a particular task) that fit in 16GB of RAM are evaluated.
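
For reference, gradient accumulation boils down to the pattern below. This is a minimal, self-contained PyTorch sketch with a toy model and data, for illustration only; the actual Colab script will differ:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# toy stand-ins for the real model and data, for illustration only
model = nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(torch.randn(32, 10), torch.randint(0, 2, (32,))), batch_size=4)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 4  # micro-batch of 4 x 4 steps = effective batch size of 16

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    (loss / accumulation_steps).backward()  # scale so the accumulated gradient is an average
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()  # one update per accumulation window
        optimizer.zero_grad()
```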

The tests cover the following fields and, for brevity, we select a single metric from each field:

* **Named Entity Recognition**: on RONECv2 we measure the strict match score on the test set. A model must correctly detect whether a word is an entity and tag it with the correct class.
* **Part of Speech Tagging**: on ro-pos-tagger we measure the UPOS F1 score on the test set. This test should reveal how well a model understands the language's structure.
* **Semantic Textual Similarity**: on RO-STS we measure the Pearson correlation coefficient on the test set. Given two sentences, the model must predict how semantically similar they are, as a continuous score. This test should highlight how well a model can embed the meaning of a sentence.
* **Emotion Detection**: on REDv2, an emotion detection dataset of Romanian tweets, we measure the Hamming loss on the test set in the classification setting (lower is better). This test should show how well a model can "understand" emotions in short texts.
* **Perplexity**: on wiki-ro's test split, we measure the perplexity of the CLMs only, with a stride of 512 and a batch size of 4 (a sketch follows below).
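
As a sketch of the perplexity measurement, here is the standard sliding-window evaluation with a stride of 512. The model name and text below are placeholders; the actual script uses the wiki-ro test split and may differ in details:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# placeholders: any CLM from the list works; the real evaluation uses wiki-ro's test split
model_name = "readerbench/RoGPT2-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Aici vine textul de evaluat, concatenat într-un singur șir."
encodings = tokenizer(text, return_tensors="pt")
seq_len = encodings.input_ids.size(1)

max_length = 1024  # context size of the GPT2 models
stride = 512
nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    target_len = end - prev_end  # only score tokens not already scored in a previous window
    input_ids = encodings.input_ids[:, begin:end]
    target_ids = input_ids.clone()
    target_ids[:, :-target_len] = -100  # ignore the overlapping context in the loss
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * target_len)
    prev_end = end
    if end == seq_len:
        break

print("Perplexity:", torch.exp(torch.stack(nlls).sum() / end).item())
```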

### MLM model evaluation

| Model | Type | Size | NER / EM_strict | RO-STS / Pearson | ro-pos-tagger / UPOS F1 | REDv2 / Hamming loss (↓) |
|---|---|---|---|---|---|---|
| dumitrescustefan/bert-base-romanian-cased-v1 | BERT | 124M | 0.8815 | 0.7966 | 0.982 | 0.1039 |
| dumitrescustefan/bert-base-romanian-uncased-v1 | BERT | 124M | 0.8572 | 0.8149 | 0.9826 | 0.1038 |
| racai/distilbert-base-romanian-cased | DistilBERT | 81M | 0.8573 | 0.7285 | 0.9637 | 0.1119 |
| readerbench/RoBERT-small | BERT | 19M | 0.8512 | 0.7827 | 0.9794 | 0.1085 |
| readerbench/RoBERT-base | BERT | 114M | 0.8768 | 0.8102 | 0.9819 | 0.1041 |

### CLM model evaluation

| Model | Type | Size | NER / EM_strict | RO-STS / Pearson | ro-pos-tagger / UPOS F1 | REDv2 / Hamming loss (↓) | Perplexity (↓) |
|---|---|---|---|---|---|---|---|
| readerbench/RoGPT2-base | GPT2 | 124M | 0.6865 | 0.7963 | 0.9009 | 0.1068 | 52.34 |
| readerbench/RoGPT2-medium | GPT2 | 354M | 0.7123 | 0.7979 | 0.9098 | 0.114 | 31.26 |

## What you can do with these models

Using HuggingFace's Transformers library, instantiate a model, replacing the model name as necessary, then use an appropriate model head depending on your task. Here are a few examples:

### Get token embeddings
```python
from transformers import AutoTokenizer, AutoModel
import torch

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# tokenize a sentence and run it through the model
input_ids = tokenizer.encode("Acesta este un test.", add_special_tokens=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids)

# get the encoding: the last hidden state is the first element of the output
last_hidden_states = outputs[0]  # shape: (1, sequence_length, hidden_size)
```
* For `dumitrescustefan/*` models, remember to correct the ș/ț diacritics before feeding text to the model. These models were trained only with the correct, comma-below diacritics; the cedilla variants ş and ţ are treated as UNKs and decrease overall performance:

```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```
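
If you need a single vector per sentence rather than per-token embeddings, a common convention (not something this repo prescribes) is to mean-pool the token embeddings from the example above:

```python
# average the token embeddings from last_hidden_states into one sentence vector
# (a common convention; other pooling strategies exist)
sentence_embedding = last_hidden_states.mean(dim=1)  # shape: (1, hidden_size)
```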

### Write text with generative models

Give a prompt to a generative model and let it write:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# load a generative model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/gpt-neo-romanian-125m")
model = AutoModelForCausalLM.from_pretrained("dumitrescustefan/gpt-neo-romanian-125m")

# encode the prompt
input_ids = tokenizer.encode("Cine a fost Mihai Eminescu? A fost", return_tensors="pt")

# sample a continuation of up to 128 tokens
text = model.generate(input_ids, max_length=128, do_sample=True, no_repeat_ngram_size=2, top_k=50, top_p=0.9)

print(tokenizer.decode(text[0], skip_special_tokens=True))
```
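
For reference on the parameters: `do_sample=True` switches from greedy decoding to sampling, `top_k=50` restricts each step to the 50 most likely tokens, `top_p=0.9` further restricts sampling to the smallest set of tokens whose cumulative probability exceeds 0.9 (nucleus sampling), and `no_repeat_ngram_size=2` prevents the model from repeating any 2-gram.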

P.S. You can test all generative models here: https://huggingface.co/spaces/dumitrescustefan/romanian-text-generation

## Final note

* While this repo started back in 2020 as an in-depth look at a single transformer model, with the express hope that more models would be added quickly, it turned out that training a good model is not that easy: it takes a lot of effort to curate the data and to get access to sufficient compute power. So, rather than listing just a couple of models, I feel it makes more impact to list all the Romanian-only models I could find that have a minimal level of performance/documentation. Here you go :)
* This repo used to contain code to download and clean a Romanian corpus. I have removed that part, as OSCAR is now offered on HuggingFace (new version) and OPUS's API is no longer working as it should (some manual filtering is now required, not to mention that new resources are added constantly), so maintaining this code is not really feasible.
* Please contribute to this repo with new Romanian models you might find, or with citations or updates to existing models.