Collection of scripts for training bert-based embedder for Russian<>English embeddings extraction

EvilFreelancer/enbeddrus

Enbeddrus - ENglish and RUSsian emBEDDer

This is a sentence-transformers model based on BERT (uncased). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

  • Parameters: 168 million
  • Layers: 12
  • Hidden Size: 768
  • Attention Heads: 12
  • Vocabulary Size: 119,547
  • Maximum Sequence Length: 512 tokens

The Enbeddrus model is designed to extract similar embeddings for comparable English and Russian phrases. It is based on the bert-base-multilingual-uncased model and was trained over 20 epochs on the following datasets:

The goal of this model is to generate identical or very similar embeddings regardless of whether the text is written in English or Russian.
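One way to check this goal on your own phrases is to embed an English sentence and its Russian translation and compare the two vectors by cosine similarity. The sketch below is illustrative only: it assumes the `sentence-transformers` package is installed and that the model can be loaded under the id `evilfreelancer/enbeddrus-v0.2` (the name used in the evaluation table). The helper names `cosine_similarity` and `cross_lingual_similarity` are hypothetical and not part of this repository.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cross_lingual_similarity(
    english: str,
    russian: str,
    model_name: str = "evilfreelancer/enbeddrus-v0.2",  # assumed model id
) -> float:
    """Embed both phrases with the same model and compare the vectors."""
    # Imported lazily so cosine_similarity works without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # encode() returns one embedding row per input sentence.
    en_vec, ru_vec = model.encode([english, russian])
    return cosine_similarity(en_vec.tolist(), ru_vec.tolist())
```

Calling `cross_lingual_similarity("How do I install Docker?", "Как установить Docker?")` should return a value close to 1.0 if the two phrases land near each other in the embedding space. The exact number depends on the model version and the phrases.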

A GGUF version of Enbeddrus is available via Ollama.

Evaluation test

The models were tested with encodechka.

| Name | evilfreelancer/enbeddrus-v0.1 | evilfreelancer/enbeddrus-v0.1-domain | evilfreelancer/enbeddrus-v0.2 |
|---|---|---|---|
| STSBTask | 0.6418501890569303 | 0.6418501890569303 | 0.6382642407246252 |
| ParaphraserTask | 0.5396186809125094 | 0.5396186809125094 | 0.5491558495250873 |
| XnliTask | 0.37045908183632736 | 0.37045908183632736 | 0.36666666666666664 |
| SentimentTask | 0.7306666666666667 | 0.7306666666666667 | 0.7246666666666667 |
| ToxicityTask | 0.8923319999999999 | 0.8923319999999999 | 0.894758 |
| InappropriatenessTask | 0.7092166782043772 | 0.7092166782043772 | 0.719323712657756 |
| IntentsTask | 0.7086 | 0.7162 | 0.7128 |
| IntentsXTask | 0.5116 | 0.46 | 0.5314 |
| FactRuTask | n/a | n/a | n/a |
| RudrTask | n/a | n/a | n/a |
| SpeedTask (cuda) | 4.313722451527913 | 4.339381853739421 | 4.251763025919597 |
| SpeedTask (cpu) | 34.0190052986145 | 34.990905125935875 | 34.441959857940674 |
