Pre-trained fastText, word2vec, GloVe embeddings for the Armenian language and datasets for their intrinsic and extrinsic evaluation

tsolakghukasyan/word-embeddings-eval-hy

 
 

Word Embeddings for the Armenian Language: Intrinsic and Extrinsic Evaluation

Pretrained Embeddings

We release pre-trained word embeddings:

  • 200-dimensional GloVe vectors (.text);
  • 300-dimensional CBOW (.text) and SkipGram (.text) vectors;
  • 200-dimensional fastText vectors (.text, .bin), trained using the SkipGram architecture with character n-grams up to length 3.
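As a rough sketch of how the `.text` files might be read, the snippet below parses a plain-text vector file with one word per line followed by its components. It assumes (this is not confirmed by the repository) that the word2vec and fastText files start with a `<count> <dim>` header line while the GloVe file does not, so the header is skipped only when present. The function name `load_text_vectors` and the sample data are hypothetical.

```python
import io


def load_text_vectors(stream, limit=None):
    """Read 'word v1 v2 ... vd' lines from a text stream into a dict."""
    vectors = {}
    for line_no, line in enumerate(stream):
        parts = line.rstrip().split(" ")
        # Assumed layout: an optional "<count> <dim>" header on the first line.
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


# Tiny in-memory stand-in for a real vector file (made-up values).
sample = "2 3\nբարև 0.1 0.2 0.3\nաշխարհ 0.4 0.5 0.6\n"
demo_vectors = load_text_vectors(io.StringIO(sample))
print(len(demo_vectors), demo_vectors["բարև"])
```

For a downloaded file, the same function can be given a file object opened with `encoding="utf-8"`; the `limit` argument keeps only the first entries, which is handy for quick experiments.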

The training data for these models was collected from various sources:

a. Wikipedia;
b. fiction texts taken from the open part of the EANC corpus;
c. HC Corpora containing blogs and news articles collected by Hans Christensen from public sources in 2011;
d. the digitized and reviewed part of the Armenian Soviet Encyclopedia (as of February 2018) taken from Wikisource;
e. texts from news websites on the following topics: economics, events, art, sports, law, politics, blogs and interviews.

The texts were preprocessed by lowercasing all tokens and removing punctuation and digits. The final dataset contained 90.5 million tokens.
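The preprocessing described above could look roughly like the following. This is a minimal sketch, not the authors' script: it assumes whitespace tokenization and treats every character in a Unicode punctuation (`P*`) or number (`N*`) category as removable, which covers Armenian marks such as `։` as well as ASCII punctuation. The function name `preprocess` is hypothetical.

```python
import unicodedata


def preprocess(text):
    """Lowercase the text, drop punctuation and digits, and split on whitespace."""
    cleaned = []
    for ch in text.lower():
        category = unicodedata.category(ch)
        # Assumption: removed characters become spaces so neighbouring words stay apart.
        if category.startswith("P") or category.startswith("N"):
            cleaned.append(" ")
        else:
            cleaned.append(ch)
    return "".join(cleaned).split()


print(preprocess("Բարև, Աշխարհ 2018։"))
```

Replacing removed characters with spaces rather than deleting them is a design choice here; deleting them instead would merge tokens such as `word1,word2` into one.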

Word Analogy Task

In addition, we publish an adaptation of the word analogy task (Mikolov et al., 2013a) for the Armenian language to serve as a benchmark for intrinsic evaluation of vectors. The task contains 5 semantic and 8 syntactic sections, with 15,646 analogy questions in total.
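To illustrate how an analogy question "a is to b as c is to ?" is typically scored, here is a small sketch of the vector-offset method (often called 3CosAdd): the answer is the vocabulary word whose vector is closest in cosine similarity to `b - a + c`, excluding the three question words. The source does not state the exact scoring procedure used, so treat this as an assumption. The toy vectors and the names `solve_analogy` and `analogy_accuracy` are hypothetical, and skipping questions with out-of-vocabulary words is one of several possible conventions.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def solve_analogy(vectors, a, b, c):
    """Return the word closest in cosine similarity to b - a + c, excluding a, b, c."""
    va, vb, vc = (normalize(vectors[w]) for w in (a, b, c))
    target = normalize([y - x + z for x, y, z in zip(va, vb, vc)])
    best_word, best_score = None, -math.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = sum(p * q for p, q in zip(normalize(vec), target))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly.

    Assumption: questions containing out-of-vocabulary words are skipped.
    """
    answered = correct = 0
    for a, b, c, expected in questions:
        if not all(w in vectors for w in (a, b, c, expected)):
            continue
        answered += 1
        correct += solve_analogy(vectors, a, b, c) == expected
    return correct / answered if answered else 0.0


# Made-up 3-dimensional vectors, for illustration only.
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [1.0, 1.0, 0.0],
}
print(solve_analogy(toy, "man", "woman", "king"))
```

The linear scan over the vocabulary is fine for a toy example; with the full pre-trained vocabularies a matrix-based implementation would be much faster.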

News Texts Dataset

For extrinsic evaluation of vectors in a classification task, we release a dataset of over 12,000 news articles from iLur.am, categorized into 7 classes: sport, politics, weather, economy, accidents, art, society. The articles are split into train (2242k tokens) and test (425k tokens) sets.
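One common baseline for using word vectors in text classification is to represent each article as the average of its word vectors and feed that to a classifier. The sketch below shows only the averaging step. It is not necessarily the setup used in the paper, and the function name `document_vector` and the toy vectors are hypothetical.

```python
def document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; unknown tokens are ignored."""
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vec = vectors.get(token)
        if vec is None:
            continue  # out-of-vocabulary token
        for i, x in enumerate(vec):
            total[i] += x
        known += 1
    # A document with no known tokens maps to the zero vector.
    return [x / known for x in total] if known else total


# Made-up 2-dimensional vectors, for illustration only.
toy_vectors = {"a": [1.0, 3.0], "b": [3.0, 5.0]}
print(document_vector(["a", "b", "zzz"], toy_vectors, 2))
```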

For more details, refer to the paper.
