Author: Tsolak Ghukasyan
Project advisor: Adam Mathias Bittlingmayer
Even in today's top libraries, lemmatization tools are still usually implemented with rules and lookup tables, which require linguistic knowledge of each language to build.
d-lemma is developing simple universal models for learning lemmatization, using only annotated text datasets and word embeddings.
d-lemma models support a growing set of languages: lemma-annotated UD treebanks and fastText embeddings are publicly available for over 60 different languages.
In this project, 6 different approaches were considered.
To put the evaluation of the developed learning models into perspective, 2 baseline approaches are used:

- **Identity baseline**
  The identity function, i.e. returning the input token unchanged as its lemma, serves as a weak baseline for the main models.
- **Most common lemma with identity backoff**
  Returning the most common lemma observed for a wordform in the training data serves as a stronger baseline for the developed models. This baseline backs off to identity for unknown words (see the sketch below).
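As a concrete illustration, here is a minimal sketch of the most-common-lemma baseline with identity backoff; the training data is assumed to be available as (form, lemma) pairs, and all names are hypothetical:

```python
from collections import Counter, defaultdict

def fit_most_common(pairs):
    """Count how often each lemma occurs for every wordform in the training data."""
    counts = defaultdict(Counter)
    for form, lemma in pairs:
        counts[form][lemma] += 1
    # keep only the single most frequent lemma per wordform
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}

def lemmatize_baseline(tokens, table):
    """Return the most common lemma for each token, backing off to identity for unknown words."""
    return [table.get(tok, tok) for tok in tokens]

table = fit_most_common([("was", "be"), ("was", "be"), ("cats", "cat")])
print(lemmatize_baseline(["was", "cats", "dlemma"], table))  # ['be', 'cat', 'dlemma']
```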
The 4 learning models are:

- **Linear regression**
  A linear regressor with a cosine proximity loss that, for each input token's embedding, tries to produce the embedding of its lemma. This lemmatizer does not use contextual information during prediction (see the first sketch below).
- **Regression with LSTM**
  A recurrent neural network with a single LSTM unit that receives the sequence of input tokens' embeddings and produces the embeddings of their lemmas (see the second sketch below).
- **Seq2seq**
  A word-level sequence-to-sequence model using LSTM cells. This model receives a sequence of tokens as input and produces the sequence of their lemmas.
- **Transformer**
  An encoder-decoder model based on the self-attention mechanism, introduced by Google in "Attention Is All You Need". Similar to seq2seq, it processes a sequence of input tokens and outputs the sequence of their lemmas.
Other model ideas were also considered, such as LSTM networks with softmax layers, but these were rejected because of their memory and performance requirements.
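The actual training code lives in the project's notebooks; purely as an illustration, the following Keras sketch shows a linear regressor trained with a cosine proximity loss, together with a nearest-neighbour lookup to map the predicted vector back to a lemma string. The use of Keras and gensim, the vector file name, and the decoding step are assumptions, not the project's confirmed implementation.

```python
import numpy as np
from tensorflow import keras
from gensim.models import KeyedVectors

DIM = 300  # fastText vectors are 300-dimensional

# Assumed: pre-trained fastText vectors in word2vec text format (path is a placeholder).
vectors = KeyedVectors.load_word2vec_format("cc.en.300.vec")

def embed(tokens):
    """Look up fastText embeddings, falling back to zero vectors for OOV tokens."""
    return np.array([vectors[t] if t in vectors else np.zeros(DIM) for t in tokens])

# A single dense layer mapping a token's embedding to its lemma's embedding.
model = keras.Sequential([
    keras.Input(shape=(DIM,)),
    keras.layers.Dense(DIM),
])
model.compile(optimizer="adam", loss=keras.losses.CosineSimilarity())

# X: embeddings of wordforms, Y: embeddings of their lemmas, both taken from the treebank.
# model.fit(X, Y, epochs=20, batch_size=32)

def predict_lemma(token):
    """Run the regressor and return the vocabulary word nearest to the predicted vector."""
    pred = model.predict(embed([token]), verbose=0)[0]
    return vectors.similar_by_vector(pred, topn=1)[0][0]
```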
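In the same spirit, a rough sketch of the LSTM regressor, which maps a padded sequence of token embeddings to a sequence of lemma embeddings; the maximum sentence length, layer size, and padding scheme are illustrative guesses:

```python
from tensorflow import keras

DIM = 300      # fastText embedding size
MAX_LEN = 50   # illustrative maximum sentence length (sentences padded/truncated to this)

# Sequence of token embeddings in, sequence of lemma embeddings out.
lstm_model = keras.Sequential([
    keras.Input(shape=(MAX_LEN, DIM)),
    keras.layers.Masking(mask_value=0.0),           # ignore zero-padded positions
    keras.layers.LSTM(DIM, return_sequences=True),  # one output vector per input token
])
lstm_model.compile(optimizer="adam", loss=keras.losses.CosineSimilarity())

# X: (n_sentences, MAX_LEN, DIM) token embeddings
# Y: (n_sentences, MAX_LEN, DIM) lemma embeddings
# lstm_model.fit(X, Y, epochs=20, batch_size=16)
```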
Two languages were selected for training and evaluation of the aforementioned models: English as a relatively low-morphology language and Finnish as a high-morphology language.
Since one of this project's goals is developing a lemmatization model for low-resource languages, the models were trained on only a 10,000-token subset of the respective UD treebanks, with another 2,000 tokens used for validation. Below are the evaluation results on 2,000-token test sets:
Results for English:
| Model       | Accuracy | BLEU  |
|-------------|----------|-------|
| identity    | 78.15%   | 0.579 |
| most common | 91.40%   | 0.773 |
| linear reg. | 87.55%   | 0.685 |
| LSTM        | 93.0%    | -     |
| transformer | -        | 0.439 |
Results for Finnish:
| Model       | Accuracy | BLEU  |
|-------------|----------|-------|
| identity    | 47.35%   | 0.128 |
| most common | 66.50%   | 0.285 |
| linear reg. | 73.15%   | 0.389 |
| LSTM        | 75.07%   | -     |
| transformer | -        | -     |
*Word-level seq2seq without attention did not produce any meaningful results.
Because the output of the transformer and seq2seq models is of variable length, it may contain a different number or order of tokens than the input, so a token-level accuracy score cannot be computed for them; only BLEU is reported.
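For reference, a corpus-level BLEU score over predicted lemma sequences can be computed with NLTK roughly as follows (whether the project used this exact tooling is an assumption):

```python
from nltk.translate.bleu_score import corpus_bleu

# Each hypothesis is a predicted lemma sequence; each reference list holds the gold sequence.
references = [
    [["i", "know", "he", "because", "he", "have", "attend", "my", "school", "."]],
    [["the", "cat", "be", "sit", "on", "the", "mat", "."]],
]
hypotheses = [
    ["i", "know", "he", "because", "he", "have", "attend", "my", "school", "."],
    ["the", "cat", "be", "sat", "on", "the", "mat", "."],
]
print(corpus_bleu(references, hypotheses))
```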
Sample output of the learned LSTM lemmatizer for English:
```python
>>> lemmatize("I knew him because he had attended my school .".split(' '))
['I', 'know', 'he', 'because', 'he', 'have', 'attend', 'my', 'school', '.']
```
The linear and LSTM regressors can be easily adapted to new languages. To train and evaluate a new model, use the `linear_models.ipynb` and `lstm_model.ipynb` Jupyter notebooks. All you need to do is set the paths to the CoNLL-U treebank and word embedding files at the beginning of the notebook (n.b. only the FORM and LEMMA columns of the treebank are used).
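If you prefer to prepare data outside the notebooks, the FORM and LEMMA columns of a CoNLL-U treebank can be extracted with a few lines of plain Python (the file path below is just a placeholder):

```python
def read_form_lemma(conllu_path):
    """Yield (form, lemma) pairs from a CoNLL-U treebank,
    skipping comments, multiword-token ranges and empty nodes."""
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # blank lines and sentence-level comments
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue  # multiword-token (e.g. 1-2) and empty-node (e.g. 1.1) lines
            yield cols[1], cols[2]  # FORM, LEMMA

pairs = list(read_form_lemma("en_ewt-ud-train.conllu"))
```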
The results clearly show that the more advanced deep learning models do not perform well on this task, the main reasons being the limited training data and the difficulty of hyperparameter tuning.
At the same time, the simple linear regression model demonstrates results very close to the strong baseline, and for Finnish it even outperforms it. Among the considered approaches, the highest accuracy was achieved by the LSTM network, which beats both baselines for both languages.
The regressors learn to lemmatize not only very common words such as 'are', 'got', 'was', etc., but also seem to learn certain relations (e.g. 'killed'-'kill', 'said'-'say', 'years'-'year'). In addition, these models demonstrate the capability to lemmatize unseen wordforms (e.g. 'submitted'-'submit', 'replacing'-'replace').
For further research of advanced deep learning approaches' efficiency, it could be useful to experiment with the following models:
- word-level seq2seq with attention
- char-level seq2seq with attention
- DeepMind's relation networks
It could also be useful to slice the evaluation metrics by word frequency or length, to understand how the approaches differ.
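As an example of such a slice, token-level accuracy can be bucketed by wordform length in a few lines (variable names are illustrative):

```python
from collections import defaultdict

def accuracy_by_length(tokens, gold_lemmas, predicted_lemmas):
    """Token-level accuracy bucketed by wordform length."""
    correct, total = defaultdict(int), defaultdict(int)
    for tok, gold, pred in zip(tokens, gold_lemmas, predicted_lemmas):
        total[len(tok)] += 1
        correct[len(tok)] += int(gold == pred)
    return {length: correct[length] / total[length] for length in sorted(total)}
```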
For training and evaluation:
- UD treebanks: universaldependencies.org
- Word embeddings:
  - for Finnish, the vectors trained on Common Crawl and Wikipedia: fasttext.cc/docs/en/crawl-vectors.html
  - for English, the vectors trained on Common Crawl (600B tokens): fasttext.cc/docs/en/english-vectors.html