Add evaluation in CI (#8)
- move evaluation results from 'firefox-translations-evaluation' repo
- run evaluation for PRs inside this repo
- automatically commit evaluation results
eu9ene authored Jul 26, 2021
1 parent 6a291ab commit 71b77fc
Showing 232 changed files with 670 additions and 13 deletions.
72 changes: 61 additions & 11 deletions .circleci/config.yml
@@ -6,23 +6,73 @@ jobs:
      - image: gcr.io/google.com/cloudsdktool/cloud-sdk:348.0.0
    working_directory: ~/mozilla/firefox-translations-models
    steps:
      - run: |
          curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
          apt-get install git-lfs
          git lfs install
      - run:
          name: Installing git lfs
          command: |
            curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
            apt-get install git-lfs
            git lfs install
      - checkout
      - run: |
          echo ${GCLOUD_SERVICE_KEY} | gcloud auth activate-service-account --key-file=-
          gcloud --quiet config set project ${GOOGLE_PROJECT_ID}
          gzip -dr */*/*.gz
          gsutil -m cp -rZn prod/* gs://bergamot-models-sandbox/${CIRCLE_TAG}/
          gsutil -m cp -rZn dev/* gs://bergamot-models-sandbox/${CIRCLE_TAG}/
      - run:
          name: Uploading to GCS
          command: |
            bash scripts/upload.sh
  evaluate:
    machine:
      image: ubuntu-2004:202104-01
    resource_class: xlarge
    working_directory: ~/mozilla/firefox-translations-models
    steps:
      - add_ssh_keys:
          fingerprints:
            - "11:60:82:e2:71:39:67:44:07:4c:16:8f:3d:89:6d:db"
      - run:
          name: Installing git lfs
          command: |
            sudo apt-get update
            sudo apt-get install -y git-lfs
            sudo git lfs install
      - checkout
      - run:
          name: Running evaluation
          command: |
            bash scripts/update-results.sh
      - run:
          name: Showing results
          command: |
            git add evaluation/*/*/*.bleu
            git --no-pager diff --staged evaluation/*/*/*.bleu
      - run:
          name: Pushing results
          command: |
            git config user.email "ci-models-evaluation@firefox-translations"
            git config user.name "CircleCI evaluation job"
            git add evaluation/*/*/*.bleu
            git add evaluation/*/img/*.png
            git add evaluation/*/results.md
            if [[ $(git status --porcelain) ]]; then
              echo "### Commiting results"
              git commit -m "Update evaluation results [skip ci]"
              git push --set-upstream origin "$CIRCLE_BRANCH"
            else
              echo "### Nothing to commit"
            fi
workflows:
  version: 2
  ci:
    jobs:
      - deploy:
          filters:
            branches:
              ignore: /.*/
            tags:
              only: /\d*\.\d*\.\d*/
      - evaluate:
          filters:
            branches:
              # Forked pull requests have CIRCLE_BRANCH set to pull/XXX
              ignore:
                - /pull\/[0-9]+/
                - main
131 changes: 131 additions & 0 deletions .gitignore
@@ -0,0 +1,131 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

.idea
11 changes: 9 additions & 2 deletions README.md
@@ -3,10 +3,17 @@ CPU-optimized NMT models for [Firefox Translations](https://github.com/mozilla-e

The model files are hosted using [Git LFS](https://docs.github.com/en/github/managing-large-files/versioning-large-files/about-git-large-file-storage).

[prod](models/prod) - production quality models

**prod** - production quality models
[dev](models/dev) - test models under development (can be of low quality or speed)

**dev** - test models under development (can be of low quality or speed)
# Evaluation results

Quality evaluation is done with the [firefox-translations-evaluation]() tool.

[prod](evaluation/prod/results.md)

[dev](evaluation/dev/results.md)


# Currently supported Languages
Binary file added evaluation/dev/img/avg.png
Binary file added evaluation/dev/img/ru-en.png
76 changes: 76 additions & 0 deletions evaluation/dev/results.md
@@ -0,0 +1,76 @@
# What is BLEU

[BLEU (BiLingual Evaluation Understudy)](https://en.wikipedia.org/wiki/BLEU) is a metric for automatically evaluating machine-translated text. The BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high quality reference translations. A value of 0 means that the machine-translated output has no overlap with the reference translation (low quality) while a value of 1 means there is perfect overlap with the reference translations (high quality).

It has been shown that BLEU scores correlate well with human judgment of translation quality. Note that even human translators do not achieve a perfect score of 1.0.

BLEU scores are expressed as a percentage rather than a decimal between 0 and 1.
Trying to compare BLEU scores across different corpora and languages is strongly discouraged. Even comparing BLEU scores for the same corpus but with different numbers of reference translations can be highly misleading.

However, as a rough guideline, the following interpretation of BLEU scores (expressed as percentages rather than decimals) might be helpful.

BLEU Score | Interpretation
--- | ---
< 10 | Almost useless
10 - 19 | Hard to get the gist
20 - 29 | The gist is clear, but has significant grammatical errors
30 - 40 | Understandable to good translations
40 - 50 | High quality translations
50 - 60 | Very high quality, adequate, and fluent translations
\> 60 | Quality often better than human

[More mathematical details](https://cloud.google.com/translate/automl/docs/evaluate#the_mathematical_details)

Source: https://cloud.google.com/translate/automl/docs/evaluate#bleu
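
For reference, the standard corpus-level BLEU definition (n-gram order N, typically 4, with uniform weights w_n = 1/N) combines the modified n-gram precisions p_n with a brevity penalty BP:

```latex
BLEU = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
BP =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

where c is the total length of the system output and r is the effective reference length.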


BLEU is the most popular benchmark in academia, so using BLEU also allows us to compare our results with research papers and competitions (see the [Conference on Machine Translation (WMT)](http://statmt.org/wmt21/)).

Read [this article](https://www.rws.com/blog/understanding-mt-quality-bleu-scores/) to better understand what BLEU is and why it is not perfect.

# What are these benchmarks

## Translators

1. **bergamot** - uses the compiled [bergamot-translator](https://github.com/mozilla/bergamot-translator) (a wrapper for marian that is used by the Firefox Translations web extension)
2. **marian** - uses the compiled [marian](https://github.com/marian-nmt/marian-dev) (the translation engine bergamot-translator is based on)
3. **google** - uses the Google Translation [API](https://cloud.google.com/translate)
4. **microsoft** - uses the Azure Cognitive Services Translator [API](https://azure.microsoft.com/en-us/services/cognitive-services/translator/)

The translation quality of Marian and Bergamot is expected to be very similar.

## Method

We use official WMT ([Conference on Machine Translation](http://statmt.org/wmt21/)) parallel datasets. Available datasets are discovered automatically based on the language pair.

We translate from the source to the target language using one of the four translation systems, compare the result with the dataset reference, and calculate a BLEU score.

Evaluation is done with the [SacreBLEU](https://github.com/mjpost/sacrebleu) tool, which is reliable and widely used in the academic world.
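
As a rough illustration of what a single measurement looks like, here is a minimal sketch using SacreBLEU's Python API; the file names are hypothetical, and the real pipeline is driven by the firefox-translations-evaluation scripts rather than by this snippet.

```python
# Minimal sketch (not the actual pipeline code): score one system output
# against the reference translations with SacreBLEU's Python API.
import sacrebleu

# Hypothetical file names: one sentence per line, system output and reference aligned.
with open("wmt19.bergamot.en") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("wmt19.ref.en") as f:
    references = [line.rstrip("\n") for line in f]

# corpus_bleu takes the hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(round(bleu.score, 2))  # corpus-level BLEU on the 0-100 scale
```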

Both absolute and relative differences in BLEU scores between Bergamot and other systems are reported.
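
For example, the google column for wmt19 in the ru-en table below reads "42.40 (+5.40, +14.59%)". The snippet below just reproduces that arithmetic from the two raw scores; it is an illustration, not part of the evaluation tool.

```python
# How the "(+absolute, +relative%)" annotations relate to the raw BLEU scores,
# using the wmt19 ru-en numbers from the table below.
bergamot = 37.00   # bergamot BLEU on wmt19 ru-en
google = 42.40     # google BLEU on wmt19 ru-en

absolute = google - bergamot            # +5.40 BLEU points
relative = absolute / bergamot * 100    # +14.59% relative to bergamot
print(f"{google:.2f} ({absolute:+.2f}, {relative:+.2f}%)")  # 42.40 (+5.40, +14.59%)
```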

# Evaluation results

`avg` = average over all datasets
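
As a quick sanity check of how the `avg` row relates to the per-dataset table further down (a sketch of the arithmetic, not the tool's code):

```python
# bergamot ru-en scores from the per-dataset table below
# (mtedx_test, wmt19, wmt17, wmt14, wmt15, wmt16, wmt13, wmt18, wmt20)
scores = [23.50, 37.00, 33.80, 34.60, 30.60, 29.60, 26.90, 28.80, 33.40]
print(round(sum(scores) / len(scores), 2))  # 30.91 - the bergamot cell in the avg table
```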



## avg

| Translator/Dataset | ru-en |
| --- | --- |
| bergamot | 30.91 |
| google | 36.59 (+5.68, +18.37%) |
| microsoft | 37.03 (+6.12, +19.81%) |

![Results](img/avg.png)

## ru-en

| Translator/Dataset | mtedx_test | wmt19 | wmt17 | wmt14 | wmt15 | wmt16 | wmt13 | wmt18 | wmt20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bergamot | 23.50 | 37.00 | 33.80 | 34.60 | 30.60 | 29.60 | 26.90 | 28.80 | 33.40 |
| google | 25.00 (+1.50, +6.38%) | 42.40 (+5.40, +14.59%) | 41.50 (+7.70, +22.78%) | 41.20 (+6.60, +19.08%) | 37.50 (+6.90, +22.55%) | 36.60 (+7.00, +23.65%) | 31.40 (+4.50, +16.73%) | 36.00 (+7.20, +25.00%) | 37.70 (+4.30, +12.87%) |
| microsoft | 26.10 (+2.60, +11.06%) | 42.60 (+5.60, +15.14%) | 41.60 (+7.80, +23.08%) | 41.70 (+7.10, +20.52%) | 37.80 (+7.20, +23.53%) | 37.60 (+8.00, +27.03%) | 31.20 (+4.30, +15.99%) | 36.90 (+8.10, +28.12%) | 37.80 (+4.40, +13.17%) |

![Results](img/ru-en.png)
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/mtedx_test.bergamot.en.bleu
@@ -0,0 +1 @@
23.5
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/mtedx_test.google.en.bleu
@@ -0,0 +1 @@
25.0
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/mtedx_test.microsoft.en.bleu
@@ -0,0 +1 @@
26.1
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt13.bergamot.en.bleu
@@ -0,0 +1 @@
26.9
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt13.google.en.bleu
@@ -0,0 +1 @@
31.4
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt13.microsoft.en.bleu
@@ -0,0 +1 @@
31.2
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt14.bergamot.en.bleu
@@ -0,0 +1 @@
34.6
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt14.google.en.bleu
@@ -0,0 +1 @@
41.2
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt14.microsoft.en.bleu
@@ -0,0 +1 @@
41.7
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt15.bergamot.en.bleu
@@ -0,0 +1 @@
30.6
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt15.google.en.bleu
@@ -0,0 +1 @@
37.5
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt15.microsoft.en.bleu
@@ -0,0 +1 @@
37.8
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt16.bergamot.en.bleu
@@ -0,0 +1 @@
29.6
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt16.google.en.bleu
@@ -0,0 +1 @@
36.6
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt16.microsoft.en.bleu
@@ -0,0 +1 @@
37.6
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt17.bergamot.en.bleu
@@ -0,0 +1 @@
33.8
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt17.google.en.bleu
@@ -0,0 +1 @@
41.5
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt17.microsoft.en.bleu
@@ -0,0 +1 @@
41.6
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt18.bergamot.en.bleu
@@ -0,0 +1 @@
28.8
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt18.google.en.bleu
@@ -0,0 +1 @@
36.0
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt18.microsoft.en.bleu
@@ -0,0 +1 @@
36.9
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt19.bergamot.en.bleu
@@ -0,0 +1 @@
37.0
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt19.google.en.bleu
@@ -0,0 +1 @@
42.4
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt19.microsoft.en.bleu
@@ -0,0 +1 @@
42.6
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt20.bergamot.en.bleu
@@ -0,0 +1 @@
33.4
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt20.google.en.bleu
@@ -0,0 +1 @@
37.7
1 change: 1 addition & 0 deletions evaluation/dev/ru-en/wmt20.microsoft.en.bleu
@@ -0,0 +1 @@
37.8
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt08.bergamot.en.bleu
@@ -0,0 +1 @@
24.5
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt08.google.en.bleu
@@ -0,0 +1 @@
26.3
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt08.microsoft.en.bleu
@@ -0,0 +1 @@
26.4
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt09.bergamot.en.bleu
@@ -0,0 +1 @@
27.6
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt09.google.en.bleu
@@ -0,0 +1 @@
29.9
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt09.microsoft.en.bleu
@@ -0,0 +1 @@
29.6
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt10.bergamot.en.bleu
@@ -0,0 +1 @@
28.2
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt10.google.en.bleu
@@ -0,0 +1 @@
30.5
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt10.microsoft.en.bleu
@@ -0,0 +1 @@
30.7
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt11.bergamot.en.bleu
@@ -0,0 +1 @@
28.1
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt11.google.en.bleu
@@ -0,0 +1 @@
30.2
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt11.microsoft.en.bleu
@@ -0,0 +1 @@
30.9
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt12.bergamot.en.bleu
@@ -0,0 +1 @@
26.5
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt12.google.en.bleu
@@ -0,0 +1 @@
28.6
1 change: 1 addition & 0 deletions evaluation/prod/cs-en/wmt12.microsoft.en.bleu
@@ -0,0 +1 @@
29.7