Use pretrained opusmt models #192

eu9ene · 2023-09-12T23:00:13Z

related to #180

This PR includes only changes from the GreenNLP fork + compatibility fixes. Full integration with TaskCluster is out of scope for this PR.

…her changes) (#117)

* Integrated Tatoeba-Challenge models as part of the firefox-translations-training pipeline:
  - Added download scripts and rules for downloading Tatoeba-Challenge data and models.
  - Modified training rules to accept downloaded Tatoeba-Challenge models as teachers and backward models.
  - Modified containerization to include conda environments inside the container (to abide by CSC's conda deprecation).
  - Added subword segmentation rules to marian-specific rules (since the default pipeline uses Marian's integrated sentencepiece support and Tatoeba-Challenge models don't)

  NOTE: The pipeline is still a work in progress, and it may fail for some Tatoeba-Challenge models due to subtle differences in the model make-up.
* reduced workspace, since Marian crashes training with larger workspaces (this might be fixed in newer marian versions)
* Update README.md: Added note about changing CSC account
* Update config.opusmt.yml: Fixed opusmt-teacher value to URL as it should be
* added target language token addition for multilingual models
* new test config for multilingual models
* fixed data language pair reverse with tatoeba data
* added config parameter for pretrained teacher model (only pretrained models using marian sentencepiece integration)
* Update flores.sh: Fixed Swahili code in Flores importer
* Working on using multiple teacher models, not ready for action yet
* added profiles for csc mahti
* Update README.md
* multiteach additions
* more multiteacher changes
* multiple teachers added, monolingual src fixed
* fixed vocabs with multiteacher, other minor fixes
* fixed dummy mono src rules
* fixed model indices if no opus mt teachers
* added file for preinstalling snakemake envs (for easier containerization)
* added profiles for lumi, support for amd gpus, fixing the broken non-opus-mt training pipeline
* both train from scratch and opus-mt teacher should work now
* added separate compile script for browsermt marian
* new marian-dev submodule version (old one did not work with fp16 and opus models), cuda dirs and root specified in Snakefile if not in config, new makefile targets
* updated lumi profiles with automatic paths and energy monitoring
* fixing bicleaner-ai (model repository link changed), some more energy use monitoring additions
* updated bicleaner-ai env, the old one did not work for some reason
* added langid file to bicleaner-ai env also
* Update bicleaner-ai.yml: Added tensorflow-rocm to bi-cleaner env to get it working on lumi
* lumi slurm fixes and bicleaner-ai bug fixing
* Update README.md: Added instructions for using Snakemake without non-containerized conda installation.
* Update README.md: Formatting changes.
* updated mtdata in base env
* updated container to match envs
* added env variables required by new clean mono
* added separate bicleaner-ai env for lumi
* added lumi bicleaner env
* added tensorflow to bicleaner-ai env
* fixed bicleaner-ai script bug and added a missing argument for train_spm
* singularity fixes: kenlm installation, added hunspell dict download, edited local-container profile to work with current Snakefile setup

---------

Co-authored-by: Tommi Nieminen <[email protected]>
Co-authored-by: Tommi Nieminen <[email protected]>
Co-authored-by: Tommi Nieminen <[email protected]>
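The commit list above mentions adding a target language token for multilingual models. A minimal sketch of that idea follows, assuming the `>>xxx<<` prefix convention used by multilingual OPUS-MT models. The function names `add_target_token` and `tag_corpus` are hypothetical and are not taken from this PR; the pipeline's actual scripts may handle this differently.

```python
from typing import Iterable, List


def add_target_token(line: str, target_lang: str) -> str:
    """Prefix a source sentence with a target-language token such as '>>fin<<'.

    Assumption: the multilingual teacher expects the token at the start of
    each source line, separated from the text by a single space.
    """
    token = f">>{target_lang}<<"
    stripped = line.strip()
    # Do not add the token a second time if the line is already tagged.
    if stripped.startswith(token):
        return stripped
    return f"{token} {stripped}"


def tag_corpus(lines: Iterable[str], target_lang: str) -> List[str]:
    """Apply add_target_token to every line of a source-side corpus."""
    return [add_target_token(line, target_lang) for line in lines]


if __name__ == "__main__":
    for tagged in tag_corpus(["Hello world.", ">>fin<< Already tagged."], "fin"):
        print(tagged)
```

The token would be applied to the source side only, before the text is passed to the teacher for translation; bilingual models do not need it.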

# Conflicts: # Makefile # pipeline/bicleaner/packs.py # pipeline/cefilter/score.sh # pipeline/translate/collect.sh # pipeline/translate/merge-corpus.sh

# Conflicts: # Snakefile # pipeline/train/spm-vocab.sh # taskcluster/ci/tests/kind.yml # taskcluster/translations_taskgraph/parameters.py

# Conflicts: # Snakefile

TommiNieminen and others added 9 commits September 12, 2023 15:37

Merge branch 'main' into opusmt2

f3b77e2

# Conflicts: # Makefile # pipeline/bicleaner/packs.py # pipeline/cefilter/score.sh # pipeline/translate/collect.sh # pipeline/translate/merge-corpus.sh

Move opusmt readme to a separate dir

ad77c32

Replace curl with wget

d04cee5

Rollback default decoding config

3b03f7e

Fix vocab default arg

d50bb99

Add cancel command

6cb582e

Fix default image

c9470fb

Revert example prod config

a31e347

eu9ene mentioned this pull request Sep 13, 2023

Include test run in CI #193

Closed

eu9ene added 7 commits September 13, 2023 15:05

Remove private repo from git modules

32bbd06

Include dry-run for more configs in CI

48cd5fc

Add a missing image

ad8ad16

Fix renaming

aff9c32

Run formatter

a0f1a86

TaskCluster compatibility fixes

da439bf

Merge branch 'main' into opusmt2

a36f61c

# Conflicts: # Snakefile # pipeline/train/spm-vocab.sh # taskcluster/ci/tests/kind.yml # taskcluster/translations_taskgraph/parameters.py

eu9ene mentioned this pull request Sep 26, 2023

Training Continuation - Use Opus models #213

Closed

Merge branch 'main' into opusmt2

737a6ac

# Conflicts: # Snakefile
