Use pretrained opusmt models #192
Draft: eu9ene wants to merge 17 commits into main from opusmt2
…her changes) (#117)

* Integrated Tatoeba-Challenge models as part of the firefox-translations-training pipeline:
  - Added download scripts and rules for downloading Tatoeba-Challenge data and models.
  - Modified training rules to accept downloaded Tatoeba-Challenge models as teachers and backward models.
  - Modified containerization to include conda environments inside the container (to abide by CSC's conda deprecation).
  - Added subword segmentation rules to marian-specific rules (since the default pipeline uses Marian's integrated sentencepiece support and Tatoeba-Challenge models don't).

  NOTE: The pipeline is still a work in progress, and it may fail for some Tatoeba-Challenge models due to subtle differences in the model make-up.
* reduced workspace, since Marian crashes training with larger workspaces (this might be fixed in newer marian versions)
* Update README.md
  Added note about changing CSC account
* Update config.opusmt.yml
  Fixed opusmt-teacher value to URL as it should be
* added target language token addition for multilingual models
* new test config for multilingual models
* fixed data language pair reverse with tatoeba data
* added config parameter for pretrained teacher model (only pretrained models using marian sentencepiece integration)
* Update flores.sh
  Fixed Swahili code in Flores importer
* Working on using multiple teacher models, not ready for action yet
* added profiles for csc mahti
* Update README.md
* multiteach additions
* more multiteacher changes
* multiple teachers added, monolingual src fixed
* fixed vocabs with multiteacher, other minor fixes
* fixed dummy mono src rules
* fixed model indices if no opus mt teachers
* added file for preinstalling snakemake envs (for easier containerization)
* added profiles for lumi, support for amd gpus, fixing the broken non-opus-mt training pipeline
* both train from scratch and opus-mt teacher should work now
* added separate compile script for browsermt marian
* new marian-dev submodule version (old one did not work with fp16 and opus models), cuda dirs and root specified in Snakefile if not in config, new makefile targets
* updated lumi profiles with automatic paths and energy monitoring
* fixing bicleaner-ai (model repository link changed), some more energy use monitoring additions
* updated bicleaner-ai env, the old one did not work for some reason
* added langid file to bicleaner-ai env also
* Update bicleaner-ai.yml
  Added tensorflow-rocm to bicleaner-ai env to get it working on lumi
* lumi slurm fixes and bicleaner-ai bug fixing
* Update README.md
  Added instructions for using Snakemake without non-containerized conda installation.
* Update README.md
  Formatting changes.
* updated mtdata in base env
* updated container to match envs
* added env variables required by new clean mono
* added separate bicleaner-ai env for lumi
* added lumi bicleaner env
* added tensorflow to bicleaner-ai env
* fixed bicleaner-ai script bug and added a missing argument for train_spm
* singularity fixes: kenlm installation, added hunspell dict download, edited local-container profile to work with current Snakefile setup

---------

Co-authored-by: Tommi Nieminen <[email protected]>
Co-authored-by: Tommi Nieminen <[email protected]>
Co-authored-by: Tommi Nieminen <[email protected]>
# Conflicts:
#	Makefile
#	pipeline/bicleaner/packs.py
#	pipeline/cefilter/score.sh
#	pipeline/translate/collect.sh
#	pipeline/translate/merge-corpus.sh
# Conflicts:
#	Snakefile
#	pipeline/train/spm-vocab.sh
#	taskcluster/ci/tests/kind.yml
#	taskcluster/translations_taskgraph/parameters.py

# Conflicts:
#	Snakefile
Related to #180.
This PR includes only changes from the GreenNLP fork + compatibility fixes. Full integration with TaskCluster is out of scope for this PR.
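
For readers unfamiliar with the "target language token addition for multilingual models" item in the commit log: multilingual OPUS-MT models conventionally expect each source sentence to start with a `>>xxx<<` token naming the target language. The snippet below is only a minimal sketch of that idea, not the code from this PR; the function name `add_target_token` and the language code `fin` are hypothetical, and the actual pipeline may apply the token at a different stage (for example after subword segmentation).

```python
def add_target_token(lines, target_lang):
    """Prefix each source sentence with a >>lang<< target-language token.

    Hypothetical helper for illustration; not taken from the PR.
    """
    token = f">>{target_lang}<<"
    prefixed = []
    for line in lines:
        text = line.strip()
        # Skip sentences that already carry the token so the
        # function is safe to run twice on the same data.
        if text.startswith(token):
            prefixed.append(text)
        else:
            prefixed.append(f"{token} {text}")
    return prefixed


if __name__ == "__main__":
    for sentence in add_target_token(["Hello world", "Good morning"], "fin"):
        print(sentence)
```

Making the helper idempotent is a deliberate choice here: corpus preparation steps are often re-run, and a doubled token would silently degrade translations from a multilingual teacher.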