Installation | Installing dependencies and compiling | CPU | Takes ~1 hour
Data downloading | Downloads datasets, samples sentences | Network, Disk | Time depends on dataset size; sampling of huge mono datasets (100M+ sentences) is the most intensive operation.
Data cleaning | Basic preprocessing: dataset-specific, language-specific, rule-based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient, add it to [clean_parallel.py](/pipeline/clean/tools/clean_parallel.py).
Bicleaner | Filters noisy sentence pairs in a parallel corpus using [bicleaner](https://github.com/bitextor/bicleaner) or [bicleaner-ai](https://github.com/bitextor/bicleaner-ai) depending on available language packs. | CPU, GPU | If there are no pretrained language packs for bicleaner-ai, it uses bicleaner. If there are none for bicleaner either, this step is skipped. Cleaning thresholds are configurable per dataset, see [Dataset cleaning](#dataset-cleaning).
Merge and dedupe | Merges clean datasets and applies deduplication | CPU, Disk |
Training s2s | Trains a backward shallow s2s model, which is useful for back-translations and ce-filtering | GPU | Inspired by a [marian example](https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece).
Augmentation with back-translations | Translates a mono corpus combined from monolingual datasets in the target language using the shallow s2s model. | GPU | It is more useful for low-resource languages and can be skipped for others.
Training teacher | Trains an ensemble of big transformer models on the augmented dataset | GPU | You might want to adjust [early stopping](pipeline/train/configs/training/teacher.transformer.train.yml) or `after-epochs` parameters depending on dataset size.
Continue training teacher | Continues training the ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](pipeline/train/configs/training/teacher.transformer.train.yml) parameters depending on dataset size.
Translation by teacher | Translates the corpus and monolingual data combined from `MONO_DATASETS_SRC` using the teacher model (ensemble is not supported yet) | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up by launching the same scripts ([corpus](pipeline/translate/translate-corpus.sh), [mono](pipeline/translate/translate-mono.sh)) in parallel from another machine with access to the same network directory.
Cross-entropy filtering | Scores the translated corpus with the backward s2s model and removes the part of the corpus with the lowest scores to reduce noise | GPU, CPU, Disk | At this point we work with huge datasets, so data is copied to a local disk to make things faster.
Training alignments and shortlist | Trains alignments using [fast_align](https://github.com/clab/fast_align) and extracts a lexical shortlist using the [extract_lex](https://github.com/marian-nmt/extract-lex) tool | CPU, Disk | Some tools require uncompressed datasets on disk, and the datasets are huge at this point. Data is copied to a local disk to make things faster. Might take 100+GB of local disk depending on dataset size. Good CPU parallelization.
Training student | Trains a small transformer student model on the filtered data, using alignments | GPU |
Fine-tuning student | Fine-tunes the student model by emulating 8-bit GEMM during training | GPU | Converges very quickly and then degrades. It's quick, but you might want to reduce the early stopping threshold.
Quantization | Applies 8-bit quantization to the fine-tuned student model and evaluates on CPU | CPU | CPU threads must be set to 1 for this step.
Evaluation | Calculates metrics for all models (BLEU, chrF) using [SacreBLEU](https://github.com/mjpost/sacrebleu) | GPU | Uses the `datasets.test` configuration section. See the example below the table.
Export | Exports the trained model and shortlist to the [bergamot-translator](https://github.com/mozilla/bergamot-translator) format | |
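
The evaluation step runs as part of the pipeline, but the scores can also be spot-checked by hand. A minimal sketch, assuming sacreBLEU is installed; the file names are placeholders, not outputs of the pipeline:

```bash
# Hypothetical manual check of a model's translations with sacreBLEU;
# file names are placeholders, the pipeline writes its own evaluation reports.
sacrebleu ref.wmt20.en -m bleu chrf < translations.wmt20.en
```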
## Dataset importers
Dataset importers can be used in the `datasets` sections of the experiment config.
To add a new importer, add a shell script to [corpus](pipeline/data/importers/corpus) or [mono]() that is named `<prefix>.sh`
and accepts the same parameters as the other scripts from the same folder.
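
For illustration only, a new corpus importer might look roughly like the sketch below. The argument order, the `my_corpus` name and the download URL are assumptions, so mirror the exact interface of an existing script from the same folder:

```bash
#!/bin/bash
# Hypothetical importer: pipeline/data/importers/corpus/my_corpus.sh
# The argument layout below is an assumption; copy it from an existing importer.
set -x
set -euo pipefail

src=$1            # source language code, e.g. ru
trg=$2            # target language code, e.g. en
output_prefix=$3  # downstream steps read <output_prefix>.<lang>.gz
dataset=$4        # dataset name passed from the config

# Download each side of the corpus as a gzipped, one-sentence-per-line file.
wget -qO- "https://example.com/${dataset}.${src}.txt" | gzip -c >"${output_prefix}.${src}.gz"
wget -qO- "https://example.com/${dataset}.${trg}.txt" | gzip -c >"${output_prefix}.${trg}.gz"
```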
## Dataset fixing
Some datasets require fixes like detokenization. Dataset- and language-specific fixes are implemented in [pipeline/clean/fixes](pipeline/clean/fixes).
Naming convention:
- `<dataset_name>.sh` for parallel dataset cleaning
- `<dataset_name>.<lang>.sh` for language-specific cleaning of a parallel or monolingual dataset
- `/` in the dataset name should be replaced with `_`
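
As an illustration, a rule-based fix can be a small stdin-to-stdout filter like the sketch below. The calling convention and the sed rules are assumptions; check an existing script in [pipeline/clean/fixes](pipeline/clean/fixes) for the exact interface:

```bash
#!/bin/bash
# Hypothetical fix: pipeline/clean/fixes/mtdata_neulab_tedtalksv1_train.en.sh
# Assumed interface: read the corpus from stdin, write the fixed corpus to stdout.
set -euo pipefail

# Example rule-based cleanup: undo simple tokenization artifacts left in the data.
sed -e 's/ @-@ /-/g' \
    -e 's/ \([,.;:!?]\)/\1/g'
```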
## Dataset cleaning
Some parallel datasets require more aggressive filtering.
Dataset-specific bicleaner thresholds can be set in the config. Example:
```yaml
experiment:
  ...
  bicleaner:
    default-threshold: 0.5
    dataset-thresholds:
      mtdata_neulab_tedtalksv1_train: 0.6
```
## Utilities
### Tensorboard
To see training graphs, run Tensorboard:
```
make install-tensorboard
make tensorboard
```
Then forward port 6006 to your local machine.
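
If training runs on a remote machine, one option is an SSH tunnel (the hostname is a placeholder):

```bash
# Forward Tensorboard's default port 6006 from the training machine
# ("user@training-server" is a placeholder) to your local machine:
ssh -N -L 6006:localhost:6006 user@training-server
# then open http://localhost:6006 in a local browser
```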
## Directory structure

```
├ data
│  └ ru-en
│     └ test
│        ├ original
│        │  ├ corpus
│        │  │  ├ mtdata_JW300.en.gz
│        │  │  └ mtdata_JW300.ru.gz
│        │  ├ devset
│        │  │  ├ flores_dev.en.gz
│        │  │  └ flores_dev.ru.gz
│        │  ├ eval
│        │  │  ├ sacrebleu_wmt20.en.gz
│        │  │  └ sacrebleu_wmt20.ru.gz
│        │  ├ mono
│        │  │  ├ news-crawl_news.2020.ru.gz
│        │  │  └ news-crawl_news.2020.en.gz
│        │  ├ devset.ru.gz
│        │  └ devset.en.gz
│        ├ clean
│        │  ├ corpus
│        │  │  ├ mtdata_JW300.en.gz
│        │  │  └ mtdata_JW300.ru.gz
│        │  ├ mono
│        │  │  ├ news-crawl_news.2020.ru.gz
│        │  │  └ news-crawl_news.2020.en.gz
│        │  ├ mono.ru.gz
│        │  └ mono.en.gz
│        ├ biclean
│        │  ├ corpus
│        │  │  ├ mtdata_JW300.en.gz
│        │  │  └ mtdata_JW300.ru.gz
│        │  ├ corpus.ru.gz
│        │  └ corpus.en.gz
│        ├ translated
│        │  ├ mono.ru.gz
│        │  └ mono.en.gz
│        ├ augmented
│        │  ├ corpus.ru.gz
│        │  └ corpus.en.gz
│        ├ alignment
│        │  ├ corpus.aln.gz
│        │  └ lex.s2t.pruned.gz
│        ├ merged
│        │  ├ corpus.ru.gz
│        │  └ corpus.en.gz
│        └ filtered
│           ├ corpus.ru.gz
│           └ corpus.en.gz
├ models
│  ├ ru-en
│  │  └ test
│  │     ├ teacher
│  │     ├ student
│  │     ├ student-finetuned
│  │     ├ speed
│  │     ├ evaluation
│  │     │  ├ backward
│  │     │  ├ teacher0
│  │     │  ├ teacher1
│  │     │  ├ teacher-ensemble
│  │     │  ├ student
│  │     │  ├ student-finetuned
│  │     │  └ speed
│  │     └ exported
│  └ en-ru
│     └ test
│        └ backward
│
├ experiments
│  └ ru-en
│     └ test
│        └ config.sh
└ logs
   └ ru-en
      └ test
         └ clean_corpus.log
```
## Development
### Architecture

Snakemake parallelizes steps that can be executed simultaneously.
The main Snakemake process (scheduler) should be launched interactively. It runs job processes on the worker nodes in cluster mode or on a local machine in local mode.
### Conventions

- All scripts work with respect to the repo root directory.
  This avoids having to think about relative paths and execution folders.
- Scripts inside the `pipeline` directory are independent and operate only using input arguments, input files