multilingual nmt results on IWSLT dataset #7

Open
shaoyangxu opened this issue Jun 26, 2022 · 4 comments


shaoyangxu commented Jun 26, 2022

Hi authors, I ran a multilingual experiment with the yml you provided and then evaluated it with fairseq-generate. Under two different settings, the results for fa/he/pl/it/ar/es/de/nl/avg are as follows:

  1. Detokenized BLEU
     I set the following parameters: bpe: subword_nmt, bpe_codes: /home/syxu/data/iwslt14/code, scoring: sacrebleu, tokenizer: moses. The results were:
17.1 29.0 17.0 29.5 20.9 35.6 28.1 31.4 26.073
  2. Tokenized BLEU
     Instead, I (1) removed the tokenizer parameter above and (2) set sacrebleu_tokenizer in SacrebleuConfig (fairseq\fairseq\scoring\bleu.py) to 'none'. The results were:
17.5 29.5 17.3 30.1 21.4 36.1 28.6 31.9 26.540
  3. The results reported in the paper are:
16.9 29 16.4 29.2 20.9 35.2 28.1 30.9 25.825

As you can see, the results in the paper seem closer to the detokenized BLEU. But Section 4 (Experiment Settings) says "We report tokenized BLEU" — why is that?
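
For reference, the gap between the two settings can be reproduced with sacrebleu's Python API; the snippet below is only an illustrative sketch with made-up sentences, not the actual IWSLT outputs:

```python
import sacrebleu

# Detokenized BLEU: feed detokenized text and let sacrebleu apply its own
# "13a" tokenizer (roughly what scoring: sacrebleu + tokenizer: moses does).
hyp = ["The cat sat on the mat."]
ref = [["The cat is on the mat."]]
print(sacrebleu.corpus_bleu(hyp, ref, tokenize="13a").score)

# Tokenized BLEU: the text is already tokenized (e.g. by moses), so sacrebleu's
# internal tokenizer is switched off, mirroring sacrebleu_tokenizer='none'.
hyp_tok = ["The cat sat on the mat ."]
ref_tok = [["The cat is on the mat ."]]
print(sacrebleu.corpus_bleu(hyp_tok, ref_tok, tokenize="none").score)
```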


linzehui commented Jun 27, 2022

@beiweixiaoxu Our evaluation pipeline goes like this: fairseq-generate → extract the hypotheses → pass them through a tokenizer → score with sacrebleu, with the -tok argument set to 'none'. That is, something like:
```
fairseq-generate ${databin} \
    --path ${ckp_file} --beam 5 --remove-bpe='sentencepiece' \
    --batch-size 128 \
    > ${log_file}
```

```
cat ${log_file} | grep -P "^H" | cut -f 3- | sh tok.sh ${tgt} > ${log_file}.hypo
cat ${log_file} | grep -P "^T" | cut -f 2- | sh tok.sh ${tgt} > ${log_file}.ref
sacrebleu -tok 'none' ${log_file}.ref < ${log_file}.hypo
```

The tok.sh here is the script from fairseq's m2m100 example.

Reading the BLEU directly from fairseq-generate can be somewhat problematic; scoring with sacrebleu is more standard.

@shaoyangxu
Author

@linzehui If I understand correctly, you first use fairseq-generate to produce completely clean hyp and ref text (i.e., remove the BPE, and detokenize if the data was tokenized in the first place), then tokenize the clean hyp/ref text with the m2m100 script, and finally score with sacrebleu with tok='none'.

Here is my own pipeline. For iwslt14 I preprocessed the data with https://github.com/RayeRen/multilingual-kd-pytorch/blob/master/data/iwslt/raw/prepare-iwslt14.sh, which first tokenizes with mosesdecoder and then learns BPE with subword-nmt; it does not use sentencepiece. So my pipeline is roughly:

  1. `fairseq-generate ${databin} --path ${ckp_file} --remove-bpe='subword_nmt' --tokenizer='moses' > ${log_file}`
  2. `cat ${log_file} | grep -P "^D" | cut -f 3- | sh tok.sh ${tgt} > ${log_file}.hypo`
  3. `cat ${log_file} | grep -P "^T" | cut -f 2- | sh tok.sh ${tgt} > ${log_file}.ref`
  4. `sacrebleu -tok 'none' ${log_file}.ref < ${log_file}.hypo`

The results I get this way are as follows (the m2m100 script needs an extra environment for ar tokenization, so I did not evaluate it):

17.85 29.5 17.15 30 ? 35.95 28.55 31.7 ?

These still look somewhat high?


linzehui commented Jul 4, 2022

@beiweixiaoxu I checked the old scripts. The pipeline I described above was used on WMT; on IWSLT I directly reported the BLEU from fairseq-generate:

```
fairseq-generate ${databin} --user-dir ../fairseq_code \
    --path ${ckp} --task ${task} \
    --gen-subset test \
    --source-lang ${src} --target-lang ${tgt} \
    --remove-bpe --batch-size 128 \
    --encoder-langtok "src" --decoder-langtok \
    --langs fa,he,pl,it,ar,es,de,nl,en \
    --lang-pairs ar-en,de-en,en-ar,en-de,en-es,en-fa,en-he,en-it,en-nl,en-pl,es-en,fa-en,he-en,it-en,nl-en,pl-en \
    > ${log_path}/${direction}_${suffix}.log
```


shaoyangxu commented Jul 16, 2022

@linzehui OK, understood. With the method above I still cannot reproduce the numbers in the paper, but LaSS does seem to be effective: my current result is 26.619 (LaSS) vs 26.073 (multilingual).
Also, I found a problem in the code: in toolbox/generate_mask.py#L110, the exclude_output_proj parameter does not actually control whether the output projection is pruned, because the code deletes the entries from state but not from mask_dict. I also found that if the output projection is pruned as well, LaSS loses about 0.1 BLEU (26.519).
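
For illustration, the kind of fix I have in mind looks roughly like the sketch below. state, mask_dict, and exclude_output_proj follow the description above; the "output_projection" key pattern is an assumption about how the parameters are named, not the actual code in toolbox/generate_mask.py.

```python
def exclude_output_projection(state, mask_dict, exclude_output_proj=True):
    """Sketch only: to exclude the output projection from pruning, its entries
    have to be dropped from *both* state and mask_dict. Dropping them only
    from state leaves the masks in mask_dict, so the projection is still cut."""
    if exclude_output_proj:
        # "output_projection" is an assumed key pattern for the decoder's
        # output projection weights; adjust it to the real parameter names.
        proj_keys = [k for k in state if "output_projection" in k]
        for k in proj_keys:
            state.pop(k, None)
            mask_dict.pop(k, None)  # the step missing in the current code
    return state, mask_dict
```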
