Single-side deduplication #928
Comments
We are de-duplicating based on source and target.
I was investigating another issue and I saw this for en-zh tokenized corpus:
correspond to
Google translate translates
So probably when we train en-zh it's OK to leave it, but for zh-en it would make sense to do source-side deduplication; otherwise we have 4 different translations for the same Chinese phrase. The question here is which translation is correct. It would make sense to run some model to score each of them and pick the best one instead of naive deduplication.
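To make the "score each of them and pick the best one" idea concrete, here is a minimal sketch. It assumes every pair already carries a quality score computed by some external model; the function name `dedupe_source_keep_best` and the toy rows are hypothetical and not part of the pipeline.

```python
from typing import Dict, Iterable, Iterator, Tuple

Row = Tuple[str, str, float]  # (source, target, score); the score is assumed precomputed


def dedupe_source_keep_best(rows: Iterable[Row]) -> Iterator[Row]:
    """Keep one target per distinct source sentence: the highest-scoring one.

    Hypothetical helper for illustration. Sources are compared after
    stripping surrounding whitespace; ties keep the first pair seen.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for src, tgt, score in rows:
        key = src.strip()
        if key not in best or score > best[key][1]:
            best[key] = (tgt, score)
    # dicts preserve insertion order, so sources come out in first-seen order
    for src, (tgt, score) in best.items():
        yield src, tgt, score


# Toy data, made up for this example
rows = [
    ("src A", "translation 1", 0.90),
    ("src A", "translation 2", 0.95),
    ("src B", "translation 3", 0.80),
    ("src A", "translation 4", 0.30),
]
kept = list(dedupe_source_keep_best(rows))
```

This holds one entry per distinct source in memory, so a very large corpus would probably need a sort-based or hashed variant instead.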
Picking the one with the best BCAI score?
@ZJaume What is the BCAI score?
Bicleaner AI, sorry 😅
I've noticed that several corpora (NLLB, LinguaTools-WikiTitles, and others) in Japanese and Korean would benefit from this. It is quite common to find a sentence on one side repeated multiple times and aligned each time with a completely different sentence. For example, this sentence in Korean NLLB:
Luckily, we have Bicleaner AI as a safeguard (see the scores I added in the third column to illustrate), and it seems to assign scores that would cause all the pairs I randomly selected to be discarded.
It's a good idea to use it when looking at data. I integrated bicleaner-ai with OpusCleaner when I did cleaning experiments. The code is quite easy, but the installation is cumbersome, so I didn't create a PR.
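A quick way to gauge how widespread this is in a given corpus is to count, for each source sentence, how many distinct targets it is aligned with. The sketch below assumes the corpus is available as an iterable of (source, target) string pairs; `find_inconsistent_sources` and the sample pairs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_inconsistent_sources(
    pairs: Iterable[Tuple[str, str]], min_targets: int = 2
) -> Dict[str, int]:
    """Map each source sentence to its number of distinct targets,
    keeping only sources that have at least `min_targets` of them."""
    targets: Dict[str, Set[str]] = defaultdict(set)
    for src, tgt in pairs:
        targets[src].add(tgt)
    return {src: len(t) for src, t in targets.items() if len(t) >= min_targets}


# Toy data, made up for this example
pairs = [
    ("src A", "tgt 1"),
    ("src A", "tgt 2"),
    ("src A", "tgt 1"),  # exact duplicate pair, counted once
    ("src B", "tgt 3"),
]
report = find_inconsistent_sources(pairs)
```

Swapping the two elements of each pair gives the same report for the target side.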
Some experiments that a colleague ran during the MaCoCu project found that deduplication taking into account only the source side or only the target side improved translation quality. IIRC it was not clear which was better, deduplicating on the source or on the target, but both were better than deduplicating on the whole sentence pair. In some cases I think it was about 1 BLEU point for mid-resource languages. This probably reduces the amount of translation inconsistencies.
I couldn't find the table with the results, but I think this is worth exploring.
Maybe you are already doing this, but I was not sure. At least in the old pipeline, `dedupe` is using the whole sentence pair.
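The difference between pair-level and single-side deduplication can be sketched as follows. This is only an illustration under simple assumptions (exact string match, first occurrence wins); `dedupe_pairs` is a hypothetical function and not the `dedupe` tool used by the pipeline.

```python
from typing import Hashable, Iterable, Iterator, Set, Tuple


def dedupe_pairs(
    pairs: Iterable[Tuple[str, str]], side: str = "both"
) -> Iterator[Tuple[str, str]]:
    """Yield pairs whose deduplication key has not been seen before.

    side="both"   -> key is the whole (source, target) pair
    side="source" -> key is the source sentence only
    side="target" -> key is the target sentence only
    """
    if side not in ("both", "source", "target"):
        raise ValueError(f"unknown side: {side!r}")
    seen: Set[Hashable] = set()
    for src, tgt in pairs:
        if side == "source":
            key: Hashable = src
        elif side == "target":
            key = tgt
        else:
            key = (src, tgt)
        if key in seen:
            continue
        seen.add(key)
        yield src, tgt


# Toy data, made up for this example
pairs = [("a", "x"), ("a", "y"), ("b", "x"), ("a", "x")]
by_pair = list(dedupe_pairs(pairs, side="both"))
by_source = list(dedupe_pairs(pairs, side="source"))
by_target = list(dedupe_pairs(pairs, side="target"))
```

Keeping the first occurrence is the naive choice; which of several translations survives then depends only on corpus order.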