Single-side deduplication #928

Open
Tracked by #216
ZJaume opened this issue Nov 13, 2024 · 7 comments
Labels
quality Improving robustness and translation quality

Comments

@ZJaume
Collaborator

ZJaume commented Nov 13, 2024

Some experiments that a colleague did during the MaCoCu project found that deduplicating on only the source side or only the target side improved translation quality. IIRC it was not clear which was better, deduplicating on the source or on the target, but both were better than deduplicating on the whole sentence pair. In some cases I think the gain was about 1 BLEU point for mid-resource languages. This probably reduces the number of translation inconsistencies.

I couldn't find the table with the results, but I think this is worth exploring.

Maybe you are already doing this, but I was not sure. At least in the old pipeline dedupe is using the whole sentence pair.
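To make the difference concrete, here is a minimal sketch of the three deduplication modes discussed above. It is not the pipeline's code: the `dedupe` function, its `side` parameter, and the toy sentence pairs are all made up for illustration, and it naively keeps the first pair seen for each key.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def dedupe(pairs: Iterable[Pair], side: str = "both") -> Iterator[Pair]:
    """Yield the first pair seen for each key.

    side="both"   -> key is the whole sentence pair
    side="source" -> key is the source sentence only
    side="target" -> key is the target sentence only
    """
    if side not in ("source", "target", "both"):
        raise ValueError(f"unknown side: {side!r}")
    seen = set()
    for src, trg in pairs:
        if side == "source":
            key = src
        elif side == "target":
            key = trg
        else:
            key = (src, trg)
        if key in seen:
            continue  # a pair with the same key was already kept
        seen.add(key)
        yield src, trg


# Toy data, invented for this example.
pairs = [
    ("Hello", "Hola"),
    ("Hello", "Hola"),    # exact duplicate of the first pair
    ("Hello", "Buenas"),  # same source, different target
    ("Hi", "Hola"),       # same target, different source
]

print(list(dedupe(pairs, "both")))    # 3 pairs survive
print(list(dedupe(pairs, "source")))  # 2 pairs survive
print(list(dedupe(pairs, "target")))  # 2 pairs survive
```

Keeping the first occurrence is arbitrary; which of the colliding pairs to keep is a separate question.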

@ZJaume ZJaume added the quality Improving robustness and translation quality label Nov 13, 2024
@gregtatum
Member

We are de-duplicating based on source and target.

@eu9ene
Collaborator

eu9ene commented Dec 13, 2024

I was investigating another issue and I saw this in the tokenized en-zh corpus:

Natural ▁ compound

Natural ▁ products

Natural ▁ product

Natural ▁ Products

correspond to

天然 产物

天然 产物

天然 产物

天然 产物

Google Translate translates 天然 产物 as "Natural Products".

So when we train en-zh it's probably OK to leave these as they are, but for zh-en it would make sense to do source-side deduplication; otherwise we have four different translations for the same Chinese phrase. The question is which translation is correct. It would make sense to run a model to score each of them and pick the best one instead of deduplicating naively.

@ZJaume
Collaborator Author

ZJaume commented Dec 16, 2024

Picking the one with the best BCAI score?

天然 产物       Natural compund 0.548
天然 产物       Natural products        0.889
天然 产物       Natural product 0.881
天然 产物       Natural Products        0.901
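A rough sketch of that idea, assuming the scores have already been computed and attached to each pair as in the rows above. The `best_per_source` function is hypothetical and does not call any scorer itself; it only keeps the highest-scoring target for each source sentence.

```python
from typing import Dict, Iterable, List, Tuple

Scored = Tuple[str, str, float]  # (source, target, precomputed score)


def best_per_source(rows: Iterable[Scored]) -> List[Scored]:
    """Source-side deduplication that keeps the best-scoring pair per source."""
    best: Dict[str, Scored] = {}
    for row in rows:
        src, _trg, score = row
        # Replace the stored pair only when the new one scores strictly higher,
        # so ties keep the first pair seen.
        if src not in best or score > best[src][2]:
            best[src] = row
    return list(best.values())


# The four rows listed above, copied as-is.
rows = [
    ("天然 产物", "Natural compund", 0.548),
    ("天然 产物", "Natural products", 0.889),
    ("天然 产物", "Natural product", 0.881),
    ("天然 产物", "Natural Products", 0.901),
]

print(best_per_source(rows))
# [('天然 产物', 'Natural Products', 0.901)]
```

For target-side deduplication the same loop would key on the target instead of the source.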

@gregtatum
Member

@ZJaume What is the BCAI score?

@ZJaume
Collaborator Author

ZJaume commented Dec 17, 2024

Bicleaner AI, sorry 😅

@ZJaume
Collaborator Author

ZJaume commented Jan 21, 2025

I've been noticing that there are several corpora (NLLB, LinguaTools-WikiTitles and others) in Japanese and Korean that would benefit from this. It is quite common to find a sentence on one side or the other repeated multiple times and aligned each time with a completely different sentence.

For example, this sentence in the Korean NLLB corpus:

Items will be avialable until they are dispersed.       그것들은 주어진 평면 안에서 부분적으로 점유되면서 펼    0.078
throats, whereas they were only amicably engaged in disentangling their 그것들은 주어진 평면 안에서 부분적으로 점유되면서 펼    0.340
They will just randomly break though, so having a couple will come in handy.    그것들은 주어진 평면 안에서 부분적으로 점유되면서 펼    0.022
When people recriminated him about his gloating, Alon doubled down.     그것들은 주어진 평면 안에서 부분적으로 점유되면서 펼    0.000
They are riding away from the city on a horse when it bucks them.       그것들은 주어진 평면 안에서 부분적으로 점유되면서 펼    0.003
Groups will come together to solve an issue, then disperse.     그것들은 주어진 평면 안에서 부분적으로 점유되면서 펼    0.014

Luckily, we have Bicleaner AI as a safeguard (I added its scores in the third column to illustrate), and it seems to assign scores low enough to discard every pair that I randomly selected.
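As a complement to score-based filtering, this pattern can also be detected directly by counting how many distinct sentences each repeated sentence is aligned with. The sketch below is only an illustration: the function names and the `max_sources` threshold are assumptions, and the last pair in the sample data is made up to serve as a clean control.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def overused_targets(pairs: Iterable[Pair], max_sources: int = 3) -> Dict[str, int]:
    """Map each target aligned with more than max_sources distinct sources to that count."""
    sources_by_target: Dict[str, Set[str]] = defaultdict(set)
    for src, trg in pairs:
        sources_by_target[trg].add(src)
    return {
        trg: len(srcs)
        for trg, srcs in sources_by_target.items()
        if len(srcs) > max_sources
    }


def drop_overused(pairs: Iterable[Pair], max_sources: int = 3) -> List[Pair]:
    """Remove every pair whose target was flagged by overused_targets."""
    pairs = list(pairs)  # materialize, since the data is traversed twice
    flagged = overused_targets(pairs, max_sources)
    return [(src, trg) for src, trg in pairs if trg not in flagged]


repeated = "그것들은 주어진 평면 안에서 부분적으로 점유되면서 펼"
pairs = [
    ("Items will be avialable until they are dispersed.", repeated),
    ("Groups will come together to solve an issue, then disperse.", repeated),
    ("When people recriminated him about his gloating, Alon doubled down.", repeated),
    ("Thank you.", "감사합니다."),  # made-up clean pair, not from the corpus
]

print(overused_targets(pairs, max_sources=2))
print(drop_overused(pairs, max_sources=2))
```

A suitable threshold would need tuning per corpus, since short, frequent sentences can legitimately have several valid translations.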

@eu9ene
Collaborator

eu9ene commented Jan 21, 2025

It's a good idea to use it when looking at data. I integrated bicleaner-ai with OpusCleaner when I did cleaning experiments. It's quite easy in the code, but the installation is cumbersome, so I didn't create a PR.

3 participants