
how to use EncoderDecoderModel to do en-de translation? #8944

Open
CharizardAcademy opened this issue Dec 6, 2020 · 19 comments
@CharizardAcademy

I have trained an EncoderDecoderModel from Hugging Face to do an English-German translation task. I tried to overfit a small dataset (100 parallel sentences) and used model.generate() followed by tokenizer.decode() to perform the translation. The output looks like proper German sentences, but it is definitely not the correct translation.

Here is the code for building the model:

from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# both encoder and decoder are randomly initialized BERT architectures
encoder_config = BertConfig()
decoder_config = BertConfig()
config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)
model = EncoderDecoderModel(config=config)

Here is the code for testing the model:

import torch

model.eval()
input_ids = torch.tensor(tokenizer.encode(input_text)).unsqueeze(0)
# note: generation is started from the decoder's pad token here
output_ids = model.generate(input_ids.to('cuda'), decoder_start_token_id=model.config.decoder.pad_token_id)
output_text = tokenizer.decode(output_ids[0])
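
For reference, the warm-starting blog post linked further down in this thread sets the special-token IDs on the config before calling generate(); a minimal sketch of that setup, assuming a standard bert-base-cased tokenizer (the original post does not say which tokenizer was used):

from transformers import BertTokenizerFast

# assumed tokenizer; swap in whichever tokenizer the model was trained with
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# BERT has no dedicated BOS/EOS tokens, so CLS/SEP are reused for generation
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id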

Example input: "iron cement is a ready for use paste which is laid as a fillet by putty knife or finger in the mould edges ( corners ) of the steel ingot mould ."

Ground truth translation: "iron cement ist eine gebrauchs ##AT##-##AT## fertige Paste , die mit einem Spachtel oder den Fingern als Hohlkehle in die Formecken ( Winkel ) der Stahlguss -Kokille aufgetragen wird ."

What the model outputs after training for 100 epochs: "[S] wenn sie den unten stehenden link anklicken, sehen sie ein video uber die erstellung ansprechender illustrationen in quarkxpress" (roughly: "if you click the link below, you will see a video about creating appealing illustrations in QuarkXPress"), which is total nonsense.

Where is the problem?

@LysandreJik
Member

Hello, thanks for opening an issue! We try to keep the GitHub issues for bugs/feature requests.
Could you ask your question on the forum instead?

Thanks!

cc @patrickvonplaten who might have an idea.

@patrickvonplaten
Contributor

This blog post should also help with fine-tuning a warm-started encoder-decoder model: https://huggingface.co/blog/warm-starting-encoder-decoder . But as @LysandreJik said, the forum is the better place to ask.
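
The key point of that post is warm-starting: both halves are initialized from pretrained checkpoints instead of fresh BertConfig objects. A minimal sketch using the public from_encoder_decoder_pretrained API:

from transformers import EncoderDecoderModel

# warm-start encoder and decoder from pretrained BERT weights; the decoder
# is adapted automatically (causal masking, cross-attention layers added)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-cased", "bert-base-cased"
)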

@zmf0507

zmf0507 commented Dec 17, 2020

@patrickvonplaten the blog post mentions a notebook for the machine translation task, but clicking the link just redirects back to the blog. I think there might be a mistake in the notebook link. Can you please share the translation task notebook on the WMT dataset?

@patrickvonplaten
Contributor

Hey @zmf0507 - yeah, I sadly haven't found the time yet to make this notebook.

@zmf0507

zmf0507 commented Dec 25, 2020

@patrickvonplaten please let me know here when you make one. Despite being so popular, Hugging Face doesn't provide any tutorial/notebook for machine translation. I think a lot of people might be looking for similar resources. It will help a lot. Thanks

@patrickvonplaten
Contributor

We now have one for mBART: https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb -> I'll try to make one for EncoderDecoder as well when I find time :-)
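
For readers who just want working translation right away, the mBART usage from the Transformers docs looks roughly like the sketch below; it uses the public facebook/mbart-large-en-ro checkpoint (en-ro, since no official en-de mBART fine-tune is referenced in this thread):

from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained(
    "facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO"
)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")

inputs = tokenizer("UN Chief Says There Is No Military Solution in Syria", return_tensors="pt")
# force generation to start with the target-language code token
generated = model.generate(**inputs, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))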

@zmf0507

zmf0507 commented Dec 26, 2020

sure. thanks a lot :)

@zmf0507

zmf0507 commented Feb 14, 2021

@patrickvonplaten is there any encoder-decoder notebook for the translation task yet? Thanks

@patrickvonplaten
Contributor

I'm sadly not finding the time to do so at the moment :-/

I'll put this up as a "Good First Issue" now in case someone from the community finds time to make such a notebook.

A notebook for EncoderDecoderModel translation should look very similar to this notebook: https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Leveraging_Pre_trained_Checkpoints_for_Encoder_Decoder_Models.ipynb - one only has to replace the summarization dataset with a translation dataset.
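
The dataset swap is small in code terms. A sketch of what the translation preprocessing might look like, assuming the wmt16 de-en config from the datasets library and a multilingual BERT tokenizer (both are assumptions, not taken from the notebook):

from datasets import load_dataset
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
train_data = load_dataset("wmt16", "de-en", split="train[:1%]")

def preprocess(batch):
    # each wmt16 row looks like {"translation": {"de": ..., "en": ...}}
    sources = [pair["en"] for pair in batch["translation"]]
    targets = [pair["de"] for pair in batch["translation"]]
    model_inputs = tokenizer(sources, padding="max_length", truncation=True, max_length=128)
    labels = tokenizer(targets, padding="max_length", truncation=True, max_length=128).input_ids
    # replace pad tokens with -100 so they are ignored by the loss
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq] for seq in labels
    ]
    return model_inputs

train_data = train_data.map(preprocess, batched=True, remove_columns=["translation"])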

@zmf0507

zmf0507 commented Feb 19, 2021

@patrickvonplaten thanks for the update.
Can you tell me if there is any work on keyphrase/keyword generation (a seq2seq task) using Hugging Face? I am looking for tutorials and examples where I can try and play around with keyphrase generation. This task is not mentioned on the Hugging Face notebooks page either.
Please let me know

@patrickvonplaten
Contributor

My best advice would be to ask this question on the forum - I sadly don't know of any work related to this.

@parambharat
Contributor

@patrickvonplaten: Here's my attempt; it modifies the condensed version of BERT2BERT.ipynb to use the WMT dataset and the BLEU-4 score for the en-de translation task.
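
For the scoring side, a minimal sketch of corpus-level BLEU with sacrebleu (the library usually reported for WMT); the hypotheses and references below are placeholders:

import sacrebleu

hypotheses = ["das ist ein test"]        # one model output per source sentence
references = [["das ist ein Test"]]      # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)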

@Nid989

Nid989 commented Mar 22, 2022

> We now have one for mBART: https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb -> I'll try to make one for EncoderDecoder as well when I find time :-)

Inferring the model training details from BERT2BERT for CNN/DailyMail is not sufficient. We experimented with an MT model on the MuST-C data for en-fr, but the predictions were almost random and the model was not able to capture the core meaning of its input sequence.

@Nid989

Nid989 commented Mar 22, 2022

If anyone has a complete notebook based on the Encoder-Decoder model for MT, please share. Thank you.

@xueqianyi

Has anyone performed the translation task correctly using bert2bert? TAT

@patrickvonplaten
Contributor

@xueqianyi - maybe you have more luck on https://discuss.huggingface.co/ ?

@ydshieh
Collaborator

ydshieh commented Aug 23, 2022

Just an extra comment here: bert2bert is not very helpful for MT, since BERT is pretrained only on English data.
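
Following that observation, one option is to warm-start from a multilingual checkpoint instead; a sketch (the checkpoint choice is a suggestion, not something tested in this thread):

from transformers import BertTokenizerFast, EncoderDecoderModel

# multilingual BERT has seen German during pretraining, unlike English-only BERT
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-cased", "bert-base-multilingual-cased"
)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")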

@desaibhargav

desaibhargav commented Aug 25, 2022

Hi there, I'm a Data Science grad student at Luddy. I was looking to contribute to open source in my free time and came across this issue. I put a rough notebook together, linking it here @xueqianyi @CharizardAcademy. I would love to polish it to the standard upheld in the HF community if it's indeed helpful.

Just some comments (I did NOT spend a lot of time on this, so your observations MIGHT differ):

  1. The translation quality depends a lot on model capacity, though even with base BERT the translations are fairly decent and definitely not gibberish. Tweaking the decoding parameters helps too.

  2. I've trained on only 1M examples due to compute constraints, but I believe a few multiples more might work out better. When I trained with 0.1M and 0.5M examples, I saw consistent improvements in the BLEU score with every increase.

  3. The length of the tensors fed into the model (post-tokenization) has an impact on the translation quality too. Specifically, max_length=64 or higher produces a lot of repetition, especially for short sentences, because most examples (95%) in this particular dataset (the 1M subset) are below 32 tokens. Hence I recommend spending some time tweaking the decoding parameters, no_repeat_ngram_size, max_length, and length_penalty in particular; see the sketch after this list.

  4. Also, the model seems to think President Obama and President Bush are the same person, EVERY TIME. xD
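
A sketch of the kind of generate() call point 3 refers to; the parameter values are illustrative, not tuned:

output_ids = model.generate(
    input_ids,
    max_length=32,             # stay close to the dataset's actual length distribution
    num_beams=4,
    no_repeat_ngram_size=3,    # curbs the repetition described above
    length_penalty=1.0,
    early_stopping=True,
)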

@mahita2104

I would like to work on this issue.
