Do we need a config to change padding_side='left' before the evaluation? #31672

Open
gary-young opened this issue Jun 27, 2024 · 3 comments
Labels
Feature request Request for a new feature

Comments

@gary-young

Feature request

I am trying to train a Llama model (a decoder-only model). I want to evaluate my model not only with the loss but also with some generation-based metric. For example, an eval example could be the string 1+2=, and I use the Seq2SeqTrainer, which provides a modified prediction step so that I can get the model's prediction in the EvalPrediction. Then I write my eval code in the compute_metrics function and pass it to the Seq2SeqTrainer.
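Concretely, the setup I have in mind looks roughly like this (only a sketch; model, tokenizer, train_ds, eval_ds and the exact metric are placeholders, not real code from my project):

```python
import numpy as np
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

def compute_metrics(eval_pred):
    # with predict_with_generate=True, eval_pred.predictions holds generated token ids
    preds = np.where(eval_pred.predictions != -100,
                     eval_pred.predictions, tokenizer.pad_token_id)
    labels = np.where(eval_pred.label_ids != -100,
                      eval_pred.label_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    exact = sum(p.strip() == l.strip()
                for p, l in zip(decoded_preds, decoded_labels)) / len(decoded_preds)
    return {"exact_match": exact}

args = Seq2SeqTrainingArguments(
    output_dir="out",
    predict_with_generate=True,  # prediction_step returns generated ids instead of logits
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
```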

The problem is the padding_side of the tokenizer. Because I need to train the model, the tokenizer should use right padding for the training dataset (that is the default setting for Llama). However, when I evaluate the model, the tokenizer should be switched to left padding because I need the model to generate. I do not find an easy way to do this without changing the source code of the trainer (for example, the get_eval_dataloader method of the Trainer).
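The only workaround I see without touching the trainer is flipping the flag by hand, which does not help when evaluation runs inside train() (a sketch):

```python
# Manual workaround sketch: flip padding_side around a standalone evaluate() call.
# This does not cover evaluation that the Trainer triggers during train().
tokenizer.padding_side = "right"   # Llama default, used for the training batches
trainer.train()

tokenizer.padding_side = "left"    # generation needs left padding
metrics = trainer.evaluate()

tokenizer.padding_side = "right"   # switch back before any further training
```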

My questions are:

  1. Is this the correct way to evaluate a decoder-only model in a generation-based way? Should I use the Seq2SeqTrainer, or is there some other method I have not found? (Is there an example doc?)
  2. Can I just train a model with right padding but evaluate it with left padding? If not, how should I evaluate models like Llama?
  3. If my evaluation process is correct, how can I change the padding_side to left at the beginning of the evaluation and change it back to right after the evaluation? (I think that if we had separate training and evaluation data collators, the problem could be solved. Is that possible with the current transformers Trainer, or is there any other way to implement it?)
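For question 3, this is roughly what I mean by separate collators (a sketch only; eval_data_collator is my own argument name, not something the Trainer exposes):

```python
from transformers import Trainer

class TrainerWithEvalCollator(Trainer):
    """Sketch: a Trainer that uses a different (left-padding) collator for evaluation."""

    def __init__(self, *args, eval_data_collator=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.eval_data_collator = eval_data_collator

    def get_eval_dataloader(self, eval_dataset=None):
        # temporarily swap in the evaluation collator, then restore the training one
        train_collator = self.data_collator
        if self.eval_data_collator is not None:
            self.data_collator = self.eval_data_collator
        try:
            return super().get_eval_dataloader(eval_dataset)
        finally:
            self.data_collator = train_collator
```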

Motivation

Motivation: generation-based evaluation when we train a decoder-only autoregressive model like Llama.

Your contribution

I do not know how I can help.

@gary-young gary-young added the Feature request Request for a new feature label Jun 27, 2024
@zucchini-nlp
Member

Hey!

Computing metrics with generation for decoder-only models does not work currently. See #26474 and the linked issues requesting the feature.

I am planning to work on it next week :)

@gary-young
Author

@zucchini-nlp Thank you! For now I have implemented it by:

  1. removing the answer part from the validation (and test) datasets,
  2. using the Seq2SeqTrainer instead of the Trainer, since it overrides the prediction_step function so that its output contains the predictions (it is a bit tricky because it replaces the original logits with the generated token_ids),
  3. then getting the predictions and calculating my metrics in my own compute_metrics function.

So far it seems to work. I solved the padding_side problem by also overriding the get_test_dataloader and get_eval_dataloader functions to change the dataloader (more specifically, the data collator).
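Roughly, the eval collator pads the prompt-only examples on the left, something like this (a sketch; tokenizer is the same tokenizer as above and my actual code differs in details):

```python
import torch

def left_padding_eval_collator(features):
    # features hold prompt-only input_ids (the answer part has been removed)
    input_ids = [torch.tensor(f["input_ids"], dtype=torch.long) for f in features]
    max_len = max(len(ids) for ids in input_ids)
    padded, masks = [], []
    for ids in input_ids:
        pad_len = max_len - len(ids)
        # pad on the left so every prompt ends right where generation starts
        padded.append(torch.cat(
            [torch.full((pad_len,), tokenizer.pad_token_id, dtype=torch.long), ids]))
        masks.append(torch.cat(
            [torch.zeros(pad_len, dtype=torch.long), torch.ones(len(ids), dtype=torch.long)]))
    return {"input_ids": torch.stack(padded), "attention_mask": torch.stack(masks)}
```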

I am not sure this is the correct way to implement generation-based evaluation, but it seems to work.

@gary-young
Author

@zucchini-nlp Oh, but my implementation has a new problem: because the input sequences have been manually truncated (the answer parts removed), the eval_loss does not make sense anymore.
