Discrepancy in MM-Math Evaluation Results: Possible Issue with Answer Extraction in VLMEvalKit #638

mantle2048 opened this issue Dec 1, 2024 · 6 comments

@mantle2048

Hi, everyone.

I’ve noticed a significant discrepancy between the evaluation results on the MM-Math dataset and the results reported in the original paper.

In the original MM-Math paper, GPT-4o achieved an accuracy of 31.8%, but the evaluation result from VLMEvalKit is only 22.5%.

This difference doesn’t seem to be caused by randomness in the model's outputs. Upon reviewing the code, I found that VLMEvalKit uses the answer-extraction code provided by the original MM-Math repository.

However, the current code seems to match every occurrence of the answer enclosed in \boxed{} in the output (see this line). This may lead to incorrect judgments if the model outputs the correct answer multiple times.

I believe this might be the reason for the performance discrepancy compared to the original MM Math paper.

Additionally, since the answers in this dataset are open-ended, using an LLM for answer extraction and comparison might be a better option than the hardcoded matching approach.
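
For illustration, here is a minimal sketch of what such an LLM-based comparison could look like. The client, model name, and prompt below are assumptions for illustration only, not VLMEvalKit's actual implementation:

from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY in the environment

def llm_judge(ground_truth: str, prediction: str) -> bool:
    # Hypothetical LLM-as-judge comparison; the prompt and model are placeholders.
    prompt = (
        "You are grading a math answer.\n"
        f"Ground-truth answer: {ground_truth}\n"
        f"Model response: {prediction}\n"
        "Reply 'yes' if the final answer in the model response is mathematically "
        "equivalent to the ground truth, otherwise reply 'no'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")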

@kennymckormick (Member)

Hi @mantle2048,
We will look into the evaluation code and refine it. Could you please provide some error cases for debugging?

@mantle2048 (Author)

Sure, I would be happy to!

@mantle2048 (Author)

A minimal script that reproduces the issue:

# AutoScoringJudge is the answer-extraction/judging class that VLMEvalKit
# borrows from the original MM-Math repository; the exact import path
# depends on the installed version.
scorer = AutoScoringJudge()

solution = "$\\therefore$ k = \\boxed{-6}.$"
prediction = "$\\therefore$ k = \\boxed{-6}. \\\\\n k = \\boxed{-6}.$"
if "\\boxed{" in solution and "\\boxed{" in prediction:
    processed_solution, processed_prediction = scorer.preprocess(solution, prediction)
    print("processed_solution:", processed_solution)
    print("processed_prediction:", processed_prediction)
    judge = scorer.judge(solution, prediction)
    print("judge:", judge)

# ================================
# Output
# processed_solution:  -6
# processed_prediction:  -6,-6
# judge:  False

The prediction is correct, but the scorer extracts the contents of every \boxed{} occurrence and concatenates them, so the comparison against the single ground-truth value fails.
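
For comparison, one possible workaround (not the current scorer's behavior) would be to keep only the last \boxed{...} value in the prediction before judging. A rough sketch; the simple regex below does not handle nested braces inside \boxed{...}:

import re

def last_boxed(text):
    # Return the contents of the final \boxed{...}, or None if there is none.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

print(last_boxed(prediction))  # -6, matching the ground truth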

@kennymckormick (Member)

Hi @mantle2048,
I'm afraid the case you provided is not a representative one. I checked the data and found that GPT-4o outputs more \boxed items than the ground-truth answer in fewer than 0.2% of cases. Here are the raw predictions of GPT-4o on MM-Math for reference:

GPT4o_20240806_MM-Math.xlsx
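
For reference, a rough sketch of how that percentage can be checked against the attached file; the column names "answer" and "prediction" are assumptions about the spreadsheet layout:

import pandas as pd

# Count predictions containing more \boxed{...} items than the ground truth.
# The column names below are assumptions about the attached xlsx layout.
df = pd.read_excel("GPT4o_20240806_MM-Math.xlsx")
extra = (df["prediction"].astype(str).str.count(r"\\boxed")
         > df["answer"].astype(str).str.count(r"\\boxed"))
print(f"{extra.mean():.2%} of predictions contain extra \\boxed items")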

@mantle2048 (Author)

It seems I underestimated GPT-4o; repeated \boxed outputs are more common with weaker, smaller models.

However, the significant performance discrepancy compared to the results reported in the original paper is quite strange.

We can close this issue for now and reopen it once I have further findings.

@kennymckormick (Member)

OK, I will also try to contact the paper authors for more information.
