Discrepancy in MM-Math Evaluation Results: Possible Issue with Answer Extraction in VLMEvalKit #638

mantle2048 opened this issue Dec 1, 2024 · 6 comments

@mantle2048

Hi, everyone.

I’ve noticed a significant discrepancy between the evaluation results on the MM-Math dataset and the results reported in the original paper.

In the original MM-Math paper, GPT-4o achieved an accuracy of 31.8%, but the evaluation result from VLMEvalKit is only 22.5%.

This difference doesn’t seem to be caused by randomness in the model's outputs. Upon reviewing the code, I found that VLMEvalKit uses the answer-extraction code provided by the original MM-Math repository.

However, the current code seems to match every occurrence of the answer enclosed in \boxed{} in the output (see this line). This may lead to incorrect judgments if the model outputs the correct answer multiple times.

I believe this might be the reason for the performance discrepancy compared to the original MM Math paper.

Additionally, since the answers in this dataset are open-ended, using an LLM for answer extraction and comparison might be a better option than the hardcoded matching approach.
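
For illustration, here is a minimal sketch of what such an LLM-based comparison could look like. The client, model name, and prompt below are assumptions for illustration only, not VLMEvalKit's actual implementation:

from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY in the environment

def llm_judge(ground_truth: str, prediction: str) -> bool:
    # Hypothetical LLM-as-judge comparison; the prompt and model are placeholders.
    prompt = (
        "You are grading a math answer.\n"
        f"Ground-truth answer: {ground_truth}\n"
        f"Model response: {prediction}\n"
        "Reply 'yes' if the final answer in the model response is mathematically "
        "equivalent to the ground truth, otherwise reply 'no'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")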

@kennymckormick (Member)

Hi @mantle2048,
We will look into the evaluation code and refine it. Could you please provide some error cases for debugging?

@mantle2048 (Author)

Sure, I would be happy to!

@mantle2048 (Author)

A minimal script that reproduces the issue:

# AutoScoringJudge is the answer-extraction/judging class that VLMEvalKit
# borrows from the original MM-Math repository; the exact import path
# depends on the installed version.
scorer = AutoScoringJudge()

solution = "$\\therefore$ k = \\boxed{-6}.$"
prediction = "$\\therefore$ k = \\boxed{-6}. \\\\\n k = \\boxed{-6}.$"
if "\\boxed{" in solution and "\\boxed{" in prediction:
    processed_solution, processed_prediction = scorer.preprocess(solution, prediction)
    print("processed_solution:", processed_solution)
    print("processed_prediction:", processed_prediction)
    judge = scorer.judge(solution, prediction)
    print("judge:", judge)

# ================================
# Output
# processed_solution:  -6
# processed_prediction:  -6,-6
# judge:  False

The prediction is correct, but the scorer extracts the contents of every \boxed{} occurrence and concatenates them, so the comparison against the single ground-truth value fails.
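
For comparison, one possible workaround (not the current scorer's behavior) would be to keep only the last \boxed{...} value in the prediction before judging. A rough sketch; the simple regex below does not handle nested braces inside \boxed{...}:

import re

def last_boxed(text):
    # Return the contents of the final \boxed{...}, or None if there is none.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

print(last_boxed(prediction))  # -6, matching the ground truth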

@kennymckormick (Member)

Hi @mantle2048,
I'm afraid the case you provided is not a representative one. I checked the data and found that GPT-4o outputs more \boxed items than the ground-truth answer in fewer than 0.2% of cases. Here are the raw predictions of GPT-4o on MM-Math for reference:

GPT4o_20240806_MM-Math.xlsx
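
For reference, a rough sketch of how that percentage can be checked against the attached file; the column names "answer" and "prediction" are assumptions about the spreadsheet layout:

import pandas as pd

# Count predictions containing more \boxed{...} items than the ground truth.
# The column names below are assumptions about the attached xlsx layout.
df = pd.read_excel("GPT4o_20240806_MM-Math.xlsx")
extra = (df["prediction"].astype(str).str.count(r"\\boxed")
         > df["answer"].astype(str).str.count(r"\\boxed"))
print(f"{extra.mean():.2%} of predictions contain extra \\boxed items")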

@mantle2048 (Author)

It seems I underestimated GPT-4o; repeated \boxed outputs are more common with weaker, smaller models.

However, the significant performance discrepancy compared to the results reported in the original paper is quite strange.

We can close this issue for now and reopen it once I have further findings.

@kennymckormick (Member)

OK, I will also try to contact the paper authors for more information.
