
Ovis1.5-Llama3-8B: locally measured HallusionBench score differs too much from the leaderboard #595

Open
LIRENDA621 opened this issue Nov 13, 2024 · 1 comment


@LIRENDA621

1. The OpenCompass leaderboard reports a score of 45, but our local evaluation only reaches 41.30.
2. This gap is not caused by the judge model: only 14 predictions were marked 'unknown' and handed to the judge, and those 14 questions are not Yes/No questions in the first place. I checked the officially released prediction results, and the same 14 questions are answered incorrectly there as well.
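For context on why only 'unknown' predictions reach the judge model, a minimal sketch of the usual rule-based Yes/No extraction step (this is an illustrative re-implementation, not VLMEvalKit's actual code):

```python
import re

def extract_yes_no(prediction: str) -> str:
    """Map a free-form model answer to 'Yes', 'No', or 'Unknown'.

    Hypothetical version of the rule-based step that runs before any
    judge model is consulted: only 'Unknown' answers go to the judge.
    """
    text = prediction.strip().lower()
    has_yes = re.search(r"\byes\b", text) is not None
    has_no = re.search(r"\bno\b", text) is not None
    if has_yes and not has_no:
        return "Yes"
    if has_no and not has_yes:
        return "No"
    return "Unknown"  # ambiguous or non-Yes/No answers fall through

preds = ["Yes, the figure shows that.", "No.", "It is unclear.", "yes and no"]
labels = [extract_yes_no(p) for p in preds]
unknown_count = labels.count("Unknown")
print(labels, unknown_count)
```

With a matcher like this, a handful of 'unknown' cases on questions that are not Yes/No to begin with cannot account for a multi-point accuracy gap.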
[screenshot: local evaluation results]

@kennymckormick
Member

Hi @LIRENDA621,
I have re-evaluated this model (torch 2.4 + cu121, transformers==4.46.2) and got an accuracy of ~42.3%, which is lower than the previous evaluation result. However, we are not sure whether the difference is due to randomness.
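One back-of-envelope way to judge whether a 45% vs. 41.3% gap could be randomness is a pooled two-proportion z-test. The sketch below assumes a question count of 1000, which is only a placeholder, not HallusionBench's actual size:

```python
from math import sqrt, erf

def two_prop_p_value(p1: float, p2: float, n1: int, n2: int) -> float:
    """Two-sided p-value for the difference between two accuracies,
    using the pooled two-proportion z-test (normal approximation)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    # Two-sided tail probability of the standard normal
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

n = 1000  # assumed question count; plug in the real benchmark size
p_value = two_prop_p_value(0.45, 0.413, n, n)
print(round(p_value, 3))
```

At this assumed size the gap is borderline rather than clearly significant, which is consistent with wanting a re-run before updating the leaderboard. Note this also ignores that the two runs share the same question set, so it is a rough check at best.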

[screenshot: re-evaluation results]

We will re-evaluate this model soon to see whether all evaluation results differ significantly. If so, we will update the leaderboard and OpenVLMRecords. You can also find the prediction files corresponding to the 45% average accuracy at https://huggingface.co/datasets/VLMEval/OpenVLMRecords and check whether there are any problems.
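A quick way to locate the divergence is to diff the local predictions against the OpenVLMRecords prediction file question by question. A minimal sketch, assuming both runs are loaded into dicts keyed by question index (the loading step and column names are up to the actual file format, which this does not assume):

```python
def diff_predictions(local: dict, official: dict) -> list:
    """Return the question indices where two prediction runs disagree.

    `local` and `official` map question index -> raw prediction string;
    in practice you would populate them from the downloaded prediction
    files. Comparison is case- and whitespace-insensitive.
    """
    shared = sorted(set(local) & set(official))
    return [
        i for i in shared
        if local[i].strip().lower() != official[i].strip().lower()
    ]

# Toy example standing in for two real prediction files
local = {0: "Yes", 1: "No", 2: "Yes"}
official = {0: "Yes", 1: "Yes", 2: "yes"}
print(diff_predictions(local, official))
```

Inspecting the questions returned by such a diff would show whether the gap comes from genuinely different model outputs (e.g. environment or decoding differences) or from scoring.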
