Not able to reproduce InternVL-8b x Blink result #649

David-BominWei · 2024-12-05T07:29:07Z

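Hi, I tried to reproduce the BLINK evaluation result for InternVL2-8B. The result I got differs from the one on the leaderboard and in the InternVL documentation.
Hello, while trying to reproduce the BLINK results I found a difference of about 0.8%. Could this difference come from using a different ChatGPT version?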

Here are the commands I used for evaluation:

torchrun --nproc-per-node=8 run.py --model InternVL2-8B --data BLINK

This is the result I got:

-------------------------  -------------------
split                      none
Overall                    0.5002630194634403
Art_Style                  0.6495726495726496
Counting                   0.7166666666666667
Forensic_Detection         0.3787878787878788
Functional_Correspondence  0.16923076923076924
IQ_Test                    0.32
Jigsaw                     0.6133333333333333
Multi-view_Reasoning       0.43609022556390975
Object_Localization        0.5655737704918032
Relative_Depth             0.7419354838709677
Relative_Reflectance       0.39552238805970147
Semantic_Correspondence    0.2014388489208633
Spatial_Relation           0.8041958041958042
Visual_Correspondence      0.3430232558139535
Visual_Similarity          0.762962962962963
-------------------------  -------------------

This is the reported result:

-------------------------  -------------------
split                      none
Overall                    0.5086796422935297
Art_Style                  0.7094017094017094
Counting                   0.75
Forensic_Detection         0.3484848484848485
Functional_Correspondence  0.17692307692307693
IQ_Test                    0.30666666666666664
Jigsaw                     0.5466666666666666
Multi-view_Reasoning       0.48872180451127817
Object_Localization        0.5573770491803278
Relative_Depth             0.7419354838709677
Relative_Reflectance       0.39552238805970147
Semantic_Correspondence    0.26618705035971224
Spatial_Relation           0.7972027972027972
Visual_Correspondence      0.36046511627906974
Visual_Similarity          0.7851851851851852
-------------------------  -------------------
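To see where the overall gap comes from, the two tables above can be compared category by category. The following is a minimal sketch, not part of the evaluation toolkit: the scores are copied from the tables and rounded to four decimals, and the helper name `score_deltas` is made up for illustration.

```python
# Scores copied from the two tables above, rounded to four decimals.
CATEGORIES = [
    "Overall", "Art_Style", "Counting", "Forensic_Detection",
    "Functional_Correspondence", "IQ_Test", "Jigsaw", "Multi-view_Reasoning",
    "Object_Localization", "Relative_Depth", "Relative_Reflectance",
    "Semantic_Correspondence", "Spatial_Relation", "Visual_Correspondence",
    "Visual_Similarity",
]

# "This is the result I got"
MINE = dict(zip(CATEGORIES, [
    0.5003, 0.6496, 0.7167, 0.3788, 0.1692, 0.3200, 0.6133, 0.4361,
    0.5656, 0.7419, 0.3955, 0.2014, 0.8042, 0.3430, 0.7630,
]))

# "This is the reported result"
REPORTED = dict(zip(CATEGORIES, [
    0.5087, 0.7094, 0.7500, 0.3485, 0.1769, 0.3067, 0.5467, 0.4887,
    0.5574, 0.7419, 0.3955, 0.2662, 0.7972, 0.3605, 0.7852,
]))


def score_deltas(mine, reported):
    """Return (category, mine - reported) pairs, largest absolute gap first."""
    deltas = [(name, mine[name] - reported[name]) for name in mine]
    return sorted(deltas, key=lambda pair: abs(pair[1]), reverse=True)


if __name__ == "__main__":
    for name, delta in score_deltas(MINE, REPORTED):
        print(f"{name:<26} {delta * 100:+6.2f} points")
```

Sorting by absolute gap shows that several categories move by a few points in each direction, while Relative_Depth and Relative_Reflectance are identical in both runs.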

My transformers version is transformers==4.37.0

My nvcc version is

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0


czczup · 2024-12-09T10:53:44Z

Hello, indeed, I just reproduced the test and achieved a score of 55.0, but the test log from a few months ago shows a score of 50.9. I'm not sure what happened in between.

I often encounter situations where I can't reproduce the old score after a few months. 😭

David-BominWei · 2024-12-12T01:51:07Z

Thank you for your reply. Could you give me some more details about reproducing the 55.0 result (e.g. the command you ran or the actual output from InternVL)? I wonder whether it is a ChatGPT version issue or something else. Thank you very much for your help.

czczup · 2024-12-12T05:44:33Z

InternVL2-8B_BLINK.zip

Hello, here are my evaluation results. One is from a test conducted several months ago (50.9), and the other is from today's test (50.4). A couple of days ago, I got a score of 50.0 during testing, but I have already deleted that log.
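To put the gap in context, the spread between these three runs can be computed directly. This is only a small sketch using the overall scores quoted in this comment; the helper name `summarize` is made up for illustration, and three runs are too few to draw firm conclusions.

```python
from statistics import mean

# Overall BLINK scores quoted in this comment, in percent.
SCORES = {
    "log from several months ago": 50.9,
    "test from today": 50.4,
    "test from a couple of days ago (log deleted)": 50.0,
}


def summarize(values):
    """Return (mean, max - min) for a collection of scores."""
    values = list(values)
    return mean(values), max(values) - min(values)


if __name__ == "__main__":
    average, spread = summarize(SCORES.values())
    print(f"mean = {average:.2f}, spread = {spread:.2f} points")
```

The spread between these runs is about the same size as the roughly 0.8-point difference reported at the top of the thread.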

czczup · 2024-12-12T05:46:07Z

My command is:

torchrun --nproc-per-node=8 run.py --data BLINK --model InternVL2-8B

Also worth noting is that I configured the OpenAI key.
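Since whether a key is configured may affect how answers are judged, it can be worth checking this before comparing runs. The sketch below is an assumption-laden example: the variable name `OPENAI_API_KEY` and the helper `judge_key_configured` are not taken from this thread, so check the toolkit's documentation for the name it actually reads.

```python
import os


def judge_key_configured(env=None, var_name="OPENAI_API_KEY"):
    """Return True if the (assumed) key variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get(var_name, "").strip())


if __name__ == "__main__":
    if judge_key_configured():
        print("OpenAI key found in the environment.")
    else:
        print("No OpenAI key found; results may not be comparable.")
```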

David-BominWei · 2024-12-12T05:56:12Z

Got it, thank you VERY much!

