Commit 5316c5f

[Added/Fixed] added info of NaturalBench in README.md and fixed the metric names for NaturalBench Dataset (#660)

* [Fixed] the metric name in NaturalBench

* [Added] info of NaturalBench in README.md

---------

Co-authored-by: Haodong Duan <[email protected]>
Baiqi-Li and kennymckormick authored Dec 17, 2024
1 parent 5787b85 commit 5316c5f
Showing 2 changed files with 21 additions and 23 deletions.
4 changes: 1 addition & 3 deletions README.md
@@ -28,6 +28,7 @@ English | [简体中文](/docs/zh-CN/README_zh-CN.md) | [日本語](/docs/ja/REA

> We have presented a [**comprehensive survey**](https://arxiv.org/pdf/2411.15296) on the evaluation of large multi-modality models, jointly with [**MME Team**](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models) and [**LMMs-Lab**](https://lmms-lab.github.io) 🔥🔥🔥
- **[2024-12-11]** Supported **[NaturalBench](https://huggingface.co/datasets/BaiqiL/NaturalBench)**, a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with simple questions about natural imagery.
- **[2024-12-02]** Supported [VisOnlyQA](https://github.com/psunlpgroup/VisOnlyQA/), a benchmark for evaluating visual perception capabilities 🔥🔥🔥
- **[2024-11-26]** Supported [Ovis1.6-Gemma2-27B](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-27B), thanks to **[runninglsy](https://github.com/runninglsy)** 🔥🔥🔥
- **[2024-11-25]** Created a new flag `VLMEVALKIT_USE_MODELSCOPE`. By setting this environment variable, you can download the supported video benchmarks from **[modelscope](https://www.modelscope.cn)** (see the sketch after this news list) 🔥🔥🔥
@@ -37,9 +38,6 @@ English | [简体中文](/docs/zh-CN/README_zh-CN.md) | [日本語](/docs/ja/REA
- **[2024-11-21]** Integrated a new config system to enable more flexible evaluation settings. Check the [Document](/docs/en/ConfigSystem.md) or run `python run.py --help` for more details 🔥🔥🔥
- **[2024-11-21]** Supported **[QSpatial](https://andrewliao11.github.io/spatial_prompt/)**, a multimodal benchmark for Quantitative Spatial Reasoning (e.g., determining sizes and distances), thanks to **[andrewliao11](https://github.com/andrewliao11)** for providing the official support 🔥🔥🔥
- **[2024-11-21]** Supported **[MM-Math](https://github.com/kge-sun/mm-math)**, a new multimodal math benchmark comprising ~6K middle-school multimodal math reasoning problems. GPT-4o-20240806 achieves 22.5% accuracy on this benchmark 🔥🔥🔥
- **[2024-11-16]** Supported **[OlympiadBench](https://github.com/OpenBMB/OlympiadBench)**, a new multimodal benchmark comprising olympiad-level math and physics questions 🔥🔥🔥
- **[2024-11-16]** Supported **[WildVision](https://huggingface.co/datasets/WildVision/wildvision-bench)**, a new subjective multimodal benchmark derived from multi-modal arena data 🔥🔥🔥
- **[2024-11-13]** Supported **[MIA-Bench](https://arxiv.org/abs/2407.01509)**, a multimodal instruction-following benchmark 🔥🔥🔥
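
A hedged sketch of how the `VLMEVALKIT_USE_MODELSCOPE` flag from the news entry above might be set before a run. The flag name and the `run.py` entry point appear in the README, but the accepted value (`"1"`) is an assumption rather than documented behavior:

```python
# Hedged sketch, not official VLMEvalKit docs: opt in to downloading the
# supported video benchmarks from ModelScope before launching an evaluation.
import os
import subprocess

# Assumption: a truthy string such as "1" enables the flag; consult the
# VLMEvalKit documentation for the exact accepted values.
os.environ["VLMEVALKIT_USE_MODELSCOPE"] = "1"

# run.py is VLMEvalKit's entry point (see the config-system news entry above).
subprocess.run(["python", "run.py", "--help"], check=True)
```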

## 🏗️ QuickStart

40 changes: 20 additions & 20 deletions vlmeval/dataset/utils/naturalbench.py
@@ -60,15 +60,15 @@ def get_scores(scores):
Returns:
dict: A dictionary containing the calculated scores:
- 'question_score': Average question score
- 'image_score': Average image score
- 'binary_score': Average binary VQA score
- 'group_score': Average group score
- 'Q_Acc': Average question score
- 'I_Acc': Average image score
- 'Acc': Average binary VQA score
- 'G_Acc': Average group score
"""
question_score = 0.0
image_score = 0.0
binary_score = 0.0
group = 0.0
Q_Acc = 0.0
I_Acc = 0.0
Acc = 0.0
G_Acc = 0.0

num_samples = len(scores)

@@ -124,22 +124,22 @@ def calculate_group(result):

if isinstance(scores, dict):
for _, result in scores.items():
question_score += calculate_question_score(result)
image_score += calculate_image_score(result)
binary_score += calculate_binary_score(result)
group += calculate_group(result)
Q_Acc += calculate_question_score(result)
I_Acc += calculate_image_score(result)
Acc += calculate_binary_score(result)
G_Acc += calculate_group(result)
else:
for result in scores:
question_score += calculate_question_score(result)
image_score += calculate_image_score(result)
binary_score += calculate_binary_score(result)
group += calculate_group(result)
Q_Acc += calculate_question_score(result)
I_Acc += calculate_image_score(result)
Acc += calculate_binary_score(result)
G_Acc += calculate_group(result)

results = {
'question_score': question_score / float(num_samples * 2),
'image_score': image_score / float(num_samples * 2),
'binary_score': binary_score / float(num_samples * 4),
'group_score': group / num_samples
'Q_Acc': Q_Acc / float(num_samples * 2),
'I_Acc': I_Acc / float(num_samples * 2),
'Acc': Acc / float(num_samples * 4),
'G_Acc': G_Acc / num_samples
}

return results
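
To make the renaming concrete, here is a minimal, self-contained sketch (not the repository code) of how the four accuracies relate for a single NaturalBench sample. It assumes each sample carries four boolean correctness flags (two questions × two images) and that a question or image only scores when both of its paired answers are correct; the `calculate_*` helper bodies are not shown in this diff, so that pairing is an assumption based on the NaturalBench setup. The denominators mirror the diff above: 2 question scores, 2 image scores, 4 answers, and 1 group per sample.

```python
# Hedged sketch of the renamed NaturalBench metrics; hypothetical helpers,
# not the repository's calculate_* functions.
from typing import Dict, List, Tuple

Sample = Dict[Tuple[int, int], bool]  # (question_idx, image_idx) -> answer correct?

def score_sample(s: Sample) -> Tuple[int, int, int, int]:
    # Q_Acc credit: a question counts only if it is correct on both images.
    q_acc = sum(s[(q, 0)] and s[(q, 1)] for q in (0, 1))
    # I_Acc credit: an image counts only if both questions about it are correct.
    i_acc = sum(s[(0, i)] and s[(1, i)] for i in (0, 1))
    # Acc credit: plain count of correct answers (0..4).
    acc = sum(s.values())
    # G_Acc credit: the group counts only if all four answers are correct.
    g_acc = int(all(s.values()))
    return q_acc, i_acc, acc, g_acc

def get_scores_sketch(samples: List[Sample]) -> Dict[str, float]:
    n = len(samples)
    q = i = a = g = 0
    for s in samples:
        dq, di, da, dg = score_sample(s)
        q, i, a, g = q + dq, i + di, a + da, g + dg
    return {
        'Q_Acc': q / float(n * 2),  # two questions per sample
        'I_Acc': i / float(n * 2),  # two images per sample
        'Acc':   a / float(n * 4),  # four answers per sample
        'G_Acc': g / float(n),      # one group per sample
    }

# Example: one sample with three of four answers correct.
sample = {(0, 0): True, (0, 1): True, (1, 0): True, (1, 1): False}
print(get_scores_sketch([sample]))
# -> {'Q_Acc': 0.5, 'I_Acc': 0.5, 'Acc': 0.75, 'G_Acc': 0.0}
```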
