Commit aa9f50e

[Refactor] hide the video dataset related args (#675)
* [Refactor] merge the video dataset related args into the config json and each dataset
* fix the concat dataset problem
* update build_model_from_config with an empty dict
* add a supported_video_datasets function for quick start
* fix the result_file_name problem
* fix lint
* update the ConfigSystem doc and the Quickstart doc
1 parent 2fd7140 commit aa9f50e

16 files changed: +332 −290 lines

docs/en/ConfigSystem.md

+15 −5

````diff
@@ -1,6 +1,6 @@
 # Config System
 
-By default, VLMEvalKit launches the evaluation by setting the model name(s) (defined in `/vlmeval/config.py`) and dataset name(s) (defined in `vlmeval/dataset/__init__.py`) in the `run.py` script with the `--model` and `--data` arguments. Such approach is simple and efficient in most scenarios, however, it may not be flexible enough when the user wants to evaluate multiple models / datasets with different settings.
+By default, VLMEvalKit launches the evaluation by setting the model name(s) (defined in `/vlmeval/config.py`) and dataset name(s) (defined in `vlmeval/dataset/__init__.py` or `vlmeval/dataset/video_dataset_config.py`) in the `run.py` script with the `--model` and `--data` arguments. Such approach is simple and efficient in most scenarios, however, it may not be flexible enough when the user wants to evaluate multiple models / datasets with different settings.
 
 To address this, VLMEvalKit provides a more flexible config system. The user can specify the model and dataset settings in a json file, and pass the path to the config file to the `run.py` script with the `--config` argument. Here is a sample config json:
 
@@ -18,7 +18,8 @@ To address this, VLMEvalKit provides a more flexible config system. The user can
             "model": "gpt-4o-2024-08-06",
             "temperature": 1.0,
             "img_detail": "low"
-        }
+        },
+        "GPT4o_20241120": {}
     },
     "data": {
         "MME-RealWorld-Lite": {
@@ -28,7 +29,14 @@ To address this, VLMEvalKit provides a more flexible config system. The user can
         "MMBench_DEV_EN_V11": {
             "class": "ImageMCQDataset",
             "dataset": "MMBench_DEV_EN_V11"
-        }
+        },
+        "MMBench_Video_8frame_nopack": {},
+        "Video-MME_16frame_subs": {
+            "class": "VideoMME",
+            "dataset": "Video-MME",
+            "nframe": 16,
+            "use_subtitle": true
+        }
     }
 }
 ```
@@ -39,10 +47,11 @@ Explanation of the config json:
 2. For items in `model`, the value is a dictionary containing the following keys:
    - `class`: The class name of the model, which should be a class name defined in `vlmeval/vlm/__init__.py` (open-source models) or `vlmeval/api/__init__.py` (API models).
    - Other kwargs: Other kwargs are model-specific parameters, please refer to the definition of the model class for detailed usage. For example, `model`, `temperature`, `img_detail` are arguments of the `GPT4V` class. It's noteworthy that the `model` argument is required by most model classes.
+   - Tip: A model defined in `supported_VLM` in `vlmeval/config.py` can be used as a shortcut; for example, `GPT4o_20241120: {}` is equivalent to `GPT4o_20241120: {'class': 'GPT4V', 'model': 'gpt-4o-2024-11-20', 'temperature': 0, 'img_size': -1, 'img_detail': 'high', 'retry': 10, 'verbose': False}`.
 3. For the dictionary `data`, we suggest users to use the official dataset name as the key (or part of the key), since we frequently determine the post-processing / judging settings based on the dataset name. For items in `data`, the value is a dictionary containing the following keys:
    - `class`: The class name of the dataset, which should be a class name defined in `vlmeval/dataset/__init__.py`.
-   - Other kwargs: Other kwargs are dataset-specific parameters, please refer to the definition of the dataset class for detailed usage. Typically, the `dataset` argument is required by most dataset classes.
-
+   - Other kwargs: Other kwargs are dataset-specific parameters, please refer to the definition of the dataset class for detailed usage. Typically, the `dataset` argument is required by most dataset classes. It's noteworthy that the `nframe` argument or the `fps` argument is required by most video dataset classes.
+   - Tip: A dataset defined in `supported_video_datasets` in `vlmeval/dataset/video_dataset_config.py` can be used as a shortcut; for example, `MMBench_Video_8frame_nopack: {}` is equivalent to `MMBench_Video_8frame_nopack: {'class': 'MMBenchVideo', 'dataset': 'MMBench-Video', 'nframe': 8, 'pack': False}`.
 Saving the example config json to `config.json`, you can launch the evaluation by:
 
 ```bash
@@ -55,3 +64,4 @@ That will generate the following output files under the working directory `$WORK
 - `$WORK_DIR/GPT4o_20240806_T10_Low/GPT4o_20240806_T10_Low_MME-RealWorld-Lite*`
 - `$WORK_DIR/GPT4o_20240806_T00_HIGH/GPT4o_20240806_T00_HIGH_MMBench_DEV_EN_V11*`
 - `$WORK_DIR/GPT4o_20240806_T10_Low/GPT4o_20240806_T10_Low_MMBench_DEV_EN_V11*`
+...
````
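The shortcut mechanism described in the diff above (an empty dict expands to a predefined setting; otherwise `class` selects the type and the remaining keys become constructor kwargs) can be sketched roughly as follows. All class names and registry contents below are illustrative stand-ins, not VLMEvalKit's actual code; the real tables are `supported_VLM` in `vlmeval/config.py` and `supported_video_datasets` in `vlmeval/dataset/video_dataset_config.py`.

```python
import json

# Illustrative stand-in for the supported_video_datasets registry.
SHORTCUTS = {
    "MMBench_Video_8frame_nopack": {
        "class": "MMBenchVideo", "dataset": "MMBench-Video",
        "nframe": 8, "pack": False,
    },
}

class MMBenchVideo:
    """Dummy dataset class standing in for the real one."""
    def __init__(self, dataset, nframe=8, pack=False):
        self.dataset, self.nframe, self.pack = dataset, nframe, pack

CLASSES = {"MMBenchVideo": MMBenchVideo}

def build_from_entry(name, cfg):
    # An empty dict is a shortcut: expand it from the registry first.
    cfg = dict(cfg or SHORTCUTS[name])
    cls = CLASSES[cfg.pop("class")]   # `class` selects the type ...
    return cls(**cfg)                 # ... and the rest become kwargs

config = json.loads('{"data": {"MMBench_Video_8frame_nopack": {}}}')
datasets = {k: build_from_entry(k, v) for k, v in config["data"].items()}
```

The same expansion applies to `model` entries such as `"GPT4o_20241120": {}`, with the model registry supplying the defaults.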

docs/en/Quickstart.md

+4 −6

````diff
@@ -68,8 +68,6 @@ We use `run.py` for evaluation. To use the script, you can use `$VLMEvalKit/run.
 - `--mode (str, default to 'all', choices are ['all', 'infer'])`: When `mode` set to "all", will perform both inference and evaluation; when set to "infer", will only perform the inference.
 - `--nproc (int, default to 4)`: The number of threads for OpenAI API calling.
 - `--work-dir (str, default to '.')`: The directory to save evaluation results.
-- `--nframe (int, default to 8)`: The number of frames to sample from a video, only applicable to the evaluation of video benchmarks.
-- `--pack (bool, store_true)`: A video may associate with multiple questions, if `pack==True`, will ask all questions for a video in a single query.
 
 **Command for Evaluating Image Benchmarks**
 
@@ -99,10 +97,10 @@ torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
 # When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
 # That is recommended for evaluating very large VLMs (like IDEFICS-80B-Instruct).
 
-# IDEFICS2-8B on MMBench-Video, with 8 frames as inputs and vanilla evaluation. On a node with 8 GPUs.
-torchrun --nproc-per-node=8 run.py --data MMBench-Video --model idefics2_8b --nframe 8
-# GPT-4o (API model) on MMBench-Video, with 16 frames as inputs and pack evaluation (all questions of a video in a single query).
-python run.py --data MMBench-Video --model GPT4o --nframe 16 --pack
+# IDEFICS2-8B on MMBench-Video, with 8 frames as inputs and vanilla evaluation. On a node with 8 GPUs. MMBench_Video_8frame_nopack is a dataset setting defined in `vlmeval/dataset/video_dataset_config.py`.
+torchrun --nproc-per-node=8 run.py --data MMBench_Video_8frame_nopack --model idefics2_8b
+# GPT-4o (API model) on MMBench-Video, with 1 frame per second as inputs and pack evaluation (all questions of a video in a single query).
+python run.py --data MMBench_Video_1fps_pack --model GPT4o
 ```
 
 The evaluation results will be printed as logs. Besides, **Result Files** will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.
````
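As the Quickstart diff shows, frame-sampling options now ride on the dataset name (e.g. `MMBench_Video_8frame_nopack`, `MMBench_Video_1fps_pack`) instead of the removed `--nframe` / `--pack` flags. A minimal sketch of how such a name registry could work, using `functools.partial`; the `VideoDataset` class and its signature here are hypothetical, not VLMEvalKit's actual API:

```python
from functools import partial

class VideoDataset:
    """Hypothetical video dataset; exactly one of nframe / fps must be given."""
    def __init__(self, dataset, nframe=None, fps=None, pack=False):
        if (nframe is None) == (fps is None):
            raise ValueError("set exactly one of nframe / fps")
        self.dataset, self.nframe, self.fps, self.pack = dataset, nframe, fps, pack

# Each --data name pins its own sampling settings, so --nframe / --pack
# no longer need to be passed on the command line.
supported_video_datasets = {
    "MMBench_Video_8frame_nopack": partial(VideoDataset, dataset="MMBench-Video", nframe=8),
    "MMBench_Video_1fps_pack": partial(VideoDataset, dataset="MMBench-Video", fps=1, pack=True),
}

def resolve_data(name):
    """Map a --data value to a ready-to-use dataset instance."""
    return supported_video_datasets[name]()

ds = resolve_data("MMBench_Video_1fps_pack")
```

Keeping the settings in a registry like this makes each named configuration reproducible: the same `--data` string always implies the same sampling parameters.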

docs/zh-CN/ConfigSystem.md

+15 −5

````diff
@@ -1,7 +1,7 @@
 
 # Config System
 
-By default, VLMEvalKit launches the evaluation by setting the model name (defined in `/vlmeval/config.py`) and the dataset name (defined in `vlmeval/dataset/__init__.py`) with the `--model` and `--data` arguments of the `run.py` script. This approach is simple and efficient in most cases, but it may not be flexible enough when the user wants to evaluate multiple models / datasets with different settings.
+By default, VLMEvalKit launches the evaluation by setting the model name (defined in `/vlmeval/config.py`) and the dataset name (defined in `vlmeval/dataset/__init__.py` or `vlmeval/dataset/video_dataset_config.py`) with the `--model` and `--data` arguments of the `run.py` script. This approach is simple and efficient in most cases, but it may not be flexible enough when the user wants to evaluate multiple models / datasets with different settings.
 
 To address this, VLMEvalKit provides a more flexible config system. The user can specify the model and dataset settings in a json file and pass the path of the config file to the `run.py` script via the `--config` argument. Here is a sample config json:
 
@@ -19,7 +19,8 @@
             "model": "gpt-4o-2024-08-06",
             "temperature": 1.0,
             "img_detail": "low"
-        }
+        },
+        "GPT4o_20241120": {}
     },
     "data": {
         "MME-RealWorld-Lite": {
@@ -29,7 +30,14 @@
         "MMBench_DEV_EN_V11": {
             "class": "ImageMCQDataset",
             "dataset": "MMBench_DEV_EN_V11"
-        }
+        },
+        "MMBench_Video_8frame_nopack": {},
+        "Video-MME_16frame_subs": {
+            "class": "VideoMME",
+            "dataset": "Video-MME",
+            "nframe": 16,
+            "use_subtitle": true
+        }
     }
 }
 ```
@@ -40,9 +48,11 @@
 2. For items in `model`, the value is a dictionary containing the following keys:
    - `class`: The class name of the model, which should be a class name defined in `vlmeval/vlm/__init__.py` (open-source models) or `vlmeval/api/__init__.py` (API models).
    - Other kwargs: Other kwargs are model-specific parameters; please refer to the definition of the model class for detailed usage. For example, `model`, `temperature`, and `img_detail` are parameters of the `GPT4V` class. Note that the `model` parameter is required by most model classes.
+   - Tip: A model already defined in the variable `supported_VLM` in `vlmeval/config.py` can be used as a key of `model` without filling in the value. For example, `GPT4o_20240806_T00_HIGH: {}` is equivalent to `GPT4o_20240806_T00_HIGH: {'class': 'GPT4V', 'model': 'gpt-4o-2024-08-06', 'temperature': 0, 'img_size': -1, 'img_detail': 'high', 'retry': 10, 'verbose': False}`
 3. For the dictionary `data`, we recommend using the official dataset name as the key (or part of the key), since the post-processing / judging settings are often determined by the dataset name. For items in `data`, the value is a dictionary containing the following keys:
    - `class`: The class name of the dataset, which should be a class name defined in `vlmeval/dataset/__init__.py`.
-   - Other kwargs: Other kwargs are dataset-specific parameters; please refer to the definition of the dataset class for detailed usage. Typically, the `dataset` parameter is required by most dataset classes.
+   - Other kwargs: Other kwargs are dataset-specific parameters; please refer to the definition of the dataset class for detailed usage. Typically, the `dataset` parameter is required by most dataset classes. Most video dataset classes also require the `nframe` or `fps` parameter.
+   - Tip: A dataset already defined in the variable `supported_video_datasets` in `vlmeval/dataset/video_dataset_config.py` can be used as a key of `data` without filling in the value. For example, `MMBench_Video_8frame_nopack: {}` is equivalent to `MMBench_Video_8frame_nopack: {'class': 'MMBenchVideo', 'dataset': 'MMBench-Video', 'nframe': 8, 'pack': False}`
 
 Saving the sample config json as `config.json`, you can launch the evaluation with:
 
@@ -56,4 +66,4 @@ python run.py --config config.json
 - `$WORK_DIR/GPT4o_20240806_T10_Low/GPT4o_20240806_T10_Low_MME-RealWorld-Lite*`
 - `$WORK_DIR/GPT4o_20240806_T00_HIGH/GPT4o_20240806_T00_HIGH_MMBench_DEV_EN_V11*`
 - `$WORK_DIR/GPT4o_20240806_T10_Low/GPT4o_20240806_T10_Low_MMBench_DEV_EN_V11*`
--
+......
````
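A config json of the shape documented above is easy to get subtly wrong (a missing `class` or `dataset` key, a non-empty entry that was meant to be a shortcut). A quick sanity check like the following can catch such slips before launching; this is a hedged sketch, not VLMEvalKit's own validation logic.

```python
import json

def check_config(cfg):
    """Collect obvious problems in a config dict of the documented shape."""
    problems = []
    for section in ("model", "data"):
        for name, entry in cfg.get(section, {}).items():
            if entry == {}:
                continue  # shortcut form, resolved from a registry later
            if "class" not in entry:
                problems.append(f"{section}/{name}: missing 'class'")
            if section == "data" and "dataset" not in entry:
                problems.append(f"{section}/{name}: missing 'dataset'")
    return problems

sample = """
{
  "model": {"GPT4o_20241120": {}},
  "data": {"Video-MME_16frame_subs": {"class": "VideoMME", "dataset": "Video-MME",
                                      "nframe": 16, "use_subtitle": true}}
}
"""
issues = check_config(json.loads(sample))
```

Running `json.loads` on the file also catches syntax errors such as trailing commas, which strict JSON does not allow.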

docs/zh-CN/Quickstart.md

+4 −6

````diff
@@ -67,8 +67,6 @@ pip install -e .
 - `--mode (str, default to 'all', choices are ['all', 'infer'])`: When `mode` is set to "all", both inference and evaluation are performed; when set to "infer", only inference is performed
 - `--nproc (int, default to 4)`: The number of threads for API calling
 - `--work-dir (str, default to '.')`: The directory to save evaluation results
-- `--nframe (int, default to 8)`: The number of frames sampled from a video, only applicable to video benchmarks
-- `--pack (bool, store_true)`: A video may be associated with multiple questions; if `pack==True`, all questions of a video are asked in a single query
 
 **Command for Evaluating Image Benchmarks**
 
@@ -98,10 +96,10 @@ torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
 # When running with `python`, only one VLM is instantiated, and it may use multiple GPUs.
 # This is recommended for evaluating very large VLMs (such as IDEFICS-80B-Instruct).
 
-# Evaluate IDEFICS2-8B on MMBench-Video, sampling 8 frames per video as input, without pack-mode evaluation
-torchrun --nproc-per-node=8 run.py --data MMBench-Video --model idefics2_8b --nframe 8
-# Evaluate GPT-4o (an API model) on MMBench-Video, sampling 16 frames per video as input, with pack-mode evaluation
-python run.py --data MMBench-Video --model GPT4o --nframe 16 --pack
+# Evaluate IDEFICS2-8B on MMBench-Video, sampling 8 frames per video as input, without pack-mode evaluation. MMBench_Video_8frame_nopack is a dataset setting defined in `vlmeval/dataset/video_dataset_config.py`.
+torchrun --nproc-per-node=8 run.py --data MMBench_Video_8frame_nopack --model idefics2_8b
+# Evaluate GPT-4o (an API model) on MMBench-Video, sampling one frame per second as input, with pack-mode evaluation
+python run.py --data MMBench_Video_1fps_pack --model GPT4o
 ```
 
 The evaluation results will be printed as logs. In addition, **result files** will be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.
````
