Commit db460d1

update
1 parent ac66926 commit db460d1

3 files changed (+78, -62 lines changed)

Custom_Benchmark_and_Model.md renamed to Development.md

+16
@@ -2,6 +2,8 @@

## Implement a new benchmark

Example PR: **Add OCRBench** ([#91](https://github.com/open-compass/VLMEvalKit/pull/91/files))

Currently, we organize a benchmark as a single TSV file. During inference, the data file will be automatically downloaded to `$LMUData` (the default path is `$HOME/LMUData` if not set explicitly). All existing benchmark TSV files are handled by `TSVDataset`, implemented in `vlmeval/utils/dataset_config.py`.

| Dataset Name \ Fields | index | image | image_path | question | hint | multi-choice<br>options | answer | category | l2-category | split |
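
For illustration only, one way to assemble such a TSV with pandas is sketched below. It assumes the `image` field stores base64-encoded image data; the file names and field values are hypothetical and should be cross-checked against an existing benchmark TSV under `$LMUData`.

```python
# A hypothetical sketch: build a benchmark TSV with a subset of the fields above.
import base64

import pandas as pd


def encode_image(path: str) -> str:
    """Read a local image and return it as a base64 string (assumed storage format)."""
    with open(path, 'rb') as fin:
        return base64.b64encode(fin.read()).decode('utf-8')


records = [{
    'index': 0,
    'image': encode_image('example.jpg'),  # hypothetical local image file
    'question': 'How many apples are there in the image?',
    'answer': 'two',
    'category': 'counting',
    'split': 'dev',
}]
pd.DataFrame(records).to_csv('MyBenchmark.tsv', sep='\t', index=False)
```
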
@@ -31,6 +33,20 @@ Besides, your dataset class **should implement the method `build_prompt(self, li

## Implement a new model

Example PR: **Support Monkey** ([#45](https://github.com/open-compass/VLMEvalKit/pull/45/files))

All existing models are implemented in `vlmeval/vlm`. For a minimal model, your model class **should implement the method** `generate(image_path, prompt, dataset=None)`. In this function, you feed the image and the prompt to your VLM and return the VLM prediction as a string. The optional argument `dataset` can be used as a flag for the model to switch among different inference strategies.

Besides, your model can support custom prompt building by implementing an optional method `build_prompt(line, dataset=None)`. In this function, `line` is a dictionary that contains the necessary information of a data sample, while `dataset` can be used as a flag for the model to switch among different prompt-building strategies.

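
For illustration, a minimal model class might look like the sketch below. The class name `DummyVLM`, the canned answer, and the prompt format are placeholders rather than VLMEvalKit APIs; a real implementation would load an actual VLM and run inference inside `generate`.

```python
from PIL import Image


class DummyVLM:
    """A sketch of a minimal model class (illustrative only)."""

    def generate(self, image_path, prompt, dataset=None):
        # Feed the image and the prompt to your VLM here and return its
        # prediction as a string. `dataset` can be used to switch among
        # inference strategies for different benchmarks.
        image = Image.open(image_path).convert('RGB')
        _ = (image, prompt)  # a real model would run inference here
        return 'A placeholder answer.'

    def build_prompt(self, line, dataset=None):
        # Optional: build a custom prompt from a data sample. `line` is a
        # dictionary with fields such as 'question' and (optionally) 'hint'.
        question = line['question']
        hint = line.get('hint')
        return f'Hint: {hint}\n{question}' if hint else question
```
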
## Contribute to VLMEvalKit

If you want to contribute code to **VLMEvalKit**, please run the pre-commit check before you submit a PR. That helps keep the code tidy.

```bash
# Under the directory of VLMEvalKit, install the pre-commit hook:
pip install pre-commit
pre-commit install
pre-commit run --all-files
# Then you can commit your code.
```

Quickstart.md

+58
@@ -0,0 +1,58 @@
# Quickstart

Before running the evaluation script, you need to **configure** the VLMs and set the model_paths properly.

After that, you can use the single script `run.py` to run inference and evaluation for multiple VLMs and benchmarks at the same time.

## Step0. Installation

```bash
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
```

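As an optional sanity check (not part of the official instructions), you can verify that the editable install succeeded by importing the package:

```python
# Verify that the package installed above can be imported.
import vlmeval

print(vlmeval.__file__)
```
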
## Step1. Configuration

**VLM Configuration**: All VLMs are configured in `vlmeval/config.py`. For some VLMs, you need to configure the code root (MiniGPT-4, PandaGPT, etc.) or the model_weight root (LLaVA-v1-7B, etc.) before conducting the evaluation. During evaluation, you should use the model name specified in `supported_VLM` in `vlmeval/config.py` to select the VLM. For MiniGPT-4 and InstructBLIP, you also need to modify the config files in `vlmeval/vlm/misc` to configure the LLM path and ckpt path.

The following VLMs require the configuration step:

**Code Preparation & Installation**: InstructBLIP ([LAVIS](https://github.com/salesforce/LAVIS)), LLaVA ([LLaVA](https://github.com/haotian-liu/LLaVA)), MiniGPT-4 ([MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4)), mPLUG-Owl2 ([mPLUG-Owl2](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2)), OpenFlamingo-v2 ([OpenFlamingo](https://github.com/mlfoundations/open_flamingo)), PandaGPT-13B ([PandaGPT](https://github.com/yxuansu/PandaGPT)), TransCore-M ([TransCore-M](https://github.com/PCIResearch/TransCore-M)).

**Manual Weight Preparation & Configuration**: InstructBLIP, LLaVA-v1-7B, MiniGPT-4, PandaGPT-13B

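If you are unsure which model names are available, one quick way to list them is shown below (a sketch that assumes `supported_VLM` is a plain dictionary keyed by model name, as described above):

```python
# Print all model names defined in supported_VLM (usable for model selection).
from vlmeval.config import supported_VLM

print(sorted(supported_VLM.keys()))
```
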
## Step2. Evaluation

We use `run.py` for evaluation. You can either run `$VLMEvalKit/run.py` directly or create a soft link to the script so that you can use it from anywhere:

**Arguments**

- `--data (list[str])`: Set the dataset names that are supported in VLMEvalKit (defined in `vlmeval/utils/dataset_config.py`).
- `--model (list[str])`: Set the VLM names that are supported in VLMEvalKit (defined in `supported_VLM` in `vlmeval/config.py`).
- `--mode (str, default to 'all', choices are ['all', 'infer'])`: When `mode` is set to "all", both inference and evaluation are performed; when set to "infer", only inference is performed.
- `--nproc (int, default to 4)`: The number of threads for OpenAI API calling.

**Command**

You can run the script with `python` or `torchrun`:

```bash
# When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
# That is recommended for evaluating very large VLMs (like IDEFICS-80B-Instruct).

# IDEFICS-80B-Instruct on MMBench_DEV_EN, MME, and SEEDBench_IMG; inference and evaluation
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_80b_instruct --verbose
# IDEFICS-80B-Instruct on MMBench_DEV_EN, MME, and SEEDBench_IMG; inference only
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_80b_instruct --verbose --mode infer

# When running with `torchrun`, one VLM instance is instantiated on each GPU, which can speed up inference.
# However, that is only suitable for VLMs that consume small amounts of GPU memory.

# IDEFICS-9B-Instruct, Qwen-VL-Chat, and mPLUG-Owl2 on MMBench_DEV_EN, MME, and SEEDBench_IMG, on a node with 8 GPUs; inference and evaluation
torchrun --nproc-per-node=8 run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_9b_instruct qwen_chat mPLUG-Owl2 --verbose
# Qwen-VL-Chat on MME, on a node with 2 GPUs; inference and evaluation
torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
```

The evaluation results will be printed as logs. Besides, **Result Files** will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.
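
For example, a small sketch like the following could collect the generated metric files afterwards, assuming the `{model_name}/*.csv` layout described above:

```python
# Gather the evaluated-metric CSV files produced under the working directory.
from glob import glob

import pandas as pd

for csv_path in sorted(glob('*/*.csv')):
    print(csv_path)
    print(pd.read_csv(csv_path))
```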

README.md

+4 -62
@@ -5,7 +5,7 @@
<a href="https://rank.opencompass.org.cn/leaderboard-multimodal">🏆 Leaderboard </a> •
<a href="#-datasets-models-and-evaluation-results">📊Datasets & Models </a> •
<a href="#%EF%B8%8F-quickstart">🏗️Quickstart </a> •
-<a href="#%EF%B8%8F-custom-benchmark-or-vlm">🛠️Support New </a> •
+<a href="#%EF%B8%8F-development-guide">🛠️Development </a> •
<a href="#-the-goal-of-vlmevalkit">🎯Goal </a> •
<a href="#%EF%B8%8F-citation">🖊️Citation </a>
</div>
@@ -100,69 +100,11 @@ print(ret) # There are two apples in the provided images.

## 🏗️ QuickStart

-Before running the evaluation script, you need to **configure** the VLMs and set the model_paths properly.
+See [QuickStart](/QuickStart.md) for a quick start guide.

-After that, you can use a single script `run.py` to inference and evaluate multiple VLMs and benchmarks at a same time.
+## 🛠️ Development Guide

[Removed: the former "### Step0. Installation", "### Step1. Configuration", and "### Step2. Evaluation" subsections, whose content now appears verbatim in Quickstart.md above.]
-## 🛠️ Custom Benchmark or VLM
-
-To implement a custom benchmark or VLM in **VLMEvalKit**, please refer to [Custom_Benchmark_and_Model](/Custom_Benchmark_and_Model.md). Example PRs to follow:
-
-- [**New Model**] Support Monkey ([#45](https://github.com/open-compass/VLMEvalKit/pull/45/files))
-- [**New Benchmark**] Support AI2D ([#51](https://github.com/open-compass/VLMEvalKit/pull/51/files))
+To develop custom benchmarks, VLMs, or simply contribute other code to **VLMEvalKit**, please refer to [Development_Guide](/Development.md).

## 🎯 The Goal of VLMEvalKit
