Merge branch 'EvolvingLMMs-Lab:main' into main
99Franklin authored Mar 3, 2025
2 parents a628d51 + 9310d89 commit 5cb2d6c
Showing 77 changed files with 2,962 additions and 151 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/lint.yml
@@ -7,6 +7,9 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: true
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
3 changes: 2 additions & 1 deletion .gitignore
@@ -42,4 +42,5 @@ VATEX/
lmms_eval/tasks/vatex/__pycache__/utils.cpython-310.pyc
lmms_eval/tasks/mlvu/__pycache__/utils.cpython-310.pyc

scripts/
scripts/
.env
15 changes: 7 additions & 8 deletions README.md
@@ -20,18 +20,14 @@

## Annoucement

- [2025-02] 🚀🚀 We have integrated [`vllm`](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/544) into our models, enabling accelerated evaluation for both multimodal and language models. Additionally, we have incorporated [`openai_compatible`](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/546) to support the evaluation of any API-based model that follows the OpenAI API format. Check the usages [here](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/miscs/model_dryruns).
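
  As a rough sketch of how the new `openai_compatible` entry point can be exercised, a command along the following lines should work. The `model_args` key, the credential variable, and the task name are assumptions for illustration rather than details taken from this commit; the maintained dry-run examples live under `miscs/model_dryruns`.

  ```bash
  # Illustrative sketch only: evaluate an API-served model through the
  # `openai_compatible` backend. `model_version`, the task choice, and the
  # credential variable are assumptions; see miscs/model_dryruns in the
  # repository for the maintained examples.
  export OPENAI_API_KEY="sk-..."   # assumed credential variable

  python3 -m lmms_eval \
      --model openai_compatible \
      --model_args model_version=gpt-4o-mini \
      --tasks mme \
      --batch_size 1 \
      --log_samples \
      --output_path ./logs/
  ```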

- [2025-01] 🎓🎓 We have released our new benchmark: [Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos](https://arxiv.org/abs/2501.13826). Please refer to the [project page](https://videommmu.github.io/) for more details.

- [2024-12] 🎉🎉 We have presented [MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs](https://arxiv.org/pdf/2411.15296), jointly with [MME Team](https://github.com/BradyFU/Video-MME) and [OpenCompass Team](https://github.com/open-compass).

- [2024-11] 🔈🔊 The `lmms-eval/v0.3.0` has been upgraded to support audio evaluations for audio models like Qwen2-Audio and Gemini-Audio across tasks such as AIR-Bench, Clotho-AQA, LibriSpeech, and more. Please refer to the [blog](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/lmms-eval-0.3.md) for more details!

- [2024-07] 🎉🎉 We have released the [technical report](https://arxiv.org/abs/2407.12772) and [LiveBench](https://huggingface.co/spaces/lmms-lab/LiveBench)!

- [2024-06] 🎬🎬 The `lmms-eval/v0.2.0` has been upgraded to support video evaluations for video models like LLaVA-NeXT Video and Gemini 1.5 Pro across tasks such as EgoSchema, PerceptionTest, VideoMME, and more. Please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.2/) for more details!

- [2024-03] 📝📝 We have released the first version of `lmms-eval`, please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.1/) for more details!

<details>
<summary>We warmly welcome contributions from the open-source community! Below is a chronological list of recent tasks, models, and features added by our amazing contributors. </summary>

@@ -42,6 +38,9 @@
- [2024-09] ⚙️️⚙️️️️ We upgrade `lmms-eval` to `0.2.3` with more tasks and features. We support a compact set of language tasks evaluations (code credit to [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)), and we remove the registration logic at start (for all models and tasks) to reduce the overhead. Now `lmms-eval` only launches necessary tasks/models. Please check the [release notes](https://github.com/EvolvingLMMs-Lab/lmms-eval/releases/tag/v0.2.3) for more details.
- [2024-08] 🎉🎉 We welcome the new model [LLaVA-OneVision](https://huggingface.co/papers/2408.03326), [Mantis](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/162), new tasks [MVBench](https://huggingface.co/datasets/OpenGVLab/MVBench), [LongVideoBench](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/117), [MMStar](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/158). We provide new feature of SGlang Runtime API for llava-onevision model, please refer the [doc](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/commands.md) for inference acceleration
- [2024-07] 👨‍💻👨‍💻 The `lmms-eval/v0.2.1` has been upgraded to support more models, including [LongVA](https://github.com/EvolvingLMMs-Lab/LongVA), [InternVL-2](https://github.com/OpenGVLab/InternVL), [VILA](https://github.com/NVlabs/VILA), and many more evaluation tasks, e.g. [Details Captions](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/136), [MLVU](https://arxiv.org/abs/2406.04264), [WildVision-Bench](https://huggingface.co/datasets/WildVision/wildvision-arena-data), [VITATECS](https://github.com/lscpku/VITATECS) and [LLaVA-Interleave-Bench](https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/).
- [2024-07] 🎉🎉 We have released the [technical report](https://arxiv.org/abs/2407.12772) and [LiveBench](https://huggingface.co/spaces/lmms-lab/LiveBench)!
- [2024-06] 🎬🎬 The `lmms-eval/v0.2.0` has been upgraded to support video evaluations for video models like LLaVA-NeXT Video and Gemini 1.5 Pro across tasks such as EgoSchema, PerceptionTest, VideoMME, and more. Please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.2/) for more details!
- [2024-03] 📝📝 We have released the first version of `lmms-eval`, please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.1/) for more details!

</details>

@@ -194,8 +193,8 @@ python3 -m accelerate.commands.launch --num_processes=8 -m lmms_eval --config ./
**Evaluation of video model (llava-next-video-32B)**
```bash
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model llavavid \
--model_args pretrained=lmms-lab/LLaVA-NeXT-Video-32B-Qwen,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=32mm_spatial_pool_mode=average,mm_newline_position=grid,mm_resampler_location=after \
--model llava_vid \
--model_args pretrained=lmms-lab/LLaVA-NeXT-Video-32B-Qwen,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=32,mm_spatial_pool_mode=average,mm_newline_position=grid,mm_resampler_location=after \
--tasks videomme \
--batch_size 1 \
--log_samples \
24 changes: 14 additions & 10 deletions lmms_eval/models/__init__.py
@@ -11,6 +11,7 @@
logger.add(sys.stdout, level="WARNING")

AVAILABLE_MODELS = {
"aria": "Aria",
"auroracap": "AuroraCap",
"batch_gpt4": "BatchGPT4",
"claude": "Claude",
@@ -21,44 +21,47 @@
"gpt4v": "GPT4V",
"idefics2": "Idefics2",
"instructblip": "InstructBLIP",
"internvideo2": "InternVideo2",
"internvl": "InternVLChat",
"internvl2": "InternVL2",
"llama_vid": "LLaMAVid",
"llama_vision": "LlamaVision",
"llava": "Llava",
"llava_hf": "LlavaHf",
"llava_onevision": "Llava_OneVision",
"llava_onevision_moviechat": "Llava_OneVision_MovieChat",
"llava_sglang": "LlavaSglang",
"llava_vid": "LlavaVid",
"slime": "Slime",
"longva": "LongVA",
"mantis": "Mantis",
"minicpm_v": "MiniCPM_V",
"minimonkey": "MiniMonkey",
"moviechat": "MovieChat",
"mplug_owl_video": "mplug_Owl",
"ola": "Ola",
"openai_compatible": "OpenAICompatible",
"oryx": "Oryx",
"phi3v": "Phi3v",
"qwen_vl": "Qwen_VL",
"qwen2_vl": "Qwen2_VL",
"qwen2_5_vl": "Qwen2_5_VL",
"qwen2_5_vl_interleave": "Qwen2_5_VL_Interleave",
"qwen2_audio": "Qwen2_Audio",
"qwen2_vl": "Qwen2_VL",
"qwen_vl": "Qwen_VL",
"qwen_vl_api": "Qwen_VL_API",
"reka": "Reka",
"ross": "Ross",
"slime": "Slime",
"srt_api": "SRT_API",
"tinyllava": "TinyLlava",
"videoChatGPT": "VideoChatGPT",
"videochat2": "VideoChat2",
"video_llava": "VideoLLaVA",
"vila": "VILA",
"vita": "VITA",
"vllm": "VLLM",
"xcomposer2_4KHD": "XComposer2_4KHD",
"internvideo2": "InternVideo2",
"xcomposer2d5": "XComposer2D5",
"oryx": "Oryx",
"videochat2": "VideoChat2",
"llama_vision": "LlamaVision",
"aria": "Aria",
"ross": "Ross",
"vita": "VITA",
"egogpt": "EgoGPT",
}
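
For orientation, the keys of `AVAILABLE_MODELS` are the identifiers accepted by the CLI's `--model` flag, and each value names the implementing class. A hypothetical invocation of one of the entries listed above might look like the sketch below; the `pretrained` checkpoint, the `model_args` spelling, and the task are illustrative assumptions, not part of this commit.

```bash
# Hypothetical sketch: select a backend by its AVAILABLE_MODELS key.
# The checkpoint, model_args key, and task below are assumptions for illustration.
accelerate launch --num_processes 8 -m lmms_eval \
    --model qwen2_5_vl \
    --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```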


6 changes: 3 additions & 3 deletions lmms_eval/models/aria.py
@@ -106,12 +106,12 @@ def __init__(
elif accelerator.num_processes == 1 and device_map == "auto":
eval_logger.info(f"Using {accelerator.num_processes} devices with pipeline parallelism")
self._rank = 0
self._word_size = 1
self._world_size = 1
else:
eval_logger.info(f"Using single device: {self._device}")
self.model.to(self._device)
self._rank = 0
self._word_size = 1
self._world_size = 1
self.accelerator = accelerator

@property
@@ -303,7 +303,7 @@ def _collate(x):
"""
keywords = [
"Answer:",
"answer is:", "choice is:", "option is:",
"answer is:", "choice is:", "option is:",
"Answer is:", "Choice is:", "Option is:",
"answer is", "choice is", "option is",
"Answer is", "Choice is", "Option is"
2 changes: 1 addition & 1 deletion lmms_eval/models/auroracap.py
@@ -159,7 +159,7 @@ def __init__(
else:
self.model.to(self._device)
self._rank = 0
self._word_size = 1
self._world_size = 1

# For Video Caption
self.video_decode_backend = video_decode_backend
2 changes: 1 addition & 1 deletion lmms_eval/models/cogvlm2.py
@@ -71,7 +71,7 @@ def __init__(
self._world_size = self.accelerator.num_processes
else:
self._rank = 0
self._word_size = 1
self._world_size = 1

@property
def config(self):