diff --git a/README.md b/README.md
index f13bd21d6..d92aa1742 100644
--- a/README.md
+++ b/README.md
@@ -118,25 +118,26 @@ VLMEvalKit will use a **judge LLM** to extract answer from the output if you set
 **Supported PyTorch / HF Models**
-| [**IDEFICS-[9B/80B/v2-8B/v3-8B]-Instruct**](https://huggingface.co/HuggingFaceM4/idefics-9b-instruct)🚅🎞️ | [**InstructBLIP-[7B/13B]**](https://github.com/salesforce/LAVIS/blob/main/projects/instructblip/README.md) | [**LLaVA-[v1-7B/v1.5-7B/v1.5-13B]**](https://github.com/haotian-liu/LLaVA) | [**MiniGPT-4-[v1-7B/v1-13B/v2-7B]**](https://github.com/Vision-CAIR/MiniGPT-4) |
-| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
-| [**mPLUG-Owl[2/3]**](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2)🎞️ | [**OpenFlamingo-v2**](https://github.com/mlfoundations/open_flamingo)🎞️ | [**PandaGPT-13B**](https://github.com/yxuansu/PandaGPT) | [**Qwen-VL**](https://huggingface.co/Qwen/Qwen-VL)🚅🎞️ [**Qwen-VL-Chat**](https://huggingface.co/Qwen/Qwen-VL-Chat)🚅🎞️ |
-| [**VisualGLM-6B**](https://huggingface.co/THUDM/visualglm-6b)🚅 | [**InternLM-XComposer-[1/2]**](https://huggingface.co/internlm/internlm-xcomposer-7b)🚅 | [**ShareGPT4V-[7B/13B]**](https://sharegpt4v.github.io)🚅 | [**TransCore-M**](https://github.com/PCIResearch/TransCore-M) |
-| [**LLaVA (XTuner)**](https://huggingface.co/xtuner/llava-internlm-7b)🚅 | [**CogVLM-[Chat/Llama3]**](https://huggingface.co/THUDM/cogvlm-chat-hf)🚅 | [**ShareCaptioner**](https://huggingface.co/spaces/Lin-Chen/Share-Captioner)🚅 | [**CogVLM-Grounding-Generalist**](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf)🚅 |
-| [**Monkey**](https://github.com/Yuliang-Liu/Monkey)🚅 [**Monkey-Chat**](https://github.com/Yuliang-Liu/Monkey)🚅 | [**EMU2-Chat**](https://github.com/baaivision/Emu)🚅🎞️ | [**Yi-VL-[6B/34B]**](https://huggingface.co/01-ai/Yi-VL-6B) | [**MMAlaya**](https://huggingface.co/DataCanvas/MMAlaya)🚅 |
-| [**InternLM-XComposer-2.5**](https://github.com/InternLM/InternLM-XComposer)🚅🎞️ | [**MiniCPM-[V1/V2/V2.5/V2.6]**](https://github.com/OpenBMB/MiniCPM-V)🚅🎞️ | [**OmniLMM-12B**](https://huggingface.co/openbmb/OmniLMM-12B) | [**InternVL-Chat-[V1-1/V1-2/V1-5/V2]**](https://github.com/OpenGVLab/InternVL)🚅🎞️ |
-| [**DeepSeek-VL**](https://github.com/deepseek-ai/DeepSeek-VL/tree/main)🎞️ | [**LLaVA-NeXT**](https://llava-vl.github.io/blog/2024-01-30-llava-next/)🚅🎞️ | [**Bunny-Llama3**](https://huggingface.co/BAAI/Bunny-v1_1-Llama-3-8B-V)🚅 | [**XVERSE-V-13B**](https://github.com/xverse-ai/XVERSE-V-13B/blob/main/vxverse/models/vxverse.py) |
-| [**PaliGemma-3B**](https://huggingface.co/google/paligemma-3b-pt-448) 🚅 | [**360VL-70B**](https://huggingface.co/qihoo360/360VL-70B) 🚅 | [**Phi-3-Vision**](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)🚅🎞️ [**Phi-3.5-Vision**](https://huggingface.co/microsoft/Phi-3.5-vision-instruct)🚅🎞️ | [**WeMM**](https://github.com/scenarios/WeMM)🚅 |
-| [**GLM-4v-9B**](https://huggingface.co/THUDM/glm-4v-9b) 🚅 | [**Cambrian-[8B/13B/34B]**](https://cambrian-mllm.github.io/) | [**LLaVA-Next-[Qwen-32B]**](https://huggingface.co/lmms-lab/llava-next-qwen-32b) 🎞️ | [**Chameleon-[7B/30B]**](https://huggingface.co/facebook/chameleon-7b)🚅🎞️ |
-| [**Video-LLaVA-7B-[HF]**](https://github.com/PKU-YuanGroup/Video-LLaVA) 🎬 | [**VILA1.5-[3B/8B/13B/40B]**](https://github.com/NVlabs/VILA/)🎞️ | [**Ovis[1.5-Llama3-8B/1.5-Gemma2-9B/1.6-Gemma2-9B/1.6-Llama3.2-3B/1.6-Gemma2-27B]**](https://github.com/AIDC-AI/Ovis) 🚅🎞️ | [**Mantis-8B-[siglip-llama3/clip-llama3/Idefics2/Fuyu]**](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) 🎞️ |
-| [**Llama-3-MixSenseV1_1**](https://huggingface.co/Zero-Vision/Llama-3-MixSenseV1_1)🚅 | [**Parrot-7B**](https://github.com/AIDC-AI/Parrot) 🚅 | [**OmChat-v2.0-13B-sinlge-beta**](https://huggingface.co/omlab/omchat-v2.0-13B-single-beta_hf) 🚅 | [**Video-ChatGPT**](https://github.com/mbzuai-oryx/Video-ChatGPT) 🎬 |
-| [**Chat-UniVi-7B[-v1.5]**](https://github.com/PKU-YuanGroup/Chat-UniVi) 🎬 | [**LLaMA-VID-7B**](https://github.com/dvlab-research/LLaMA-VID) 🎬 | [**VideoChat2-HD**](https://huggingface.co/OpenGVLab/VideoChat2_HD_stage4_Mistral_7B) 🎬 | [**PLLaVA-[7B/13B/34B]**](https://huggingface.co/ermu2001/pllava-7b) 🎬 |
-| [**RBDash_72b**](https://github.com/RBDash-Team/RBDash) 🚅🎞️ | [**xgen-mm-phi3-[interleave/dpo]-r-v1.5**](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-interleave-r-v1.5) 🚅🎞️ | [**Qwen2-VL-[2B/7B/72B]**](https://github.com/QwenLM/Qwen2-VL)🚅🎞️ | [**slime_[7b/8b/13b]**](https://github.com/yfzhang114/SliME)🎞️ |
+| [**IDEFICS-[9B/80B/v2-8B/v3-8B]-Instruct**](https://huggingface.co/HuggingFaceM4/idefics-9b-instruct)🚅🎞️ | [**InstructBLIP-[7B/13B]**](https://github.com/salesforce/LAVIS/blob/main/projects/instructblip/README.md) | [**LLaVA-[v1-7B/v1.5-7B/v1.5-13B]**](https://github.com/haotian-liu/LLaVA) | [**MiniGPT-4-[v1-7B/v1-13B/v2-7B]**](https://github.com/Vision-CAIR/MiniGPT-4) |
+|--------------------------------------------------------------------------------------------------------------------------------------| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
+| [**mPLUG-Owl[2/3]**](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2)🎞️ | [**OpenFlamingo-v2**](https://github.com/mlfoundations/open_flamingo)🎞️ | [**PandaGPT-13B**](https://github.com/yxuansu/PandaGPT) | [**Qwen-VL**](https://huggingface.co/Qwen/Qwen-VL)🚅🎞️ [**Qwen-VL-Chat**](https://huggingface.co/Qwen/Qwen-VL-Chat)🚅🎞️ |
+| [**VisualGLM-6B**](https://huggingface.co/THUDM/visualglm-6b)🚅 | [**InternLM-XComposer-[1/2]**](https://huggingface.co/internlm/internlm-xcomposer-7b)🚅 | [**ShareGPT4V-[7B/13B]**](https://sharegpt4v.github.io)🚅 | [**TransCore-M**](https://github.com/PCIResearch/TransCore-M) |
+| [**LLaVA (XTuner)**](https://huggingface.co/xtuner/llava-internlm-7b)🚅 | [**CogVLM-[Chat/Llama3]**](https://huggingface.co/THUDM/cogvlm-chat-hf)🚅 | [**ShareCaptioner**](https://huggingface.co/spaces/Lin-Chen/Share-Captioner)🚅 | [**CogVLM-Grounding-Generalist**](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf)🚅 |
+| [**Monkey**](https://github.com/Yuliang-Liu/Monkey)🚅 [**Monkey-Chat**](https://github.com/Yuliang-Liu/Monkey)🚅 | [**EMU2-Chat**](https://github.com/baaivision/Emu)🚅🎞️ | [**Yi-VL-[6B/34B]**](https://huggingface.co/01-ai/Yi-VL-6B) | [**MMAlaya**](https://huggingface.co/DataCanvas/MMAlaya)🚅 |
+| [**InternLM-XComposer-2.5**](https://github.com/InternLM/InternLM-XComposer)🚅🎞️ | [**MiniCPM-[V1/V2/V2.5/V2.6]**](https://github.com/OpenBMB/MiniCPM-V)🚅🎞️ | [**OmniLMM-12B**](https://huggingface.co/openbmb/OmniLMM-12B) | [**InternVL-Chat-[V1-1/V1-2/V1-5/V2]**](https://github.com/OpenGVLab/InternVL)🚅🎞️ |
+| [**DeepSeek-VL**](https://github.com/deepseek-ai/DeepSeek-VL/tree/main)🎞️ | [**LLaVA-NeXT**](https://llava-vl.github.io/blog/2024-01-30-llava-next/)🚅🎞️ | [**Bunny-Llama3**](https://huggingface.co/BAAI/Bunny-v1_1-Llama-3-8B-V)🚅 | [**XVERSE-V-13B**](https://github.com/xverse-ai/XVERSE-V-13B/blob/main/vxverse/models/vxverse.py) |
+| [**PaliGemma-3B**](https://huggingface.co/google/paligemma-3b-pt-448) 🚅 | [**360VL-70B**](https://huggingface.co/qihoo360/360VL-70B) 🚅 | [**Phi-3-Vision**](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)🚅🎞️ [**Phi-3.5-Vision**](https://huggingface.co/microsoft/Phi-3.5-vision-instruct)🚅🎞️ | [**WeMM**](https://github.com/scenarios/WeMM)🚅 |
+| [**GLM-4v-9B**](https://huggingface.co/THUDM/glm-4v-9b) 🚅 | [**Cambrian-[8B/13B/34B]**](https://cambrian-mllm.github.io/) | [**LLaVA-Next-[Qwen-32B]**](https://huggingface.co/lmms-lab/llava-next-qwen-32b) 🎞️ | [**Chameleon-[7B/30B]**](https://huggingface.co/facebook/chameleon-7b)🚅🎞️ |
+| [**Video-LLaVA-7B-[HF]**](https://github.com/PKU-YuanGroup/Video-LLaVA) 🎬 | [**VILA1.5-[3B/8B/13B/40B]**](https://github.com/NVlabs/VILA/)🎞️ | [**Ovis[1.5-Llama3-8B/1.5-Gemma2-9B/1.6-Gemma2-9B/1.6-Llama3.2-3B/1.6-Gemma2-27B]**](https://github.com/AIDC-AI/Ovis) 🚅🎞️ | [**Mantis-8B-[siglip-llama3/clip-llama3/Idefics2/Fuyu]**](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) 🎞️ |
+| [**Llama-3-MixSenseV1_1**](https://huggingface.co/Zero-Vision/Llama-3-MixSenseV1_1)🚅 | [**Parrot-7B**](https://github.com/AIDC-AI/Parrot) 🚅 | [**OmChat-v2.0-13B-sinlge-beta**](https://huggingface.co/omlab/omchat-v2.0-13B-single-beta_hf) 🚅 | [**Video-ChatGPT**](https://github.com/mbzuai-oryx/Video-ChatGPT) 🎬 |
+| [**Chat-UniVi-7B[-v1.5]**](https://github.com/PKU-YuanGroup/Chat-UniVi) 🎬 | [**LLaMA-VID-7B**](https://github.com/dvlab-research/LLaMA-VID) 🎬 | [**VideoChat2-HD**](https://huggingface.co/OpenGVLab/VideoChat2_HD_stage4_Mistral_7B) 🎬 | [**PLLaVA-[7B/13B/34B]**](https://huggingface.co/ermu2001/pllava-7b) 🎬 |
+| [**RBDash_72b**](https://github.com/RBDash-Team/RBDash) 🚅🎞️ | [**xgen-mm-phi3-[interleave/dpo]-r-v1.5**](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-interleave-r-v1.5) 🚅🎞️ | [**Qwen2-VL-[2B/7B/72B]**](https://github.com/QwenLM/Qwen2-VL)🚅🎞️ | [**slime_[7b/8b/13b]**](https://github.com/yfzhang114/SliME)🎞️ |
 | [**Eagle-X4-[8B/13B]**](https://github.com/NVlabs/EAGLE)🚅🎞️, [**Eagle-X5-[7B/13B/34B]**](https://github.com/NVlabs/EAGLE)🚅🎞️ | [**Moondream1**](https://github.com/vikhyat/moondream)🚅, [**Moondream2**](https://github.com/vikhyat/moondream)🚅 | [**XinYuan-VL-2B-Instruct**](https://huggingface.co/Cylingo/Xinyuan-VL-2B)🚅🎞️ | [**Llama-3.2-[11B/90B]-Vision-Instruct**](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)🚅 |
-| [**Kosmos2**](https://huggingface.co/microsoft/kosmos-2-patch14-224)🚅 | [**H2OVL-Mississippi-[0.8B/2B]**](https://huggingface.co/h2oai/h2ovl-mississippi-2b)🚅🎞️ | [**Pixtral-12B**](https://huggingface.co/mistralai/Pixtral-12B-2409)🎞️ | [**Falcon2-VLM-11B**](https://huggingface.co/tiiuae/falcon-11B-vlm)🚅 |
-| [**MiniMonkey**](https://huggingface.co/mx262/MiniMonkey)🚅🎞️ | [**LLaVA-OneVision**](https://huggingface.co/lmms-lab/llava-onevision-qwen2-72b-ov-sft)🚅🎞️ | [**LLaVA-Video**](https://huggingface.co/collections/lmms-lab/llava-video-661e86f5e8dabc3ff793c944)🚅🎞️ | [**Aquila-VL-2B**](https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen)🚅🎞️ |
-| [**Mini-InternVL-Chat-[2B/4B]-V1-5**](https://github.com/OpenGVLab/InternVL)🚅🎞️ | [**InternVL2 Series**](https://huggingface.co/OpenGVLab/InternVL2-8B) 🚅🎞️ | [**Janus-1.3B**](https://huggingface.co/deepseek-ai/Janus-1.3B)🚅🎞️ | [**molmoE-1B/molmo-7B/molmo-72B**](https://huggingface.co/allenai/Molmo-7B-D-0924)🚅 |
-| [**Points-[Yi-1.5-9B/Qwen-2.5-7B]**](https://huggingface.co/WePOINTS/POINTS-Yi-1-5-9B-Chat)🚅 | [**NVLM**](https://huggingface.co/nvidia/NVLM-D-72B)🚅 | [**VIntern**](https://huggingface.co/5CD-AI/Vintern-3B-beta)🚅🎞️ | [**Aria**](https://huggingface.co/rhymes-ai/Aria)🚅🎞️ |
+| [**Kosmos2**](https://huggingface.co/microsoft/kosmos-2-patch14-224)🚅 | [**H2OVL-Mississippi-[0.8B/2B]**](https://huggingface.co/h2oai/h2ovl-mississippi-2b)🚅🎞️ | [**Pixtral-12B**](https://huggingface.co/mistralai/Pixtral-12B-2409)🎞️ | [**Falcon2-VLM-11B**](https://huggingface.co/tiiuae/falcon-11B-vlm)🚅 |
+| [**MiniMonkey**](https://huggingface.co/mx262/MiniMonkey)🚅🎞️ | [**LLaVA-OneVision**](https://huggingface.co/lmms-lab/llava-onevision-qwen2-72b-ov-sft)🚅🎞️ | [**LLaVA-Video**](https://huggingface.co/collections/lmms-lab/llava-video-661e86f5e8dabc3ff793c944)🚅🎞️ | [**Aquila-VL-2B**](https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen)🚅🎞️ |
+| [**Mini-InternVL-Chat-[2B/4B]-V1-5**](https://github.com/OpenGVLab/InternVL)🚅🎞️ | [**InternVL2 Series**](https://huggingface.co/OpenGVLab/InternVL2-8B) 🚅🎞️ | [**Janus-1.3B**](https://huggingface.co/deepseek-ai/Janus-1.3B)🚅🎞️ | [**molmoE-1B/molmo-7B/molmo-72B**](https://huggingface.co/allenai/Molmo-7B-D-0924)🚅 |
+| [**Points-[Yi-1.5-9B/Qwen-2.5-7B]**](https://huggingface.co/WePOINTS/POINTS-Yi-1-5-9B-Chat)🚅 | [**NVLM**](https://huggingface.co/nvidia/NVLM-D-72B)🚅 | [**VIntern**](https://huggingface.co/5CD-AI/Vintern-3B-beta)🚅🎞️ | [**Aria**](https://huggingface.co/rhymes-ai/Aria)🚅🎞️ |
+| [**VARCO-VISION-14B**](https://huggingface.co/NCSOFT/VARCO-VISION-14B-HF)🚅 | | | |
 🎞️: Support multiple images as inputs.
diff --git a/docs/zh-CN/README_zh-CN.md b/docs/zh-CN/README_zh-CN.md
index 8e8ae1e95..71d693262 100644
--- a/docs/zh-CN/README_zh-CN.md
+++ b/docs/zh-CN/README_zh-CN.md
@@ -128,6 +128,8 @@ $$^1$$ VLMEvalKit 在评测集的官方代码库中被使用
 | **[MiniMonkey](https://huggingface.co/mx262/MiniMonkey)**🚅🎞️ | **[LLaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-72b-ov-sft)**🚅🎞️ | **[LLaVA-Video](https://huggingface.co/collections/lmms-lab/llava-video-661e86f5e8dabc3ff793c944)**🚅🎞️ | **[Aquila-VL-2B](https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen)**🚅🎞️ |
 | [**Mini-InternVL-Chat-[2B/4B]-V1-5**](https://github.com/OpenGVLab/InternVL)🚅🎞️ | **[InternVL2 Series](https://huggingface.co/OpenGVLab/InternVL2-8B)** 🚅🎞️ | **[Janus-1.3B](https://huggingface.co/deepseek-ai/Janus-1.3B)**🚅🎞️ | **[molmoE-1B/molmo-7B/molmo-72B](https://huggingface.co/allenai/Molmo-7B-D-0924)**🚅 |
 | **[Points-[Yi-1.5-9B/Qwen-2.5-7B]](https://huggingface.co/WePOINTS/POINTS-Yi-1-5-9B-Chat)**🚅 | **[NVLM](https://huggingface.co/nvidia/NVLM-D-72B)**🚅 | **[VIntern](https://huggingface.co/5CD-AI/Vintern-3B-beta)**🚅🎞️ | **[Aria](https://huggingface.co/rhymes-ai/Aria)**🚅🎞️ |
+| [**VARCO-VISION-14B**](https://huggingface.co/NCSOFT/VARCO-VISION-14B-HF)🚅 | | | |
+
 🎞️ 表示支持多图片输入。
diff --git a/vlmeval/config.py b/vlmeval/config.py
index 36eb53948..6c346b57b 100644
--- a/vlmeval/config.py
+++ b/vlmeval/config.py
@@ -165,6 +165,7 @@
     'Aquila-VL-2B': partial(LLaVA_OneVision, model_path='BAAI/Aquila-VL-2B-llava-qwen'),
     'llava_video_qwen2_7b':partial(LLaVA_OneVision, model_path='lmms-lab/LLaVA-Video-7B-Qwen2'),
     'llava_video_qwen2_72b':partial(LLaVA_OneVision, model_path='lmms-lab/LLaVA-Video-72B-Qwen2'),
+    'varco-vision-hf':partial(LLaVA_OneVision_HF, model_path='NCSOFT/VARCO-VISION-14B-HF'),
 }
 internvl_series = {