Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
Overview of Japanese LLMs (日本語LLMまとめ)
Deep Learning for Computer Vision (深度學習於電腦視覺) by Frank Wang (王鈺強)
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
[CVPR 2024] The official implementation of the paper "Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding"
DriveLM: Driving with Graph Visual Question Answering
A Framework of Small-scale Large Multimodal Models
A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".
Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Shot2Story: a new multi-shot video understanding benchmark with comprehensive video summaries and detailed shot-level captions.
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
Code release for Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning
Read and review various papers in the field of Vision and Vision-Language.
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
[ICLR'24] Official code for "C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion"