Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
Overview of Japanese LLMs (日本語LLMまとめ)
Deep Learning for Computer Vision (深度學習於電腦視覺) by Frank Wang (王鈺強)
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
[CVPR 2024] The official implementation of the paper "Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding"
DriveLM: Driving with Graph Visual Question Answering
A Framework of Small-scale Large Multimodal Models
A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".
Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Shot2Story: a new multi-shot video understanding benchmark with comprehensive video summaries and detailed shot-level captions.
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
Code release for Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning
Read and review various papers in the field of Vision and Vision-Language.
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
[ICLR'24] Official code for "C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion"