Chinese-English bilingual large multi-modal model series
Multimodal Conversation Model VisCPM-Chat • Text-to-image Model VisCPM-Paint • Inference • Paper
VisCPM-Chat Demo • VisCPM-Paint Demo • VisCPM-Chat🤗 • VisCPM-Paint🤗
简体中文 | English
VisCPM is a family of open-source large multimodal models that support multimodal conversation (the VisCPM-Chat model) and text-to-image generation (the VisCPM-Paint model) in both Chinese and English, achieving state-of-the-art performance among Chinese open-source multimodal models. VisCPM is trained on top of the 10B-parameter large language model CPM-Bee, fusing the visual encoder Muffin and the visual decoder Diffusion-UNet to support visual inputs and outputs. Thanks to the strong bilingual capability of CPM-Bee, VisCPM can be pretrained on English multimodal data only and still generalize well to achieve promising Chinese multimodal capabilities.
- 👐 Open-source Usage: VisCPM is free to be used for personal and research purposes. By open-sourcing the VisCPM model family, we hope to promote the development of the open-source community of large multimodal models and related research.
- 🌟 Image and text generation coverage: VisCPM models provide relatively comprehensive support for image and text multimodal capabilities, covering both multimodal conversation (image-to-text generation) capabilities and text-to-image generation capabilities.
- 💫 Excellent bilingual performance: Thanks to the excellent bilingual capability of the base language model CPM-Bee, VisCPM achieves outstanding results in both bilingual multimodal conversation and text-to-image generation.
VisCPM is continuously being updated. We have added features such as low-resource inference and easy-to-use web deployment, and we have released new versions with upgraded capabilities, such as OmniLMM. Stay tuned!
- [2024/04/22] 🚀 Welcome to follow our latest release of the MiniCPM-V 2.0 edge-side large multimodal model, which has leading Optical Character Recognition (OCR) and multimodal understanding capabilities. It has achieved the best level among open-source models in the comprehensive OCR capability benchmark OCRBench, and even approaches the performance of Gemini Pro in scene text understanding.
- [2024/02/02] 🚀 Welcome to follow our latest release of the OmniLMM large multimodal model! Among them, OmniLMM-3B is a bilingual multimodal dialogue model in Chinese and English, trained based on the bilingual large model MiniCPM-2.4B and the SigLip-400M visual encoder, using the same training process as VisCPM-Chat. It can be deployed on terminal devices and possesses advanced multimodal dialogue capabilities; OmniLMM-13B is an English multimodal model, initially trained based on EVA02-5B and Zephyr-7B-β, and compared to other models of the same scale, it demonstrates superior performance in multiple benchmark tests.
- [2024/01/16] 🎉 The paper of VisCPM is accepted by ICLR 2024 as spotlight (top 5%)!
- [2023/09/06] 🔌 VisCPM-Chat API Released! Now you can easily use the VisCPM-Chat model directly through the API. Check out the API Usage Guide for more details.
- [2023/08/23] 📑 We release the paper of VisCPM: Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages. More implementation details and experimental results are presented in the paper.
- [2023/08/18] ⤴️ We upgrade to VisCPM-Chat-v1.1, with stronger detail understanding and complex reasoning ability!
- [2023/08/18] 🛠️ We support fine-tuning to make VisCPM more suitable for your application scenarios!
- [2023/07/20] 🌐 We release VisCPM-Chat and VisCPM-Paint online demo!
- [2023/07/20] 🎢 We provide one-click deployment of local web version demo!
- [2023/07/20] ⚡️ We support low-resource inference, with minimum 5G GPU memory cost to run VisCPM-Chat!
- [2023/07/18] 🤗 VisCPM-Chat and VisCPM-Paint have been integrated into the Hugging Face framework!
VisCPM-Chat supports bilingual multimodal conversations involving images in both Chinese and English. The model uses the Muffin visual encoding architecture and CPM-Bee (10B) as the base LLM. It combines the visual and language modules and is optimized with the language modeling training objective. Training consists of two stages: multimodal pretraining and instruction tuning.
- Multimodal Pretraining: VisCPM-Chat is pretrained on approximately 150M high-quality English text-image pairs. The data sources include CC3M, CC12M, COCO, Visual Genome, Laion, etc. In this stage, the language model parameters remain frozen and only the parameters of the visual modules are updated, enabling efficient alignment of vision and language representations.
- Instruction Tuning: We use the LLaVA-150K dataset, which contains English multimodal instruction-following data, and mix it with the corresponding translated Chinese data to fine-tune the model and align its multimodal capabilities with user intents. In this stage, all model parameters are updated to improve the data efficiency of instruction tuning. Interestingly, we observe that even when fine-tuning with English instruction data only, the model comprehends Chinese questions well but can only respond in English. This indicates that the model generalizes well across languages and modalities. By incorporating a small amount of translated Chinese data during instruction tuning, we can align the model's response language with the user's question language. (A minimal sketch of this two-stage scheme follows this list.)
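The two-stage recipe above can be summarized, very roughly, by the following PyTorch-style sketch. The attribute names (`model.llm`, `model.visual_encoder`) and the surrounding training loop are illustrative assumptions, not the actual VisCPM training code.

```python
import torch

def set_trainable(stage: str, model):
    """Stage-dependent parameter freezing, as described above.
    `model.llm` and `model.visual_encoder` are hypothetical attribute names."""
    if stage == "pretrain":
        # Stage 1: keep the language model frozen, train only the visual modules
        for p in model.llm.parameters():
            p.requires_grad = False
        for p in model.visual_encoder.parameters():
            p.requires_grad = True
    elif stage == "instruction_tuning":
        # Stage 2: update all parameters on (English + translated Chinese) instruction data
        for p in model.parameters():
            p.requires_grad = True
    else:
        raise ValueError(f"unknown stage: {stage}")
    return [p for p in model.parameters() if p.requires_grad]

# Usage (illustrative): both stages optimize the same language-modeling loss,
# only the set of trainable parameters changes.
# trainable = set_trainable("pretrain", model)
# optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```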
We evaluate the model on the standard LLaVA English benchmark and a Chinese benchmark translated from it. The benchmark examines the model's performance in conversation, detailed description, and complex reasoning, using GPT-4 for scoring. As the table below shows, VisCPM-Chat achieves the best average performance in Chinese multimodal capabilities, excelling in conversation and complex reasoning, while also demonstrating good English multimodal capabilities. We provide two versions of the model, VisCPM-Chat-balance and VisCPM-Chat-zhplus. The former has balanced ability in English and Chinese, while the latter places a stronger emphasis on Chinese proficiency. Both models use the same data during the instruction tuning stage. VisCPM-Chat-zhplus additionally incorporates 20M cleaned native Chinese text-image pairs and 120M text-image pairs translated into Chinese during pretraining. VisCPM-Chat-v1.1 additionally uses the UniMM-Chat multimodal instruction tuning dataset.
| Category | Model | LLM Backbone | Conversation (En) | Detailed Description (En) | Complex Reasoning (En) | Avg (En) | Conversation (Zh) | Detailed Description (Zh) | Complex Reasoning (Zh) | Avg (Zh) |
|---|---|---|---|---|---|---|---|---|---|---|
| English Model | MiniGPT4 | Vicuna-13B | 65.0 | 67.3 | 76.6 | 69.7 | - | - | - | - |
| | InstructBLIP | Vicuna-13B | 81.9 | 68.0 | 91.2 | 80.5 | - | - | - | - |
| | LLaVA | Vicuna-13B | 89.5 | 70.4 | 96.2 | 85.6 | - | - | - | - |
| En-Zh Bilingual Model | mPLUG-Owl | LLaMA-7B | 64.6 | 47.7 | 80.1 | 64.2 | 76.3 | 61.2 | 77.8 | 72.0 |
| | VisualGLM | ChatGLM-6B | 62.4 | 63.0 | 80.6 | 68.7 | 76.6 | 87.8 | 83.6 | 82.7 |
| | Ziya-Visual | Ziya-LLaMA-13B-v1 | 82.7 | 69.9 | 92.1 | 81.7 | 85.0 | 74.7 | 82.4 | 80.8 |
| | Qwen-VL | Qwen-7B | 82.4 | 72.6 | 91.9 | 83.8 | 82.3 | 93.4 | 89.5 | 88.2 |
| | VisCPM-Chat-balance | CPMBee-10B | 83.3 | 68.9 | 90.5 | 81.1 | 92.7 | 76.1 | 89.2 | 86.3 |
| | VisCPM-Chat-zhplus | CPMBee-10B | 80.1 | 65.7 | 92.5 | 79.6 | 90.3 | 81.4 | 92.1 | 88.2 |
| | VisCPM-Chat-v1.1 | CPMBee-10B | 80.1 | 67.1 | 97.1 | 81.5 | 91.3 | 90.7 | 95.4 | 92.5 |
VisCPM-Paint supports bilingual text-to-image generation. The model uses CPM-Bee as the text encoder and a UNet as the image decoder, and fuses the vision and language models with the diffusion-model training objective. During training, the parameters of the language model remain frozen. The visual decoder is initialized with the parameters of Stable Diffusion 2.1 and fused with the language model by gradually unfreezing key bridging parameters. The model is trained on the LAION 2B English text-image pair dataset.
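As a rough illustration of the training objective described above (and not the actual VisCPM-Paint implementation), a single diffusion training step with a frozen text encoder can be sketched as follows; all module and function names are placeholders.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(unet, text_encoder, add_noise, latents, text_tokens,
                            num_train_timesteps=1000):
    """One denoising-diffusion training step with a frozen text encoder.
    All module/function names here are illustrative placeholders."""
    with torch.no_grad():                        # language model stays frozen
        cond = text_encoder(text_tokens)         # text conditioning features

    noise = torch.randn_like(latents)
    t = torch.randint(0, num_train_timesteps, (latents.size(0),),
                      device=latents.device)
    noisy_latents = add_noise(latents, noise, t)  # forward diffusion q(x_t | x_0)

    # The UNet (initialized from Stable Diffusion 2.1) predicts the added noise,
    # conditioned on the frozen text features.
    pred = unet(noisy_latents, t, encoder_hidden_states=cond)

    return F.mse_loss(pred, noise)                # standard epsilon-prediction loss
```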
Similar to VisCPM-Chat, we found that, thanks to the bilingual capability of CPM-Bee, VisCPM-Paint achieves good Chinese text-to-image generation by training on English text-image pairs only, surpassing the performance of Chinese open-source models. Incorporating an additional 20M cleaned native Chinese text-image pairs and 120M text-image pairs translated into Chinese further improves the model's Chinese text-to-image generation ability. We sample 30,000 images from the standard MSCOCO image generation test set and compute the commonly used FID (Fréchet Inception Distance) metric to assess the quality of the generated images. As with the chat model, we provide two versions, VisCPM-Paint-balance and VisCPM-Paint-zhplus. The former has balanced ability in English and Chinese, while the latter emphasizes Chinese proficiency. VisCPM-Paint-balance is trained only on English text-image pairs, while VisCPM-Paint-zhplus additionally incorporates 20M native Chinese text-image pairs and 120M text-image pairs translated into Chinese on top of VisCPM-Paint-balance.
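As a rough illustration of this evaluation protocol (not the exact script we used), zero-shot FID between generated images and MSCOCO reference images can be computed with an off-the-shelf implementation such as torchmetrics:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Sketch of zero-shot FID evaluation: compare generated images against MSCOCO
# reference images. `real_images` / `fake_images` are uint8 tensors of shape
# (N, 3, H, W); loading them from disk is omitted here.
fid = FrechetInceptionDistance(feature=2048)

def update_in_batches(images: torch.Tensor, real: bool, batch_size: int = 64):
    for i in range(0, images.size(0), batch_size):
        fid.update(images[i:i + batch_size], real=real)

# update_in_batches(real_images, real=True)
# update_in_batches(fake_images, real=False)
# print(f"FID: {fid.compute().item():.2f}")
```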
| Model | Zero-shot FID↓ (English) | Zero-shot FID↓ (Chinese) |
|---|---|---|
| GLIDE | 12.2 | - |
| Make-A-Scene | 11.8 | - |
| DALL·E-2 | 10.4 | - |
| Unidiffuser | 9.7 | - |
| Cogview2 | - | 24.0 |
| Stable Diffusion | 8.6 | - |
| AltDiffusion | 17.2 | 16.1 |
| TaiyiDiffusion | - | 15.6 |
| VisCPM-Paint-balance | 9.5 | 10.9 |
| VisCPM-Paint-zhplus | 9.9 | 9.6 |
- Clone this repository and navigate to the source folder
git clone https://github.com/OpenBMB/VisCPM.git
cd VisCPM
- Create conda environment
conda create -n viscpm python=3.10 -y
conda activate viscpm
- Install dependencies
pip install "torch>=1.10"
pip install -r requirements.txt
| Model | Description | Download Link |
|---|---|---|
| VisCPM-Chat-v1.1 | Latest version of the multimodal conversation model, with stronger detail understanding and complex reasoning ability | download |
| VisCPM-Chat-balance | Multimodal conversation model with balanced proficiency in both Chinese and English | download |
| VisCPM-Chat-zhplus | Multimodal conversation model with a strong emphasis on Chinese proficiency | download |
| VisCPM-Paint-balance | Text-to-image model with balanced proficiency in both Chinese and English | download |
| VisCPM-Paint-zhplus | Text-to-image model with a strong emphasis on Chinese proficiency | download |
After downloading the checkpoints, please refer to the following code to run VisCPM-Chat (replace '/path/to/checkpoint' with the actual path of the downloaded checkpoint).
We can have a multimodal conversation with VisCPM-Chat in just a few lines of code.
# If your GPU has less than 40G of memory, you can set the following environment variable. With it set, memory usage drops to about 17G, but inference takes longer. This feature relies on the BMInf package.
export CUDA_MEMORY_CPMBEE_MAX=1g
from VisCPM import VisCPMChat
from PIL import Image
model_path = '/path/to/checkpoint'
viscpm_chat = VisCPMChat(model_path, image_safety_checker=True)
# We perform security checks on the input images by default.
image_path = 'figures/vlu_case1.png'
image = Image.open(image_path).convert("RGB")
question = '如果用一句中国唐代的著名诗人"李白"的古诗来描述这幅图像,你能想到什么?' # If you use an ancient poem by the famous Tang Dynasty poet "Li Bai" to describe this image, what can you think of?
answer, _, _ = viscpm_chat.chat(image, question)
print(answer)
We can obtain the following results:
“黄河之水天上来,奔流到海不复回。” 李白的这句诗可以用来形容这幅图片中汹涌澎湃、波涛汹涌的景象:一条湍急的河流从山上奔腾而下,形成了一幅令人叹为观止的画面,展示出大自然的力量和雄伟壮丽。
"The water of the Yellow River comes up from the sky, rushing to the sea and never returning." This line of Li Bai's poem can be used to describe the surging and turbulent scene in this picture: a rushing river rushes down from the mountain, forming a breathtaking scene. Breathtaking images that showcase the power and majesty of nature.
The code can easily be extended to support multi-turn multimodal conversations.
from VisCPM import VisCPMChat
from PIL import Image
model_path = '/path/to/checkpoint'
viscpm_chat = VisCPMChat(model_path, image_safety_checker=True)
# We perform security checks on the input images by default.
image_path = 'figures/vlu_case2.jpeg'
image = Image.open(image_path).convert("RGB")
question = '这幅图像是在哪个节日拍摄的?' # At which festival was this image taken?
answer, context, vision_hidden_states = viscpm_chat.chat(image, question)
# pass history context of multi-turn conversation
question = '你能用什么古诗描述这幅画?' # What ancient poem can you use to describe this painting?
answer, context, _ = viscpm_chat.chat(image, question, context, vision_hidden_states=vision_hidden_states)
print(context)
We can obtain the following results:
User: 这幅图像是在哪个节日拍摄的?
AI: 这幅图像是在中秋节拍摄的, 也就是中国传统节日中的月圆之夜。
User: 你能用什么古诗描述这幅画?
AI: “明月几时有,把酒问青天。” 这是苏轼的《水调歌头》中的一句诗,用来形容这幅图片再贴切不过了:在中秋之夜,月亮高高地挂在天空中,一座古老的建筑沐浴着月光,营造出一种宁静祥和的气氛。
User: On which festival was this image taken?
AI: This image was taken during the Mid-Autumn Festival, which is the night of the full moon in the traditional Chinese festival.
User: What ancient poem can you use to describe this painting?
AI: "When did the bright moon first exist? With a cup of wine in my hand, I ask the blue sky." This is a line from Su Shi's "Shui Diao Ge Tou", which can't be more appropriate to describe this picture: On the night of Mid-Autumn Festival, the moon hangs high in the sky among them, an ancient building is bathed in moonlight, creating a peaceful atmosphere.
We offer an API, and you can easily experience VisCPM-Chat with the following code. Supported input formats and usage of the API are as follows:
import requests
import base64
url = "http://34.143.180.202:3389/viscpm"
resp = requests.post(url,json={
# need to modify
"image": base64.b64encode(open("path/to/image", "rb").read()).decode(),
"question": "Describe this image",
})
resp = resp.json()
print(resp)
After downloading the checkpoints, please refer to the following code to run VisCPM-Paint (replace '/path/to/checkpoint' with the actual path of the downloaded checkpoint).
The input prompts of the images above can be found at prompts.txt.
# If your GPU has less than 40G of memory, you can set the following environment variable. With it set, memory usage drops to about 17G, but inference takes longer. This feature relies on the BMInf package.
export CUDA_MEMORY_CPMBEE_MAX=1g
from VisCPM import VisCPMPaint
painter = VisCPMPaint('/path/to/checkpoint', image_safety_checker=True, prompt_safety_checker=True, add_ranker=True)
# We perform security checks on the input text and output images by default. Additionally, the default setting includes image reranking.
image = painter.generate('人闲桂花落,月静春山空')
# The sweet-scented osmanthus falls when people are idle, the moon is quiet and the mountains are empty in spring.
# Corresponding to the second picture in the first row of the above picture.
image.save('/data/test.png')
In our code, we have enabled the default security checks for both input text and output images.
Additionally, reranking of generated images is enabled by default: for a given input, we generate four images simultaneously and return the one with the highest relevance score to the input, evaluated with Chinese-CLIP. Reranking improves the stability of the generated image quality but also slows down generation. If you prefer to obtain results quickly, you can disable the reranking mechanism.
If you provide English text as input for generating images, it is advisable to disable both the reranking mechanism and the input text checker, since the scoring model used for reranking and the safety checker for the input prompt are trained specifically on Chinese text. A sketch of how these options could be switched off is given below.
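Based on the constructor arguments shown in the example above (image_safety_checker, prompt_safety_checker, add_ranker), disabling the Chinese-specific prompt checker and the reranker for English prompts might look roughly like the following; the English prompt and output path are only illustrations.

```python
from VisCPM import VisCPMPaint

# For English prompts: keep the output-image safety check, but skip the
# prompt checker and the Chinese-CLIP reranker, both trained on Chinese text.
painter = VisCPMPaint(
    '/path/to/checkpoint',
    image_safety_checker=True,
    prompt_safety_checker=False,  # input text checker is Chinese-specific
    add_ranker=False,             # disable reranking for faster generation
)

image = painter.generate('A serene mountain lake at sunrise')  # illustrative prompt
image.save('test_en.png')
```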
We use BMInf to reduce GPU memory cost. First install BMInf with pip install bminf, then set the environment variable export CUDA_MEMORY_CPMBEE_MAX=1g in your shell, and then follow the inference steps above, as sketched below. The minimum GPU memory usage of VisCPM-Chat can be reduced to 5G, and that of VisCPM-Paint to 17G.
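For reference, a minimal end-to-end low-resource run might look like the sketch below; it assumes the CUDA_MEMORY_CPMBEE_MAX variable is read when the model is constructed, so setting it from Python before creating the model has the same effect as exporting it in the shell.

```python
import os

# Limit CPM-Bee GPU memory via BMInf; must be set before the model is built
# (assumption: the variable is read at model initialization time).
os.environ['CUDA_MEMORY_CPMBEE_MAX'] = '1g'

from VisCPM import VisCPMChat
from PIL import Image

viscpm_chat = VisCPMChat('/path/to/checkpoint', image_safety_checker=True)
image = Image.open('figures/vlu_case1.png').convert('RGB')
answer, _, _ = viscpm_chat.chat(image, 'Describe this image')
print(answer)
```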
We provide a simple web demo based on Gradio. First install Gradio with pip install gradio, and then execute the following commands:
git clone https://github.com/OpenBMB/VisCPM.git
cd VisCPM
python demo_chat.py # viscpm_chat demo, or
python demo_paint.py # viscpm_paint demo
We provide fine-tuning code for VisCPM-Chat so that users can fine-tune it on their own private data. The fine-tuning code is located in the finetune/ft_viscpm_chat directory and is used as follows:
# Get the dataset
bash ./finetune/ft_viscpm_chat/get_llava150k_zh.sh
# Model fine-tuning, note to modify the dataset and model checkpoint paths within
bash ./finetune/ft_viscpm_chat/run_viscpm_chat_ft.sh
# node: 8
# batch_size: 8 * 1
# More details can be found in './finetune/ft_viscpm_chat/config/viscpm_chat_ft.json' and './finetune/ft_viscpm_chat/run_viscpm_chat_ft.sh'
Note:
- deepspeed-0.9.1 is used in the fine-tuning code; the installation method can be found here.
- Currently, we have only tested the fine-tuning code on Linux. If you are fine-tuning under other system configurations, you may need to modify some of the code.
As a multimodal model, VisCPM generates content by learning from a vast amount of public image and text data. However, it does not possess the ability to comprehend or express personal opinions or value judgments; any content generated by VisCPM does not represent the viewpoints or positions of the model developers. Therefore, when using content generated by VisCPM, users should take full responsibility for evaluating and verifying it on their own.
To prevent the model from being misused to process or generate content that violates widely accepted societal values, we have incorporated a content safety module into VisCPM. When the safety module detects image or text content that does not comply with safety regulations during model processing or generation, it intercepts the corresponding content. We perform security checks on the input images accepted by VisCPM-Chat and on the input text and output images of VisCPM-Paint. The safety module still has room for improvement, and there may be instances of both false positives and false negatives; we will continue to enhance its performance in future updates.
VisCPM is governed by the GML License and permits individual and research usage. If you intend to use the model for commercial purposes, please reach out to [email protected] to negotiate commercial licensing.
The CPM-Bee base, governed by the General Model License (GML), permits commercial usage. If you intend to utilize the model for commercial purposes, please reach out to [email protected] to obtain the certificate of authorization.
VisCPM is still under active development, and we will further optimize it in the following aspects:
- Enabling model quantization
This project is developed by the following institutions:
Please consider citing the following papers if our work is helpful to you.
@article{VisCPM,
title={Large multilingual models pivot zero-shot multimodal learning across languages},
author={Hu, Jinyi and Yao, Yuan and Wang, Chongyi and Wang, Shan and Pan, Yinxu and Chen, Qianyu and Yu, Tianyu and Wu, Hanghao and Zhao, Yue and Zhang, Haoye and others},
journal={arXiv preprint arXiv:2308.12038},
year={2023}
}
@article{muffin,
title={Reformulating vision-language foundation models and datasets towards universal multimodal assistants},
author={Yu, Tianyu and Hu, Jinyi and Yao, Yuan and Zhang, Haoye and Zhao, Yue and Wang, Chongyi and Wang, Shan and Pan, Yinxv and Xue, Jiao and Li, Dahai and others},
journal={arXiv preprint arXiv:2310.00653},
year={2023}
}