MiniCPM-V Best Practices

MiniCPM-V is a series of end-side (on-device) multimodal LLMs (MLLMs) designed for vision-language understanding. The models take images, video, and text as input and produce high-quality text output, aiming for strong performance and efficient deployment. The most notable models in the series are currently MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.6. The sections below list the tutorials and guidance available for each version.

MiniCPM-V 2.6

MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. With 8B parameters in total, it surpasses GPT-4V in single-image, multi-image, and video understanding. It outperforms GPT-4o mini, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding, and it improves on MiniCPM-Llama3-V 2.5's strengths: strong OCR capability, trustworthy behavior, multilingual support, and end-side deployment. Thanks to its superior token density, MiniCPM-V 2.6 supports, for the first time, real-time video understanding on end-side devices such as iPad.

Deployment Tutorial
Training Tutorial
Quantization Tutorial

MiniCPM-Llama3-V 2.5

MiniCPM-Llama3-V 2.5 is built on SigLIP-400M and Llama3-8B-Instruct, with 8B parameters in total. It improves significantly on the performance of MiniCPM-V 2.0.

Quantization Tutorial
Training Tutorial
End-side Deployment
Deployment Tutorial
HD Decoding Tutorial
Model Structure
