User-friendly tool for seamless pre-training and visual instruction tuning of Vision-Language Models
vlm-recipes is a tool designed to make the training of Vision-Language Models (VLMs) easy and efficient. With an intuitive interface and flexible configuration options, researchers and developers can effortlessly manage training on any VLM architecture or dataset. The tool supports distributed training on large GPU clusters using PyTorch FullyShardedDataParallel (FSDP) as its backend and offers extensive customization, enabling users to leverage cutting-edge techniques with ease.
What sets vlm-recipes apart is its seamless integration with Hugging Face Transformers, allowing you to continue training or perform fine-tuning on VLMs with minimal changes. This means there’s no need to convert Hugging Face Transformers checkpoints or deal with complex workflows—just focus on refining your model.
| Feature | vlm-recipes | llm-recipes |
|---|---|---|
| VLM Support | ✅ | ❌ |
| LLM Support | ❌ | ✅ |
The currently supported VLMs are as follows:

- Idefics2
- LLaVA-NeXT
This library is experimental and under active development. Breaking changes may be introduced in the future to improve the usability and performance of the library.
The companion project, llm-recipes, is available at https://github.com/okoge-kaz/llm-recipes.
This package has been tested with Python 3.10 and 3.11. The recommended environment uses CUDA Toolkit 12.1.
To install the required packages, simply run:
```bash
pip install -r requirements.txt
```
Note: The requirements.txt assumes that CUDA Toolkit 12.1 is installed on your system.
For multi-node support, ensure you have the following dependencies installed:
```bash
module load openmpi/4.x.x
pip install mpi4py
```
For GPU-accelerated FlashAttention, follow these steps:
```bash
pip install ninja packaging wheel
pip install flash-attn --no-build-isolation
```
- `src/llama_recipes/utils/visual_instruct.py`: DataLoader for Visual Instruction Tuning
- `src/llama_recipes/datasets/llava_pretrain.py`: LLaVA format dataset
If you use LLaVA formatted datasets (e.g., LLaVA-PreTrain, LLaVA-Instruct), please prepare the dataset in the following format:
```json
{
  "image": "/image/path/to/image_1.png",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nCould you explain what is happening in this image?"
    },
    {
      "from": "gpt",
      "value": "This is a picture of a cat sitting on a chair."
    }
  ]
}
```
If you want to train with your own dataset, change the dataset class in `src/llama_recipes/datasets/llava_pretrain.py` or implement your own dataset class, as sketched below.
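As a starting point, a custom dataset class for this format might look like the following sketch. The class name, processor handling, and preprocessing here are illustrative assumptions, not the actual implementation in `src/llama_recipes/datasets/llava_pretrain.py`:

```python
import json

from PIL import Image
from torch.utils.data import Dataset


class MyLLaVADataset(Dataset):
    """Hypothetical dataset reading the LLaVA-format JSON shown above."""

    def __init__(self, json_path: str, processor):
        # Assumption: the JSON file contains a list of entries, each with
        # an "image" path and a "conversations" list.
        with open(json_path) as f:
            self.entries = json.load(f)
        self.processor = processor  # e.g. a Hugging Face VLM processor

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        entry = self.entries[idx]
        image = Image.open(entry["image"]).convert("RGB")
        # Join the human/gpt turns into a single training text.
        text = "\n".join(turn["value"] for turn in entry["conversations"])
        inputs = self.processor(text=text, images=[image], return_tensors="pt")
        # Drop the batch dimension the processor adds.
        return {k: v.squeeze(0) for k, v in inputs.items()}
```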
We provide example scripts for visual instruction tuning: Idefics2 in `scripts/tsubame/llava_pretrain/idefics2-8b.sh` and LLaVA-NeXT in `scripts/tsubame/llava_pretrain/llava-next-7b.sh`. You can modify these scripts to suit your needs.
This section is currently under development; more information will be released soon.
vlm-recipes saves checkpoints in a simple PyTorch format. A checkpoint directory contains the following files:
```
model.pt  optimizer.pt  rng.pt  sampler.pt  scheduler.pt
```
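If you need to inspect such a checkpoint, the files can be opened with plain `torch.load`. The snippet below is a sketch that assumes `model.pt` and `optimizer.pt` hold state dicts written with `torch.save`; the exact contents may differ:

```python
import torch

# Hypothetical checkpoint directory; adjust the path to your run.
ckpt_dir = "/path/to/train/checkpoint/iter_0001000"

# Assumption: model.pt / optimizer.pt store state dicts saved with torch.save.
model_state = torch.load(f"{ckpt_dir}/model.pt", map_location="cpu")
optimizer_state = torch.load(f"{ckpt_dir}/optimizer.pt", map_location="cpu")

print(type(model_state))      # typically a dict mapping parameter names to tensors
print(list(model_state)[:5])  # peek at the first few parameter names
```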
You can convert the PyTorch format to the Hugging Face format using the following command:
```bash
ITERATION=1000
FORMATTED_ITERATION=$(printf "iter_%07d" $ITERATION)

BASE_MODEL_CHECKPOINT=/path/to/huggingface-checkpoint/idefics2-8b
CHECK_POINT_PATH=/path/to/train/checkpoint/${FORMATTED_ITERATION}/model.pt
HF_OUTPUT_PATH=/path/to/converted/checkpoint/${FORMATTED_ITERATION}

mkdir -p $HF_OUTPUT_PATH

python tools/checkpoint-convert/convert_ckpt.py \
  --model $BASE_MODEL_CHECKPOINT \
  --ckpt $CHECK_POINT_PATH \
  --out $HF_OUTPUT_PATH
```
(The complete conversion script is located at `tools/checkpoint-convert/scripts/convert.sh`.)
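Conceptually, the conversion loads the base Hugging Face model, replaces its weights with the trained state dict, and re-saves the model in Hugging Face format. The snippet below is a rough sketch of that idea; it is not the actual `tools/checkpoint-convert/convert_ckpt.py`, and it assumes `model.pt` contains a state dict compatible with the base model:

```python
import torch
from transformers import AutoModelForVision2Seq

# Illustrative paths matching the shell variables above.
base_model = "/path/to/huggingface-checkpoint/idefics2-8b"
trained_ckpt = "/path/to/train/checkpoint/iter_0001000/model.pt"
output_dir = "/path/to/converted/checkpoint/iter_0001000"

# Load the base architecture, then overwrite its weights with the trained ones.
model = AutoModelForVision2Seq.from_pretrained(base_model, torch_dtype=torch.bfloat16)
state_dict = torch.load(trained_ckpt, map_location="cpu")  # assumption: a plain state dict
model.load_state_dict(state_dict)

# Save in Hugging Face format so it can be loaded with from_pretrained().
model.save_pretrained(output_dir, safe_serialization=True)
```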
After checkpoint conversion, you can use the Hugging Face Transformers library to load the converted checkpoint and perform inference.
The following is an example of running inference with the converted checkpoint (Hugging Face format):
```bash
python tools/inference/inference.py \
  --model-path /path/to/huggingface-checkpoint/idefics2 \
  --processor-path /path/to/huggingface-processor/idefics2 \
  --image-path images/drive_situation_image.jpg \
  --prompt "In the situation in the image, is it permissible to start the car when the light turns green?"
```
(The complete inference script is located at `tools/inference/inference.sh`.)
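If you prefer to run inference directly from Python instead of `tools/inference/inference.py`, the following Idefics2-style sketch shows the general Hugging Face Transformers pattern; the paths, dtype, and generation settings are placeholders, not values taken from the repository:

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Illustrative paths; point them at your converted checkpoint and processor.
model_path = "/path/to/converted/checkpoint/iter_0001000"
processor_path = "/path/to/huggingface-processor/idefics2"

processor = AutoProcessor.from_pretrained(processor_path)
model = AutoModelForVision2Seq.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("images/drive_situation_image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "In the situation in the image, is it permissible "
                                     "to start the car when the light turns green?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```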
Below are some of the projects where we have directly used vlm-recipes:
- Turing Inc.'s GENIAC project (VLM training)
```bibtex
@software{vlm-recipes,
  author = {Kazuki Fujii and Daiki Shiono and Yu Yamaguchi and Taishi Nakamura and Rio Yokota},
  month = {Aug},
  title = {{vlm-recipes}},
  url = {https://github.com/turingmotors/vlm-recipes},
  version = {0.1.0},
  year = {2024}
}
```
This repository is based on results obtained from a project, JPNP20017, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).