User-friendly tool for seamless pre-training and visual instruction tuning of Vision-Language Models
vlm-recipes is a tool designed to make the training of Vision-Language Models (VLMs) easy and efficient. With an intuitive interface and flexible configuration options, researchers and developers can effortlessly manage training on any VLM architecture or dataset. The tool supports distributed training on large GPU clusters using PyTorch FullyShardedDataParallel (FSDP) as its backend and offers extensive customization, enabling users to leverage cutting-edge techniques with ease.
What sets vlm-recipes apart is its seamless integration with Hugging Face Transformers, allowing you to continue training or perform fine-tuning on VLMs with minimal changes. This means there’s no need to convert Hugging Face Transformers checkpoints or deal with complex workflows—just focus on refining your model.
| Feature | vlm-recipes | llm-recipes |
|---|---|---|
| VLM Support | ✅ | ❌ |
| LLM Support | ❌ | ✅ |
The currently supported VLMs are as follows:

- Idefics2
- LLaVA-NeXT
This library is experimental and under active development. Breaking changes may be introduced in the future to improve the usability and performance of the library.
The companion project, llm-recipes, is available at https://github.com/okoge-kaz/llm-recipes.
This package has been tested with Python 3.10 and 3.11. The recommended environment uses CUDA Toolkit 12.1.
To install the required packages, simply run:
```bash
pip install -r requirements.txt
```
Note: The requirements.txt assumes that CUDA Toolkit 12.1 is installed on your system.
For multi-node support, ensure you have the following dependencies installed:
```bash
module load openmpi/4.x.x
pip install mpi4py
```
For GPU-accelerated FlashAttention, follow these steps:
```bash
pip install ninja packaging wheel
pip install flash-attn --no-build-isolation
```
- `src/llama_recipes/utils/visual_instruct.py`: DataLoader for Visual Instruction Tuning
- `src/llama_recipes/datasets/llava_pretrain.py`: LLaVA format dataset
If you use LLaVA formatted datasets (e.g., LLaVA-PreTrain, LLaVA-Instruct), please prepare the dataset in the following format:
```json
{
  "image": "/image/path/to/image_1.png",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nCould you explain what is happening in this image?"
    },
    {
      "from": "gpt",
      "value": "This is a picture of a cat sitting on a chair."
    }
  ]
}
```
If you want to train with your own dataset, change the dataset class in `src/llama_recipes/datasets/llava_pretrain.py` or implement your own dataset class, as sketched below.
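As a starting point, a custom dataset class for this format might look like the following sketch. The class name, processor handling, and preprocessing here are illustrative assumptions, not the actual implementation in `src/llama_recipes/datasets/llava_pretrain.py`:

```python
import json

from PIL import Image
from torch.utils.data import Dataset


class MyLLaVADataset(Dataset):
    """Hypothetical dataset reading the LLaVA-format JSON shown above."""

    def __init__(self, json_path: str, processor):
        # Assumption: the JSON file contains a list of entries, each with
        # an "image" path and a "conversations" list.
        with open(json_path) as f:
            self.entries = json.load(f)
        self.processor = processor  # e.g. a Hugging Face VLM processor

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        entry = self.entries[idx]
        image = Image.open(entry["image"]).convert("RGB")
        # Join the human/gpt turns into a single training text.
        text = "\n".join(turn["value"] for turn in entry["conversations"])
        inputs = self.processor(text=text, images=[image], return_tensors="pt")
        # Drop the batch dimension the processor adds.
        return {k: v.squeeze(0) for k, v in inputs.items()}
```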
We provide example scripts for visual instruction tuning: Idefics2 in `scripts/tsubame/llava_pretrain/idefics2-8b.sh` and LLaVA-NeXT in `scripts/tsubame/llava_pretrain/llava-next-7b.sh`. You can modify these scripts to suit your needs.
This section is currently under development; more information will be released soon.
vlm-recipes saves checkpoints in a simple PyTorch format. A checkpoint directory contains the following files:
```
model.pt  optimizer.pt  rng.pt  sampler.pt  scheduler.pt
```
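If you need to inspect such a checkpoint, the files can be opened with plain `torch.load`. The snippet below is a sketch that assumes `model.pt` and `optimizer.pt` hold state dicts written with `torch.save`; the exact contents may differ:

```python
import torch

# Hypothetical checkpoint directory; adjust the path to your run.
ckpt_dir = "/path/to/train/checkpoint/iter_0001000"

# Assumption: model.pt / optimizer.pt store state dicts saved with torch.save.
model_state = torch.load(f"{ckpt_dir}/model.pt", map_location="cpu")
optimizer_state = torch.load(f"{ckpt_dir}/optimizer.pt", map_location="cpu")

print(type(model_state))      # typically a dict mapping parameter names to tensors
print(list(model_state)[:5])  # peek at the first few parameter names
```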
You can convert the PyTorch format to the Hugging Face format using the following command:
```bash
ITERATION=1000
FORMATTED_ITERATION=$(printf "iter_%07d" $ITERATION)

BASE_MODEL_CHECKPOINT=/path/to/huggingface-checkpoint/idefics2-8b
CHECK_POINT_PATH=/path/to/train/checkpoint/${FORMATTED_ITERATION}/model.pt
HF_OUTPUT_PATH=/path/to/converted/checkpoint/${FORMATTED_ITERATION}

mkdir -p $HF_OUTPUT_PATH

python tools/checkpoint-convert/convert_ckpt.py \
  --model $BASE_MODEL_CHECKPOINT \
  --ckpt $CHECK_POINT_PATH \
  --out $HF_OUTPUT_PATH
```
(The complete conversion script is located at `tools/checkpoint-convert/scripts/convert.sh`.)
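Conceptually, the conversion loads the base Hugging Face model, replaces its weights with the trained state dict, and re-saves the model in Hugging Face format. The snippet below is a rough sketch of that idea; it is not the actual `tools/checkpoint-convert/convert_ckpt.py`, and it assumes `model.pt` contains a state dict compatible with the base model:

```python
import torch
from transformers import AutoModelForVision2Seq

# Illustrative paths matching the shell variables above.
base_model = "/path/to/huggingface-checkpoint/idefics2-8b"
trained_ckpt = "/path/to/train/checkpoint/iter_0001000/model.pt"
output_dir = "/path/to/converted/checkpoint/iter_0001000"

# Load the base architecture, then overwrite its weights with the trained ones.
model = AutoModelForVision2Seq.from_pretrained(base_model, torch_dtype=torch.bfloat16)
state_dict = torch.load(trained_ckpt, map_location="cpu")  # assumption: a plain state dict
model.load_state_dict(state_dict)

# Save in Hugging Face format so it can be loaded with from_pretrained().
model.save_pretrained(output_dir, safe_serialization=True)
```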
After checkpoint conversion, you can use the Hugging Face Transformers library to load the converted checkpoint and perform inference.
The following is an example of running inference with the converted checkpoint (Hugging Face format):
```bash
python tools/inference/inference.py \
  --model-path /path/to/huggingface-checkpoint/idefics2 \
  --processor-path /path/to/huggingface-processor/idefics2 \
  --image-path images/drive_situation_image.jpg \
  --prompt "In the situation in the image, is it permissible to start the car when the light turns green?"
```
(The complete inference script is located at `tools/inference/inference.sh`.)
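If you prefer to run inference directly from Python instead of `tools/inference/inference.py`, the following Idefics2-style sketch shows the general Hugging Face Transformers pattern; the paths, dtype, and generation settings are placeholders, not values taken from the repository:

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Illustrative paths; point them at your converted checkpoint and processor.
model_path = "/path/to/converted/checkpoint/iter_0001000"
processor_path = "/path/to/huggingface-processor/idefics2"

processor = AutoProcessor.from_pretrained(processor_path)
model = AutoModelForVision2Seq.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("images/drive_situation_image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "In the situation in the image, is it permissible "
                                     "to start the car when the light turns green?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```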
Below are some of the projects where we have directly used vlm-recipes:
- Turing Inc.'s GENIAC project (VLM training)
```bibtex
@software{vlm-recipes,
  author = {Kazuki Fujii and Daiki Shiono and Yu Yamaguchi and Taishi Nakamura and Rio Yokota},
  month = {Aug},
  title = {{vlm-recipes}},
  url = {https://github.com/turingmotors/vlm-recipes},
  version = {0.1.0},
  year = {2024}
}
```
This repository is based on results obtained from a project, JPNP20017, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).