GitHub - amazon-science/VideoLISA

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

Zechen Bai ¹ Tong He ² Haiyang Mei ¹ Pichao Wang ² Ziteng Gao ¹ Joya Chen ¹ Lei Liu ² Zheng Zhang ² Mike Zheng Shou ¹

NeurIPS 2024

¹ Show Lab, National University of Singapore ² Amazon

News

[2024-11-20] We released the training and inference code.
[2024-09-29] We released our paper on arXiv.

TODO

Release the inference code. (done)
Release the training code. (done)
Instructions on supporting more datasets.

Setup Environment

conda create -n videolisa python=3.10 -y
conda activate videolisa
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install flash-attn --no-build-isolation

Prepare Data

First, prepare the image data following the data preparation instructions in LISA.

The video datasets used in this project are described below. Note that the data paths for the video datasets are currently hard-coded in each dataset file in the utils folder, so you may need to adjust them for your setup.

MeViS

Download the dataset from the official release, then extract and organize the files. We expect the following directory structure:

mevis
├── train                       // Split Train
│   ├── JPEGImages
│   │   ├── <video #1  >
│   │   ├── <video #2  >
│   │   └── <video #...>
│   │
│   ├── mask_dict.json
│   └── meta_expressions.json
│
├── valid_u                     // Split Val^u
│   ├── JPEGImages
│   │   └── <video ...>
│   │
│   ├── mask_dict.json
│   └── meta_expressions.json
│
└── valid                       // Split Val
    ├── JPEGImages
    │   └── <video ...>
    │
    └── meta_expressions.json
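
Before training or evaluation, it can help to confirm that the extracted dataset matches the tree above. The following is a minimal sketch, not part of the repository: the function name `find_missing_mevis_entries` and the `EXPECTED_MEVIS_LAYOUT` table are hypothetical helpers written for illustration, and the check only looks for the entries listed in the tree (it does not validate the JSON contents or the individual video folders).

```python
from pathlib import Path

# Entries expected in each split, copied from the directory tree above.
# This table is an illustrative assumption, not something the repository ships.
EXPECTED_MEVIS_LAYOUT = {
    "train": ["JPEGImages", "mask_dict.json", "meta_expressions.json"],
    "valid_u": ["JPEGImages", "mask_dict.json", "meta_expressions.json"],
    "valid": ["JPEGImages", "meta_expressions.json"],
}


def find_missing_mevis_entries(root):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    missing = []
    for split, entries in EXPECTED_MEVIS_LAYOUT.items():
        for entry in entries:
            if not (root / split / entry).exists():
                missing.append(f"{split}/{entry}")
    return missing
```

Calling `find_missing_mevis_entries("/path/to/mevis")` returns an empty list when every listed folder and file is present, and otherwise the relative paths that still need to be put in place.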

Ref-YouTube-VOS and Ref-DAVIS-17

Prepare Ref-YouTube-VOS and Ref-DAVIS-17 datasets following the instructions of ReferFormer.

YouTube-VOS

Download the dataset from the website and organize it as follows:

YTVOS
├── train
│   ├── JPEGImages
│   ├── Annotations
│   └── meta.json

Training

We provide a sample training script in run_train.sh. In our own experiments, we used 8 nodes (64 A10 24G GPUs in total) to train the model. Under this setting, we set batch_size=2 and grad_accumulation_steps=1, so the global effective batch size is batch_size*grad_accumulation_steps*num_gpus=2*1*64=128. You can modify these settings to suit your hardware. Note that we did not explore other training hyper-parameters. If you don't have enough GPUs, you can still try training the model with a smaller batch size. One tip: if you use a smaller batch size, reducing the learning rate as well might help.

After training finishes, convert the DeepSpeed checkpoint into a single full-precision weight file:
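
The batch-size arithmetic above can be sketched in a few lines of Python. This is an illustrative calculation only: the helper names are hypothetical, the 8-GPU example is an assumed smaller setup, and the linear learning-rate scaling factor is a common heuristic rather than something validated in our experiments.

```python
import math


def effective_batch_size(batch_size, grad_accumulation_steps, num_gpus):
    """Global batch size seen by one optimizer step."""
    return batch_size * grad_accumulation_steps * num_gpus


def grad_accumulation_steps_for(target_global_batch, batch_size, num_gpus):
    """Accumulation steps needed to reach a target global batch size."""
    per_step = batch_size * num_gpus
    return max(1, math.ceil(target_global_batch / per_step))


def lr_scale_factor(reference_global_batch, new_global_batch):
    """Linear scaling heuristic (an assumption): shrink the learning rate
    in proportion to the global batch size."""
    return new_global_batch / reference_global_batch


# Setting described above: 64 GPUs, batch_size=2, grad_accumulation_steps=1.
reference = effective_batch_size(batch_size=2, grad_accumulation_steps=1, num_gpus=64)
print(reference)  # 128

# Hypothetical single node with 8 GPUs: match the global batch via accumulation.
print(grad_accumulation_steps_for(reference, batch_size=2, num_gpus=8))  # 8

# Or keep grad_accumulation_steps=4 (global batch 64) and scale the learning rate.
smaller = effective_batch_size(batch_size=2, grad_accumulation_steps=4, num_gpus=8)
print(lr_scale_factor(reference, smaller))  # 0.5
```

Gradient accumulation keeps the global batch size unchanged at the cost of longer steps, while scaling the learning rate accepts a smaller global batch; which works better for this model is untested.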

cd ./runs/video-lisa-3.8b-3k-iter/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin

Weight merging

Since the script performs LoRA training with DeepSpeed by default, you need to merge the LoRA weights back into the model after training.

CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
  --version="MBZUAI/LLaVA-Phi-3-mini-4k-instruct" \
  --weight="runs/video-lisa-3.8b-3k-iter/pytorch_model.bin" \
  --save_path="runs/video-lisa-3.8b-3k-iter/merged"
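
Both post-training commands hard-code the run directory, so for a differently named run every path has to be changed consistently. The sketch below derives the two commands from a single run directory. The helper `post_training_commands` is hypothetical and not part of the repository; it only assembles the working directory, environment and arguments shown above and does not execute anything.

```python
from pathlib import PurePosixPath


def post_training_commands(run_dir, base_model="MBZUAI/LLaVA-Phi-3-mini-4k-instruct"):
    """Build the checkpoint-conversion and LoRA-merge commands for one run.

    Returns a list of dicts with the working directory, extra environment
    variables and argument vector of each step, in execution order.
    """
    run = PurePosixPath(run_dir)
    convert = {
        # zero_to_fp32.py is run from inside the checkpoint folder.
        "cwd": str(run / "ckpt_model"),
        "env": {},
        "argv": ["python", "zero_to_fp32.py", ".", "../pytorch_model.bin"],
    }
    merge = {
        # The merge script is run from the repository root with GPUs hidden.
        "cwd": ".",
        "env": {"CUDA_VISIBLE_DEVICES": ""},
        "argv": [
            "python",
            "merge_lora_weights_and_save_hf_model.py",
            f"--version={base_model}",
            f"--weight={run / 'pytorch_model.bin'}",
            f"--save_path={run / 'merged'}",
        ],
    }
    return [convert, merge]


for step in post_training_commands("runs/video-lisa-3.8b-3k-iter"):
    print(step["cwd"], " ".join(step["argv"]))
```

Each returned step can be passed to a process launcher of your choice (for example `subprocess.run(step["argv"], cwd=step["cwd"], ...)`), with the second step started only after the first has succeeded.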

Evaluation

MeViS

Before running the following commands, check the scripts involved and configure the data paths.

# Step 1
bash evaluation/mevis_val_u/run_inference_mevis.sh

# Step 2
bash evaluation/mevis_val_u/run_eval_mevis.sh

Other Datasets

Ongoing.

Citation

To cite the paper and model, please use the following BibTeX entry:

@article{bai2024videolisa,
  title={One token to seg them all: Language instructed reasoning segmentation in videos},
  author={Bai, Zechen and He, Tong and Mei, Haiyang and Wang, Pichao and Gao, Ziteng and Chen, Joya and Liu, Lei and Zhang, Zheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2409.19603},
  year={2024}
}

Acknowledgments

This work is heavily based on LISA, LLaVA, LLaVA-pp, Segment-Anything and Phi-3. Thanks to all the authors for their great work!
