ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation
Ali Athar, Xueqing Deng, Liang-Chieh Chen
- We introduce ViCaS, a human-annotated video dataset containing thousands of videos with detailed captions, along with pixel-precise segmentation masks for salient objects with phrase-grounding to the caption.
- Our benchmark contains two tasks: (a) Video Captioning, which evaluates high-level video understanding, and (b) Language-Guided Video Instance Segmentation (LG-VIS), which evaluates fine-grained, pixel-level localization based on text prompts.
- We propose Video-LLaVA-Seg, an effective baseline architecture that can tackle both of our benchmark tasks with a single, end-to-end trained model.
- 12 Dec 2024: Uploaded v0.1 of the dataset with annotations for 7,331 videos.
- Release Video-LLaVA-Seg code
- Release Video-LLaVA-Seg model weights
- Install the required packages:
pip3 install -r requirements.txt
- Install ffmpeg
sudo apt install ffmpeg
You can visualize a few samples without downloading the whole dataset. We provide a few example videos under demo_data/videos. First, decode these videos into image frames by running:
bash demo_data/video_to_frames.sh
The frames will be saved to demo_data/video_frames. Then you can either run the Jupyter notebook or the equivalent Python script dataset_demo.py.
The annotations are hosted on HuggingFace. Clone the HF repo to a directory which we will call $VICAS_DIR.
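If you prefer a Python workflow to git clone, a minimal sketch using huggingface_hub's snapshot_download is shown below; the repo ID is a placeholder, so substitute the actual ViCaS annotations repo:
# Download the annotation repo with huggingface_hub instead of git-cloning it.
# NOTE: "<hf-user>/ViCaS" is a placeholder -- replace it with the actual repo ID.
from huggingface_hub import snapshot_download

vicas_dir = snapshot_download(
    repo_id="<hf-user>/ViCaS",   # placeholder repo ID
    repo_type="dataset",         # assumption: the annotations are hosted as a dataset repo
    local_dir="/path/to/ViCaS",  # this directory becomes $VICAS_DIR
)
print("Annotations downloaded to:", vicas_dir)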
For copyright reasons, we only provide the annotations (captions and segmentation masks). You have two options to obtain the videos:
Option 1: Download and preprocess the videos
Download the Oops dataset videos from here and put them under some directory $OOPS_VIDEOS_DIR with train and val subdirectories. Then, run the preprocessing script:
python3 vicas/preprocess/gather_videos.py --vicas_dir $VICAS_DIR --oops_dir $OOPS_VIDEOS_DIR
This will create a directory at $VICAS_DIR/videos and put the required videos there with the video ID prepended to each filename.
Option 2: Download preprocessed videos
Alexey Nekrasov from the research community has been working with the dataset and was kind enough to upload his preprocessed data to HuggingFace. Clone his repo and simply put the videos directory under $VICAS_DIR.
To train and evaluate LG-VIS, you also need to decode the videos into image frames:
bash vicas/preprocess/videos_to_frames.sh $VICAS_DIR/videos $VICAS_DIR/video_frames
The image frames for each video will be saved to a directory at $VICAS_DIR/video_frames/<video_id>.
Once the videos are downloaded and decoded into frames, the file structure for $VICAS_DIR should look like this:
$VICAS_DIR
├── videos
│ ├── <video #1.mp4>
│ ├── <video #2.mp4>
│ ├── ...
├── video_frames
│ ├── <video #1>
│ │ └── 00000.jpg
│ │ └── 00001.jpg
│ │ └── ...
│ ├── <video #2>
│ │ └── 00000.jpg
│ │ └── 00001.jpg
│ ├── ...
├── annotations
│ ├── v0.1
│ │ └── <video #1.json>
│ │ └── <video #2.json>
│ │ └── ...
├── splits
│ ├── v0.1
│ │ └── train.json
│ │ └── val.json
│ │ └── test.json
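As a quick sanity check of the layout above, the following minimal sketch verifies that every annotation file under annotations/v0.1 has a matching directory of decoded frames under video_frames; it assumes nothing beyond the structure shown in the tree.
# Sanity-check the $VICAS_DIR layout: every annotations/v0.1/<video_id>.json
# should have decoded frames under video_frames/<video_id>.
import os
import sys

vicas_dir = sys.argv[1]  # path to $VICAS_DIR
annos_dir = os.path.join(vicas_dir, "annotations", "v0.1")
frames_dir = os.path.join(vicas_dir, "video_frames")

missing = []
for fname in sorted(os.listdir(annos_dir)):
    if not fname.endswith(".json"):
        continue
    video_id = os.path.splitext(fname)[0]
    frame_dir = os.path.join(frames_dir, video_id)
    if not os.path.isdir(frame_dir) or not os.listdir(frame_dir):
        missing.append(video_id)

print(f"{len(missing)} annotated videos have no decoded frames: {missing[:10]}")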
We provide an easy-to-use API under vicas/dataset.py to parse the dataset and its JSON annotations. Please look at the ViCaSVideo class definition to see how the JSON fields should be parsed. Refer to the Jupyter notebook or Python demo to see various use-cases for the API.
If you're only interested in the captions, just use the caption_parsed_en_gpt value in the annotation file:
import json
with open("<VICAS_DIR>/annotations/v0.1/00000.json") as fh:
    content = json.load(fh)
caption = content["caption_parsed_en_gpt"]
We also provide captions in Chinese. These were obtained by machine-translating the English captions; note that no quality checks were applied to them, so they may contain errors.
from vicas.caption_parsing import parse_caption
caption_raw = content["caption_raw_cn"] # with object-grounding syntax
caption_parsed = parse_caption(caption_raw).parsed # regular, human-readable caption
The predictions are in a per-video JSON format similar to the ground truth. A set of ~1000 prediction files is provided in the HF repo for reference. In short, each JSON file needs to have the following fields: video_id, pred_caption, and pred_lgvis_masks. You can inspect the example predictions to see the exact format.
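Before running the full evaluation, you can verify that your prediction files at least contain the three required fields named above. The sketch below checks only field presence; the value formats (e.g. how the masks are encoded) are best taken from the example predictions in the HF repo.
# Check that every prediction JSON contains the three required top-level fields.
# Value formats are not checked here; see the example predictions for those details.
import json
import os
import sys

pred_dir = sys.argv[1]  # directory of per-video prediction JSON files
required = ("video_id", "pred_caption", "pred_lgvis_masks")

for fname in sorted(os.listdir(pred_dir)):
    if not fname.endswith(".json"):
        continue
    with open(os.path.join(pred_dir, fname)) as fh:
        pred = json.load(fh)
    missing = [k for k in required if k not in pred]
    if missing:
        print(f"{fname}: missing fields {missing}")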
Evaluating captioning accuracy requires Llama3-70B. Refer to the official website to download the Meta-Llama-3-70B-Instruct model checkpoint. We use the original (3.0) model version. You will need 8 GPUs to run this model. We will call the checkpoint directory $LLAMA3_MODEL_DIR; it should contain tokenizer.model and several .pth files. With this setup, you can run the evaluation script as follows:
bash vicas/evaluation/run.sh --pred_dir /path/to/pred --gt_dir /path/to/gt --llama_ckpt_dir $LLAMA3_MODEL_DIR --split {val,test} -o /path/to/eval_output.json
The calculated scores will be printed and also written to /path/to/eval_output.json
The following instructions are for evaluating just one of the tasks. In this case, the prediction files need not contain the fields for the other task. Note that LG-VIS evaluation does not require any GPUs.
- Video Captioning Only:
torchrun --nproc_per_node=8 --master_port 2222 vicas/evaluation/main.py --pred_dir /path/to/pred --gt_dir $VICAS_DIR/annotations/v0.1 --llama_ckpt_dir $LLAMA3_MODEL_DIR --split {val,test} --skip_masks -o /path/to/eval_output.json
- LG-VIS Only:
python3 vicas/evaluation/main.py --pred_dir /path/to/pred --gt_dir $VICAS_DIR/annotations/v0.1 --split {val,test} --skip_captions -o /path/to/eval_output.json
For further details about the launch arguments for the eval script, run python3 vicas/evaluation/main.py --help.
- This dataset cannot be used for commercial purposes. It has been created for research purposes only.
- This is not an official ByteDance product.
@article{athar2024vicas,
  author  = {Ali Athar and Xueqing Deng and Liang-Chieh Chen},
  title   = {ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation},
  journal = {arXiv preprint},
  year    = {2024}
}
- Open-LLaVA-NeXT: We built the Video-LLaVA-Seg codebase on top of this.
- SAM2: We adopted this model as our segmentation network.