ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation
Ali Athar, Xueqing Deng, Liang-Chieh Chen
- We introduce ViCaS, a human-annotated video dataset containing thousands of videos with detailed captions, along with pixel-precise segmentation masks for salient objects with phrase-grounding to the caption.
- Our benchmark contains two tasks: (a) Video Captioning, which evaluates high-level video understanding, and (b) Language-Guided Video Instance Segmentation (LG-VIS), which evaluates fine-grained, pixel-level localization based on text prompts.
- We propose Video-LLaVA-Seg, an effective baseline architecture that can tackle both of our benchmark tasks with a single, end-to-end trained model.
- 12 Dec 2024: Uploaded v0.1 of the dataset with annotations for 7,331 videos.
- Release Video-LLaVA-Seg code
- Release Video-LLaVA-Seg model weights
- Install the required packages:
pip3 install -r requirements.txt
- Install ffmpeg
sudo apt install ffmpeg
You can visualize a few samples without downloading the whole dataset. We provide a few example videos under demo_data/videos. First, decode these videos into image frames by running:
bash demo_data/video_to_frames.sh
The frames will be saved to demo_data/video_frames. Then you can either run the Jupyter notebook or the equivalent Python script dataset_demo.py.
The annotations are hosted on HuggingFace. Clone the HF repo to a directory which we will call $VICAS_DIR.
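If you prefer a Python workflow to git clone, a minimal sketch using huggingface_hub's snapshot_download is shown below; the repo ID is a placeholder, so substitute the actual ViCaS annotations repo:
# Download the annotation repo with huggingface_hub instead of git-cloning it.
# NOTE: "<hf-user>/ViCaS" is a placeholder -- replace it with the actual repo ID.
from huggingface_hub import snapshot_download

vicas_dir = snapshot_download(
    repo_id="<hf-user>/ViCaS",   # placeholder repo ID
    repo_type="dataset",         # assumption: the annotations are hosted as a dataset repo
    local_dir="/path/to/ViCaS",  # this directory becomes $VICAS_DIR
)
print("Annotations downloaded to:", vicas_dir)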
For copyright reasons, we only provide the annotations (captions and segmentation masks). You have two options to obtain the videos:
Option 1: Download and preprocess the videos
Download the Oops dataset videos from here and put them under some directory $OOPS_VIDEOS_DIR with train and val subdirectories. Then, run the preprocessing script:
python3 vicas/preprocess/gather_videos.py --vicas_dir $VICAS_DIR --oops_dir $OOPS_VIDEOS_DIR
This will create a directory at $VICAS_DIR/videos and put the required videos there with the video ID prepended to each filename.
Option 2: Download preprocessed videos
Alexey Nekrasov from the research community has been working with the dataset and was kind enough to upload his preprocessed data to HuggingFace. Clone his repo and simply put the videos directory under $VICAS_DIR.
To train and evaluate LG-VIS, you also need to decode the videos into image frames:
bash vicas/preprocess/videos_to_frames.sh $VICAS_DIR/videos $VICAS_DIR/video_frames
The image frames for each video will be saved to a directory at $VICAS_DIR/video_frames/<video_id>.
Once the videos are downloaded and decoded into frames, the file structure for $VICAS_DIR should look like this:
$VICAS_DIR
├── videos
│ ├── <video #1.mp4>
│ ├── <video #2.mp4>
│ ├── ...
├── video_frames
│ ├── <video #1>
│ │ └── 00000.jpg
│ │ └── 00001.jpg
│ │ └── ...
│ ├── <video #2>
│ │ └── 00000.jpg
│ │ └── 00001.jpg
│ ├── ...
├── annotations
│ ├── v0.1
│ │ └── <video #1.json>
│ │ └── <video #2.json>
│ │ └── ...
├── splits
│ ├── v0.1
│ │ └── train.json
│ │ └── val.json
│ │ └── test.json
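As a quick sanity check of the layout above, the following minimal sketch verifies that every annotation file under annotations/v0.1 has a matching directory of decoded frames under video_frames; it assumes nothing beyond the structure shown in the tree.
# Sanity-check the $VICAS_DIR layout: every annotations/v0.1/<video_id>.json
# should have decoded frames under video_frames/<video_id>.
import os
import sys

vicas_dir = sys.argv[1]  # path to $VICAS_DIR
annos_dir = os.path.join(vicas_dir, "annotations", "v0.1")
frames_dir = os.path.join(vicas_dir, "video_frames")

missing = []
for fname in sorted(os.listdir(annos_dir)):
    if not fname.endswith(".json"):
        continue
    video_id = os.path.splitext(fname)[0]
    frame_dir = os.path.join(frames_dir, video_id)
    if not os.path.isdir(frame_dir) or not os.listdir(frame_dir):
        missing.append(video_id)

print(f"{len(missing)} annotated videos have no decoded frames: {missing[:10]}")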
We provide an easy-to-use API under vicas/dataset.py to parse the dataset and its JSON annotations. Please look at the ViCaSVideo class definition to see how the JSON fields should be parsed. Refer to the Jupyter notebook or Python demo to see various use-cases for the API.
If you're only interested in the captions, just use the caption_parsed_en_gpt value in the annotation file:
import json
with open("<VICAS_DIR>/annotations/v0.1/00000.json") as fh:
    content = json.load(fh)
caption = content["caption_parsed_en_gpt"]
We also provide captions in Chinese. These were obtained by machine-translating the English captions; note that no quality checks were applied to them, so they may contain errors.
from vicas.caption_parsing import parse_caption
caption_raw = content["caption_raw_cn"] # with object-grounding syntax
caption_parsed = parse_caption(caption_raw).parsed # regular, human-readable caption
The predictions are in a per-video JSON format similar to the ground truth. A set of ~1000 prediction files is provided in the HF repo for reference. In short, each JSON file needs to have the following fields: video_id, pred_caption, and pred_lgvis_masks. You can inspect the example predictions to see the exact format.
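Before running the full evaluation, you can verify that your prediction files at least contain the three required fields named above. The sketch below checks only field presence; the value formats (e.g. how the masks are encoded) are best taken from the example predictions in the HF repo.
# Check that every prediction JSON contains the three required top-level fields.
# Value formats are not checked here; see the example predictions for those details.
import json
import os
import sys

pred_dir = sys.argv[1]  # directory of per-video prediction JSON files
required = ("video_id", "pred_caption", "pred_lgvis_masks")

for fname in sorted(os.listdir(pred_dir)):
    if not fname.endswith(".json"):
        continue
    with open(os.path.join(pred_dir, fname)) as fh:
        pred = json.load(fh)
    missing = [k for k in required if k not in pred]
    if missing:
        print(f"{fname}: missing fields {missing}")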
Evaluating captioning accuracy requires Llama3-70B. Refer to the official website to download the Meta-Llama-3-70B-Instruct model checkpoint. We use the original (3.0) model version. You will need 8 GPUs to run this model. We will call the checkpoint directory $LLAMA3_MODEL_DIR; it should contain tokenizer.model and several .pth files. With this setup, you can run the evaluation script as follows:
bash vicas/evaluation/run.sh --pred_dir /path/to/pred --gt_dir /path/to/gt --llama_ckpt_dir $LLAMA3_MODEL_DIR --split {val,test} -o /path/to/eval_output.json
The calculated scores will be printed and also written to /path/to/eval_output.json
The following instructions are for evaluating just one of the tasks. In this case, the prediction files need not contain the fields for the other task. Note that LG-VIS evaluation does not require any GPUs.
- Video Captioning Only:
torchrun --nproc_per_node=8 --master_port 2222 vicas/evaluation/main.py --pred_dir /path/to/pred --gt_dir $VICAS_DIR/annotations/v0.1 --llama_ckpt_dir $LLAMA3_MODEL_DIR --split {val,test} --skip_masks -o /path/to/eval_output.json
- LG-VIS Only:
python3 vicas/evaluation/main.py --pred_dir /path/to/pred --gt_dir $VICAS_DIR/annotations/v0.1 --split {val,test} --skip_captions -o /path/to/eval_output.json
For further details about the launch arguments for the eval script, run python3 vicas/evaluation/main.py --help.
- This dataset cannot be used for commercial purposes. It has been created for research purposes only.
- This is not an official ByteDance product.
@article{athar2024vicas,
  author  = {Ali Athar and Xueqing Deng and Liang-Chieh Chen},
  title   = {ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation},
  journal = {arXiv preprint},
  year    = {2024}
}
- Open-LLaVA-NeXT: We built the Video-LLaVA-Seg codebase on top of this.
- SAM2: We adopted this model as our segmentation network.