This is the PyTorch implementation of our work at ECCV 2020 (Spotlight). The repository mainly includes three parts: (1) RoI feature extraction; (2) training and inference; and (3) generation of relation-aware trajectories.
Fixed an issue with unstable results [2021/10/07].
Anaconda 3, Python 3.6.5, PyTorch 0.4.1 (a higher version is fine once the features are extracted) and CUDA >= 9.0. For other libraries, please refer to requirements.txt.
Please create an environment for this project using Anaconda 3 (install Anaconda first):
>conda create -n envname python=3.6.5 # Create
>conda activate envname # Enter
>pip install -r requirements.txt # Install the provided libs
>sh vRGV/lib/make.sh # Set up the environment for detection; make sure nvcc is available
Please download the data here. The folder ground_data should be placed at the same directory level as vRGV. Please merge the downloaded vRGV folder with this repo.
Please download the videos here and extract the frames into ground_data. The directory structure should be: ground_data/vidvrd/JPEGImages/ILSVRC2015_train_xxx/000000.JPEG.
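To sanity-check the extracted frames before running detection, a small snippet like the following can be used (the path pattern is taken from the layout above; the snippet itself is only illustrative and not part of the repo):

```python
import glob
import os

# Expected layout: ground_data/vidvrd/JPEGImages/ILSVRC2015_train_xxx/000000.JPEG
root = 'ground_data/vidvrd/JPEGImages'
videos = sorted(glob.glob(os.path.join(root, 'ILSVRC2015_train_*')))
print('{} video folders found'.format(len(videos)))
if videos:
    frames = sorted(glob.glob(os.path.join(videos[0], '*.JPEG')))
    print('{}: {} frames'.format(os.path.basename(videos[0]), len(frames)))
```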
Feature extraction (requires about 100 GB of storage, because all detected bboxes are dumped along with their features; this can be greatly reduced by changing detect_frame.py to return only the top-40 bboxes and save them as .npz files, as sketched after the command below):
./detection.sh 0 val #(or train)
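A minimal sketch of the storage-saving change mentioned above, assuming detect_frame.py yields per-frame boxes, scores, and features as NumPy arrays (the helper name and array shapes are illustrative, not the repo's actual interface):

```python
import numpy as np

def save_topk_detections(out_path, boxes, scores, features, k=40):
    """Keep only the k highest-scoring detections of one frame and store them compressed.
    boxes: (N, 4), scores: (N,), features: (N, D) arrays."""
    order = np.argsort(scores)[::-1][:k]          # indices of the top-k boxes by score
    np.savez_compressed(
        out_path,                                 # e.g. '.../000000.npz'
        boxes=boxes[order],
        scores=scores[order],
        features=features[order],
    )

# A frame's detections can later be reloaded with:
# data = np.load('.../000000.npz'); data['boxes'], data['features']
```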
Sample video features:
cd tools
python sample_video_feature.py
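The exact sampling strategy lives in sample_video_feature.py; as a rough illustration of the general idea only, the sketch below samples a fixed number of frames uniformly across a video (the function name and frame count are assumptions, not the script's actual values):

```python
import numpy as np

def sample_frame_indices(num_frames, num_samples=120):
    """Return `num_samples` frame indices spread uniformly over a video."""
    if num_frames <= num_samples:
        return np.arange(num_frames)
    return np.linspace(0, num_frames - 1, num=num_samples).round().astype(int)

# e.g. sample_frame_indices(900, 120) picks 120 evenly spaced frames out of 900
```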
Test. You can use our provided model to verify the features and environment:
./ground.sh 0 val # Output the relation-aware spatio-temporal attention
python generate_track_link.py # Generate relation-aware trajectories with the Viterbi algorithm (see the sketch below)
python eval_ground.py # Evaluate the performance
You should obtain an accuracy of Acc_R: 24.58%.
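generate_track_link.py links per-frame boxes into trajectories with a Viterbi-style dynamic program. The sketch below illustrates the general idea only (a score combining per-box detection/attention scores with IoU-based temporal consistency); it is not the repository's exact scoring or implementation:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box (x1, y1, x2, y2) and an (M, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-8)

def viterbi_link(boxes_per_frame, scores_per_frame, lam=1.0):
    """Pick one box per frame so that the sum of box scores plus
    lam * IoU between consecutive boxes is maximized.
    boxes_per_frame: list of (N_t, 4) arrays; scores_per_frame: list of (N_t,) arrays."""
    T = len(boxes_per_frame)
    dp = scores_per_frame[0].astype(float)   # best accumulated score ending at each box of frame 0
    back = []                                # back-pointers for frames 1..T-1
    for t in range(1, T):
        # trans[i, j] = lam * IoU(box i in frame t, box j in frame t-1)
        trans = np.stack([lam * iou(b, boxes_per_frame[t - 1]) for b in boxes_per_frame[t]])
        total = trans + dp[None, :]
        back.append(total.argmax(axis=1))
        dp = scores_per_frame[t] + total.max(axis=1)
    path = [int(dp.argmax())]
    for t in range(T - 2, -1, -1):           # backtrack from the last frame
        path.append(int(back[t][path[-1]]))
    return path[::-1]                        # one chosen box index per frame
```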
Train. If you want to train the model from scratch, please apply a two-stage training scheme: 1) train a basic model without relation attendance, and 2) load the reconstruction part of the pre-trained model and train the whole model (with the same learning rate). For implementation, turn [pretrain] off/on in line 52 of ground.py, and switch between lines 6 & 7 in ground_relation.py for 1st- and 2nd-stage training, respectively. Also, change the model files in lines 69 & 70 of ground_relation.py to the best model obtained at the first stage before 2nd-stage training (a sketch of this partial checkpoint loading follows the command below).
./ground.sh 0 train # Train the model with GPU id 0
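As a minimal sketch of the stage-2 initialization described above (the checkpoint layout and the 'reconstruct.' parameter prefix are assumptions, not the repo's actual names), only the reconstruction-branch weights of the stage-1 checkpoint are copied into the full model:

```python
import torch

def load_reconstruction_weights(full_model, ckpt_path, prefix='reconstruct.'):
    """Copy only the reconstruction-branch parameters of a stage-1 checkpoint into the
    full stage-2 model; every other parameter keeps its fresh initialization."""
    ckpt = torch.load(ckpt_path, map_location='cpu')
    state = ckpt.get('state_dict', ckpt)      # handle both raw and wrapped state dicts
    model_state = full_model.state_dict()
    kept = {k: v for k, v in state.items() if k.startswith(prefix) and k in model_state}
    model_state.update(kept)
    full_model.load_state_dict(model_state)
    return full_model
```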
The results may differ slightly (+/-0.5%). For comparison, please follow the results reported in our paper.
Qualitative results (visualizations) are shown for the following example queries: bicycle-jump_beneath-person, person-feed-elephant, person-stand_above-bicycle, dog-watch-turtle, person-ride-horse, person-ride-bicycle, person-drive-car, and bicycle-move_toward-car.
@inproceedings{xiao2020visual,
title={Visual Relation Grounding in Videos},
author={Xiao, Junbin and Shang, Xindi and Yang, Xun and Tang, Sheng and Chua, Tat-Seng},
booktitle={European Conference on Computer Vision},
pages={447--464},
year={2020},
organization={Springer}
}
NUS © NExT++