PyTorch implementation of the paper:
PAN: Towards Fast Action Recognition via Learning Persistence of Appearance
Can Zhang, Yuexian Zou*, Guang Chen and Lei Gan.
[ArXiv]
[12 Aug 2020] We have released the codebase and models of PAN.
Efficiently modeling dynamic motion information in videos is crucial for the action recognition task. Most state-of-the-art methods rely heavily on dense optical flow as the motion representation. Although combining optical flow with RGB frames as input can achieve excellent recognition performance, optical flow extraction is very time-consuming. This undoubtedly counts against real-time action recognition. In this paper, we shed light on fast action recognition by lifting the reliance on optical flow. We design a novel motion cue called Persistence of Appearance (PA) that focuses more on distilling the motion information at boundaries. Extensive experiments show that our PA is over 1000x faster (8196fps vs. 8fps) than conventional optical flow in terms of motion modeling speed.
Please make sure the following libraries are installed successfully:
- PyTorch >= 1.0
- TensorboardX
- tqdm
- scikit-learn
Following common practice, we first extract videos into frames for fast reading. Please refer to the TSN repo for a detailed guide to data pre-processing. We have successfully trained on the Kinetics, UCF101, HMDB51, Something-Something-V1 and V2, and Jester datasets with this codebase. The processing of video data can be summarized in 3 steps:
- Extract frames from videos:
  - For Something-Something-V2 dataset, please use tools/vid2img_sthv2.py
  - For Kinetics dataset, please use tools/vid2img_kinetics.py
- Generate file lists needed for dataloader:
  - Each line of the list file will contain a tuple of (extracted video frame folder name, video frame number, and video groundtruth class). A list file looks like this:

        video_frame_folder 100 10
        video_2_frame_folder 150 31
        ...

  - Or you can use off-the-shelf tools provided by other repos:
    - For Something-Something-V1 & V2 datasets, please use tools/gen_label_sthv1.py & tools/gen_label_sthv2.py
    - For Kinetics dataset, please use tools/gen_label_kinetics.py
- Add the information to ops/dataset_configs.py
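The list-file format described above (folder name, frame count, class label, separated by spaces) can be sketched as follows. This is only an illustration, not the code in tools/: the names `VideoRecord`, `format_list_line`, and `parse_list_line` are hypothetical, and the sketch assumes folder names contain no whitespace.

```python
from typing import NamedTuple


class VideoRecord(NamedTuple):
    """One line of a list file (hypothetical helper, not part of the repo)."""
    folder: str      # extracted video frame folder name
    num_frames: int  # number of extracted frames
    label: int       # ground-truth class index


def format_list_line(record: VideoRecord) -> str:
    """Serialize a record as 'folder num_frames label'."""
    return f"{record.folder} {record.num_frames} {record.label}"


def parse_list_line(line: str) -> VideoRecord:
    """Parse 'folder num_frames label'; assumes no whitespace in the folder name."""
    folder, num_frames, label = line.split()
    return VideoRecord(folder, int(num_frames), int(label))


if __name__ == "__main__":
    records = [
        VideoRecord("video_frame_folder", 100, 10),
        VideoRecord("video_2_frame_folder", 150, 31),
    ]
    for rec in records:
        print(format_list_line(rec))
```

Parsing each line back into a typed record makes it easy to sanity-check a generated list (for example, that every frame count is positive) before training.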
The PA module aims to speed up the motion modeling procedure. It can simply be injected at the bottom of the network to lift the reliance on optical flow.
import torch
from ops.PAN_modules import PA
PA_module = PA(n_length=4) # adjacent '4' frames are sampled for computing PA
# shape of x: [N*T*m, 3, H, W]
x = torch.randn(5*8*4, 3, 224, 224)
# shape of PA_out: [N*T, m-1, H, W]
PA_out = PA_module(x) # torch.Size([40, 3, 224, 224])
The VAP module aims to adaptively emphasize expressive features and suppress less informative ones by observing global information across various timescales. It is adopted at the top of the network to achieve long-term temporal modeling.
import torch
from ops.PAN_modules import VAP
VAP_module = VAP(n_segment=8, feature_dim=2048, num_class=174, dropout_ratio=0.5)
# shape of x: [N*T, D]
x = torch.randn(5*8, 2048)
# shape of VAP_out: [N, num_class]
VAP_out = VAP_module(x) # torch.Size([5, 174])
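The shape comments in the two snippets above follow simple bookkeeping that can be checked without PyTorch. The helpers below are a hypothetical sketch (`pa_output_shape` and `vap_output_shape` are not part of the repo). They reproduce only the shape arithmetic stated in the comments, not what the modules compute. Reading N as the batch size, T as `n_segment`, and m as `n_length` is inferred from the snippets, and the `m >= 2` check is an assumption added for illustration.

```python
from typing import Tuple


def pa_output_shape(n: int, t: int, m: int, h: int, w: int) -> Tuple[int, int, int, int]:
    """Shape of the PA output for an input of shape [N*T*m, 3, H, W]."""
    if m < 2:
        # Assumption: m-1 output channels only makes sense for m >= 2.
        raise ValueError("expected at least 2 adjacent frames (m >= 2)")
    return (n * t, m - 1, h, w)


def vap_output_shape(n: int, num_class: int) -> Tuple[int, int]:
    """Shape of the VAP output for an input of shape [N*T, D]."""
    return (n, num_class)


if __name__ == "__main__":
    # N=5 videos, T=8 segments, m=4 adjacent frames, 224x224 frames
    print(pa_output_shape(5, 8, 4, 224, 224))  # (40, 3, 224, 224)
    print(vap_output_shape(5, 174))            # (5, 174)
```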
Here, we provide the pretrained PAN models on the Something-Something-V1 & V2 datasets. Recognizing actions in these datasets requires strong temporal modeling ability, as many action classes are symmetrical. PAN achieves state-of-the-art performance on these datasets. Notably, our method even surpasses optical flow based methods while using only RGB frames as input.
Something-Something-V1:

Model | Backbone | FLOPs * views | Val Top1 | Val Top5 | Checkpoints |
---|---|---|---|---|---|
PAN_Lite | ResNet-50 | 35.7G * 1 | 48.0 | 76.1 | [Google Drive] or [Weiyun] |
PAN_Full | ResNet-50 | 67.7G * 1 | 50.5 | 79.2 | [Google Drive] or [Weiyun] |
PAN_En | ResNet-50 | (46.6G+88.4G) * 2 | 53.4 | 81.1 | [Google Drive] or [Weiyun] |
PAN_En | ResNet-101 | (85.6G+166.1G) * 2 | 55.3 | 82.8 | [Google Drive] or [Weiyun] |

Something-Something-V2:

Model | Backbone | FLOPs * views | Val Top1 | Val Top5 | Checkpoints |
---|---|---|---|---|---|
PAN_Lite | ResNet-50 | 35.7G * 1 | 60.8 | 86.7 | [Google Drive] or [Weiyun] |
PAN_Full | ResNet-50 | 67.7G * 1 | 63.8 | 88.6 | [Google Drive] or [Weiyun] |
PAN_En | ResNet-50 | (46.6G+88.4G) * 2 | 66.2 | 90.1 | [Google Drive] or [Weiyun] |
PAN_En | ResNet-101 | (85.6G+166.1G) * 2 | 66.5 | 90.6 | [Google Drive] or [Weiyun] |
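To compare total inference cost across rows, the "FLOPs * views" column can be multiplied out. The sketch below assumes the column reads as (sum of per-view GFLOPs) times the number of views. `total_gflops` is a hypothetical helper, and the input numbers are copied from the tables above.

```python
from typing import List


def total_gflops(per_view_gflops: List[float], views: int) -> float:
    """Sum the per-view GFLOPs terms and multiply by the number of views."""
    return round(sum(per_view_gflops) * views, 1)


if __name__ == "__main__":
    rows = [
        ("Lite, ResNet-50", [35.7], 1),
        ("Full, ResNet-50", [67.7], 1),
        ("En, ResNet-50", [46.6, 88.4], 2),
        ("En, ResNet-101", [85.6, 166.1], 2),
    ]
    for name, terms, views in rows:
        print(f"{name}: {total_gflops(terms, views)} GFLOPs")
```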
For example, to test the PAN models on Something-Something-V1, you can first put the downloaded .pth.tar files into the "pretrained" folder and then run:
# test PAN_Lite
bash scripts/test/sthv1/Lite.sh
# test PAN_Full
bash scripts/test/sthv1/Full.sh
# test PAN_En
bash scripts/test/sthv1/En.sh
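Before launching the test scripts, it can help to confirm that the downloaded checkpoints are where the scripts expect them. The following is a minimal sketch using only the standard library. `find_checkpoints` is a hypothetical helper, and since the exact checkpoint filenames are not stated here, it only lists whatever `.pth.tar` files are present.

```python
from pathlib import Path
from typing import List


def find_checkpoints(folder: str = "pretrained") -> List[str]:
    """Return sorted names of *.pth.tar files in `folder` (empty if it is missing)."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.glob("*.pth.tar") if p.is_file())


if __name__ == "__main__":
    found = find_checkpoints()
    if found:
        print("Found checkpoints:", ", ".join(found))
    else:
        print("No .pth.tar files found in ./pretrained")
```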
We provide several scripts to train PAN with this repo. Please refer to the "scripts" folder for more details. For example, to train PAN on Something-Something-V1, you can run:
# train PAN_Lite
bash scripts/train/sthv1/Lite.sh
# train PAN_Full RGB branch
bash scripts/train/sthv1/Full_RGB.sh
# train PAN_Full PA branch
bash scripts/train/sthv1/Full_PA.sh
Note that you should scale the learning rate with the batch size. For example, if you use a batch size of 256, you should set the learning rate to 0.04.
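One common way to apply this advice is linear scaling from a reference point. The sketch below takes the stated pair (batch size 256, learning rate 0.04) as that reference. The linear rule and the helper name `scaled_learning_rate` are assumptions for illustration, not something the training scripts are confirmed to do.

```python
def scaled_learning_rate(batch_size: int,
                         ref_batch_size: int = 256,
                         ref_lr: float = 0.04) -> float:
    """Scale the learning rate linearly with batch size (assumed rule)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return ref_lr * batch_size / ref_batch_size


if __name__ == "__main__":
    for bs in (32, 64, 128, 256):
        print(f"batch size {bs}: lr = {scaled_learning_rate(bs):.4f}")
```

Under this assumed rule, halving the batch size halves the learning rate, so a batch size of 64 would use roughly 0.01.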
This repository is built upon the following baseline implementations for the action recognition task.
Please [★star] this repo and [cite] the following arXiv paper if you find PAN useful for your research:
@misc{zhang2020pan,
title={PAN: Towards Fast Action Recognition via Learning Persistence of Appearance},
author={Can Zhang and Yuexian Zou and Guang Chen and Lei Gan},
year={2020},
eprint={2008.03462},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Or, if you prefer a "publication", you can cite our preliminary work at ACM MM 2019:
@inproceedings{zhang2019pan,
title={PAN: Persistent Appearance Network with an Efficient Motion Cue for Fast Action Recognition},
author={Zhang, Can and Zou, Yuexian and Chen, Guang and Gan, Lei},
booktitle={Proceedings of the 27th ACM International Conference on Multimedia},
pages={500--509},
year={2019}
}
For any questions, please feel free to open an issue or contact:
Can Zhang: [email protected]