Dominik edited this page Jun 11, 2015 · 10 revisions

State-of-the-art video classification

| UCF-101 | Accuracy (3-fold, %) | Notes |
| --- | --- | --- |
| Modeling Spatial-Temporal Clues (Wu) | 91.3 | 3 parts: spatial LSTM, motion LSTM, and fusion of spatial/motion CNNs. |
| LRCN+CNN (Donahue) | 82.92 | Weighted average of RGB (1/3) and flow (2/3) networks. LRCN placed after the first fully connected CNN layer. |
| Two-stream CNN (Simonyan), poster | 88.0 | Temporal + spatial ConvNets. Fusion using an SVM. Multi-task learning for the temporal ConvNet. Spatial ConvNet pre-trained on ILSVRC-2012, with only the last layer fine-tuned. |
| LSTM + 30-frame unroll (Yue-Hei Ng) | 88.6 | Optical flow + image frames. 1 FPS + motion information through flow. Re-used GoogLeNet. The LSTM performed better than the feature-pooling architecture. |
| Evaluating Two-Stream CNN (Ye) | 87.7 | Takes VGG19 and CNN_M and fine-tunes them (plus more). |
| Slow Fusion (Karpathy) | 65.4 | Trained on 1M sports videos first, then used transfer learning. Uses multiresolution CNNs (fovea and context streams) and slow fusion. |
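
The LRCN+CNN row combines its two networks with a weighted average of RGB (1/3) and flow (2/3). A minimal sketch of that kind of late fusion is shown here, assuming each network outputs one score per class. The function names `fuse_scores` and `predict` and the toy scores are hypothetical and are not taken from any of the listed papers.

```python
def fuse_scores(rgb_scores, flow_scores, w_rgb=1 / 3, w_flow=2 / 3):
    """Weighted average of per-class scores from an RGB and a flow network.

    The default weights follow the 1/3 (RGB) and 2/3 (flow) split noted in the
    table; everything else here is an illustrative assumption.
    """
    if len(rgb_scores) != len(flow_scores):
        raise ValueError("both networks must score the same set of classes")
    return [w_rgb * r + w_flow * f for r, f in zip(rgb_scores, flow_scores)]


def predict(rgb_scores, flow_scores):
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(rgb_scores, flow_scores)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical softmax outputs for a 3-class toy problem (not real UCF-101 scores).
rgb = [0.6, 0.3, 0.1]
flow = [0.2, 0.7, 0.1]

fused = fuse_scores(rgb, flow)
label = predict(rgb, flow)
print(fused, label)
```

In this toy case the RGB network alone would pick class 0, but the flow network carries twice the weight, so the fused prediction follows the flow network.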
