Reference ImageNet implementation of the SelecSLS convolutional neural network architecture proposed in XNect: Real-time Multi-Person 3D Motion Capture with a Single RGB Camera (SIGGRAPH 2020).
The architecture is 1.3-1.5x faster than ResNet-50, particularly for larger image sizes, while matching its accuracy on different tasks. It also needs substantially less memory during training, so it can be trained with larger batch sizes.
Better and more accurate models / snapshots are now available. See the additional ImageNet table below.
Code for pruning the model based on Implicit Filter Level Sparsity is now also included. Such sparsity arises naturally when training with adaptive gradient descent approaches and L2 regularization, and pruning the resulting inactive filters gives a further 10-30% speedup on the pretrained models with no loss in accuracy. See usage and results below.
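Concretely, the pruning removes channels whose learned BatchNorm scale (gamma) has collapsed below a threshold, which is what the `--gamma_thresh` flag of the evaluation scripts below controls. The snippet here is only a minimal sketch of the idea (not the repo's pruning code): for a placeholder network, it counts how many BatchNorm channels would survive a given gamma threshold.

```python
import torch.nn as nn
from torchvision.models import resnet50  # placeholder network, just for illustration

def count_surviving_channels(model: nn.Module, gamma_thresh: float = 1e-3) -> None:
    """For every BatchNorm2d layer, report how many channels have |gamma| above the
    threshold. On a network trained with adaptive optimizers and L2 regularization,
    many gammas collapse towards zero and the corresponding filters can be pruned."""
    total, kept = 0, 0
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d):
            mask = module.weight.detach().abs() > gamma_thresh
            total += mask.numel()
            kept += int(mask.sum())
            print(f"{name}: {int(mask.sum())}/{mask.numel()} channels kept")
    print(f"Overall: {kept}/{total} channels kept ({100.0 * kept / total:.1f}%)")

if __name__ == "__main__":
    # A randomly initialised model keeps everything (gammas start at 1);
    # run this on trained weights to see the implicit sparsity emerge.
    count_surviving_channels(resnet50(), gamma_thresh=1e-3)
```

In this repo, the actual pruning and batch norm fusion are applied through the `--pruned_and_fused True --gamma_thresh <t>` flags of the evaluation scripts shown further below.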
The inference times in the table below are measured on a TITAN X GPU using the accompanying scripts. The accuracy numbers for ResNet-50 are taken from torchvision, and those for VoVNet-39 from the VoVNet repository.
Forward pass time (ms) for different image resolutions and batch sizes, and ImageNet error (%):

| Model | 512x512, B=1 | 512x512, B=16 | 400x400, B=1 | 400x400, B=16 | 224x224, B=1 | 224x224, B=16 | Top-1 | Top-5 |
|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 15.0 | 175.0 | 11.0 | 114.0 | 7.2 | 39.0 | 23.9 | 7.1 |
| VoVNet-39 | 13.0 | 197.0 | 10.8 | 130.0 | 6.0 | 41.0 | 23.2 | 6.6 |
| SelecSLS-60 | 11.0 | 115.0 | 9.5 | 85.0 | 7.3 | 29.0 | 23.8 | 7.0 |
| SelecSLS-60 (P) | 10.2 | 102.0 | 8.2 | 71.0 | 6.1 | 25.0 | 23.8 | 7.0 |
| SelecSLS-84 | 16.1 | 175.0 | 13.7 | 124.0 | 9.9 | 42.3 | 23.3 | 6.9 |
| SelecSLS-84 (P) | 11.9 | 119.0 | 10.1 | 82.0 | 7.6 | 28.6 | 23.3 | 6.9 |

\* (P) indicates that the model has batch norm fusion and pruning applied.
The following models were trained with Cosine LR, Random Erasing, EMA, bicubic interpolation, and Color Jitter using rwightman/pytorch-image-models. The inference times here are measured on a TITAN Xp GPU using the accompanying scripts. The script for evaluating ImageNet performance uses bilinear interpolation, so the results reported here are marginally worse than they would be with bicubic interpolation at inference (see the preprocessing sketch after the table).
Forward pass time (ms) for different image resolutions and batch sizes, and ImageNet error (%):

| Model | 512x512, B=1 | 512x512, B=16 | 400x400, B=1 | 400x400, B=16 | 224x224, B=1 | 224x224, B=16 | Top-1 | Top-5 |
|---|---|---|---|---|---|---|---|---|
| SelecSLS-42_B | 6.4 | 60.8 | 5.8 | 42.1 | 5.7 | 14.7 | 22.9 | 6.6 |
| SelecSLS-60 | 7.4 | 69.4 | 7.3 | 47.6 | 7.1 | 16.8 | 22.1 | 6.1 |
| SelecSLS-60_B | 7.5 | 70.5 | 7.3 | 49.3 | 7.2 | 17.0 | 21.6 | 5.8 |
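For reference, the bilinear-vs-bicubic difference mentioned above is just the resize step of the validation preprocessing. A minimal sketch of the two pipelines with torchvision follows; the standard 256-resize / 224-crop ImageNet evaluation is assumed here rather than taken from the script, and a torchvision version recent enough to provide `InterpolationMode` is required.

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

# Bilinear resize: the interpolation used by the accompanying ImageNet evaluation script.
eval_bilinear = transforms.Compose([
    transforms.Resize(256, interpolation=InterpolationMode.BILINEAR),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])

# Bicubic resize: matches the bicubic training pipeline and typically
# recovers a small amount of accuracy at evaluation time.
eval_bicubic = transforms.Compose([
    transforms.Resize(256, interpolation=InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])
```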
The key feature of the proposed architecture is that unlike the full dense connectivity in DenseNets, SelecSLS uses a much sparser skip connectivity pattern that uses both long and short-range concatenative-skip connections. Additionally, the network architecture is more amenable to filter/channel pruning than ResNets. You can find more details about the architecture in the following paper, and details about implicit pruning in the CVPR 2019 paper.
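For intuition only, here is a rough sketch of what such a module could look like: a few convolutions whose intermediate outputs are concatenated (short-range skips) together with a feature map carried over from an earlier module (long-range skip), followed by a 1x1 fusion convolution. The channel sizes and structure are made up for illustration; see the model definition in this repo for the actual SelecSLS blocks.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k=3, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class SelectiveSkipBlock(nn.Module):
    """Illustrative block: concatenates its own intermediate outputs (short-range skips)
    with a feature map from an earlier block (long-range skip), then fuses with a 1x1 conv."""
    def __init__(self, in_ch, skip_ch, mid_ch, out_ch):
        super().__init__()
        self.conv1 = conv_bn_relu(in_ch, mid_ch)
        self.conv2 = conv_bn_relu(mid_ch, mid_ch)
        self.conv3 = conv_bn_relu(mid_ch, mid_ch)
        self.fuse = conv_bn_relu(3 * mid_ch + skip_ch, out_ch, k=1)

    def forward(self, x, long_range_skip):
        d1 = self.conv1(x)
        d2 = self.conv2(d1)
        d3 = self.conv3(d2)
        # selective concatenative skips instead of full dense connectivity
        return self.fuse(torch.cat([d1, d2, d3, long_range_skip], dim=1))

if __name__ == "__main__":
    block = SelectiveSkipBlock(in_ch=64, skip_ch=32, mid_ch=64, out_ch=128)
    x = torch.randn(1, 64, 56, 56)
    skip = torch.randn(1, 32, 56, 56)
    print(block(x, skip).shape)  # torch.Size([1, 128, 56, 56])
```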
Another recent paper proposed the VoVNet architecture, which shares some design similarities with our architecture. However, as shown in the above table, our architecture is significantly faster than both VoVNet-39 and ResNet-50 for larger batch sizes as well as larger image sizes.
This repo provides the model definition in PyTorch, trained weights for ImageNet, and code for evaluating the forward pass time and the accuracy of the trained models on the ImageNet validation set. In the paper, the model is used for the task of human pose estimation, but it can also be applied to a myriad of other problems as a drop-in replacement for ResNet-50.
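A minimal sketch of loading a pretrained model for inference is shown below. Note that the module path, class name, and constructor arguments are assumptions made for illustration; check the model definition used by the evaluation scripts (the `selecsls` model class) for the actual API.

```python
import torch

# NOTE: the import path, class name, and constructor arguments below are assumptions
# for illustration; consult the model definition in this repo for the actual API.
from models import selecsls  # hypothetical import path

model = selecsls.Net(config='SelecSLS60')  # hypothetical constructor
state_dict = torch.load('./weights/SelecSLS60_statedict.pth', map_location='cpu')
model.load_state_dict(state_dict)
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # 1000-way ImageNet logits
print(logits.shape)
```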
# Download the pretrained SelecSLS60 weights
wget http://gvv.mpi-inf.mpg.de/projects/XNectDemoV2/content/SelecSLS60_statedict.pth -O ./weights/SelecSLS60_statedict.pth
# Evaluate the forward pass time
python evaluate_timing.py --num_iter 100 --model_class selecsls --model_config SelecSLS60 --model_weights ./weights/SelecSLS60_statedict.pth --input_size 512 --gpu_id <id>
# Evaluate accuracy on the ImageNet validation set
python evaluate_imagenet.py --model_class selecsls --model_config SelecSLS60 --model_weights ./weights/SelecSLS60_statedict.pth --gpu_id <id> --imagenet_base_path <path_to_imagenet_dataset>
# Prune and fuse the model, then evaluate the pruned model (shown here with SelecSLS84; other pretrained models work as well)
python evaluate_timing.py --num_iter 100 --model_class selecsls --model_config SelecSLS84 --model_weights ./weights/SelecSLS84_statedict.pth --input_size 512 --pruned_and_fused True --gamma_thresh 0.001 --gpu_id <id>
python evaluate_imagenet.py --model_class selecsls --model_config SelecSLS84 --model_weights ./weights/SelecSLS84_statedict.pth --pruned_and_fused True --gamma_thresh 0.001 --gpu_id <id> --imagenet_base_path <path_to_imagenet_dataset>
- Python 3.5
- PyTorch >= 1.1
The contents of this repository and the pretrained models are made available under CC BY 4.0. Please read the license terms.
If you use the model or the implicit sparsity based pruning in your work, please cite:
@article{XNect_SIGGRAPH2020,
author = {Mehta, Dushyant and Sotnychenko, Oleksandr and Mueller, Franziska and Xu, Weipeng and Elgharib, Mohamed and Fua, Pascal and Seidel, Hans-Peter and Rhodin, Helge and Pons-Moll, Gerard and Theobalt, Christian},
title = {{XNect}: Real-time Multi-Person {3D} Motion Capture with a Single {RGB} Camera},
journal = {ACM Transactions on Graphics},
url = {http://gvv.mpi-inf.mpg.de/projects/XNect/},
numpages = {17},
volume = {39},
number = {4},
month = {July},
year = {2020},
doi = {10.1145/3386569.3392410}
}
@InProceedings{Mehta_2019_CVPR,
author = {Mehta, Dushyant and Kim, Kwang In and Theobalt, Christian},
title = {On Implicit Filter Level Sparsity in Convolutional Neural Networks},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2019}
}