BatchNorm Pruning: Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers
May 2020
tl;dr: Similar idea to Network Slimming but with more details.
Two questions to answer:
- Can we set weights below a threshold to zero? If so, under what constraints?
- Can we use a single global threshold across different layers?
Many previous works rely on norm-based pruning, which lacks a solid theoretical foundation. One cannot assign different Lasso regularization weights to different layers, because model reparameterization can shrink the Lasso loss without changing the network's function. In addition, in the presence of BN, any linear scaling of W does not change the results (see the sketch below).
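A minimal sketch of this scale invariance, assuming a standard PyTorch Conv + BN pair; the layer sizes and the scaling factor `alpha` are arbitrary choices for illustration, not from the paper:

```python
# Scaling a conv layer's weights by alpha leaves the post-BN output unchanged
# when batch statistics are used, so a weight-norm criterion alone is not meaningful.
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, bias=False)
bn = nn.BatchNorm2d(8)
bn.train()  # use batch statistics, as during training

x = torch.randn(4, 3, 16, 16)
y_ref = bn(conv(x))

alpha = 10.0
with torch.no_grad():
    conv.weight.mul_(alpha)  # linear scaling of W
y_scaled = bn(conv(x))

# BN normalizes out the scaling: mean and std of the pre-BN activations
# are both scaled by alpha, so the normalized output is identical.
print(torch.allclose(y_ref, y_scaled, atol=1e-4))  # True
```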
- This paper (together with the concurrent work Network Slimming) focuses on sparsifying the gamma values in BN layers.
- gamma acts on a normalized random variable and is therefore comparable across layers.
- The impact of gamma is independent across different layers.
- A regularization term based on the L1 norm of gamma is introduced, scaled by a per-layer factor $\lambda$. The global weight of the regularization term is $\rho$.
- ISTA (Iterative Shrinkage-Thresholding Algorithm) works better than gradient descent for optimizing this sparsity term (see the sketch after this list).
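A hedged sketch of these two ingredients in PyTorch. The helper names (`gamma_l1_penalty`, `soft_threshold`, `ista_step_on_gammas`) and the `per_layer_lambda` dict are illustrative assumptions, not the paper's code; the key point is that ISTA replaces a subgradient step on the L1 term with a proximal (soft-thresholding) step on the BN gammas.

```python
import torch
import torch.nn as nn

def gamma_l1_penalty(model, per_layer_lambda, rho):
    """rho * sum_l lambda_l * ||gamma_l||_1 over all BN layers
    (the term one would add to the loss for a plain gradient-descent baseline)."""
    penalty = 0.0
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            lam = per_layer_lambda.get(name, 1.0)
            penalty = penalty + lam * m.weight.abs().sum()
    return rho * penalty

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1: shrink toward zero, clamp at zero.
    return torch.sign(z) * torch.clamp(z.abs() - tau, min=0.0)

@torch.no_grad()
def ista_step_on_gammas(model, per_layer_lambda, rho, lr):
    """After the usual optimizer step on the data loss, apply the L1 proximal
    (soft-threshold) to each BN gamma instead of a subgradient step."""
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            lam = per_layer_lambda.get(name, 1.0)
            m.weight.copy_(soft_threshold(m.weight, lr * rho * lam))
```

A typical training loop would take an SGD step on the data loss alone and then call `ista_step_on_gammas`, so each gamma is shrunk toward zero and set exactly to zero once it falls below `lr * rho * lambda_l`; channels with zero gamma can then be pruned.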