The functional design in our original `pruning.py`, e.g. with `optimal_brain_damage()`, is no longer up to date.
PyTorch has a very basic implementation of pruning methods of its own (see e.g. `torch.nn.utils.prune`), and while its design allows for diverse pruning methods based on what it calls an "importance score" (in the literature usually saliency), the focus seems to be mostly on structured/unstructured and random/magnitude-based pruning.
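For reference, a minimal sketch of how the built-in `torch.nn.utils.prune` utilities are used (magnitude-based and random unstructured pruning; the toy model is just for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Unstructured magnitude-based pruning: mask the 10% of weights with the
# smallest L1 norm in the first Linear layer. This attaches a `weight_mask`
# buffer and a `weight_orig` parameter and recomputes `weight` on the fly.
prune.l1_unstructured(model[0], name="weight", amount=0.1)

# Random unstructured pruning of 10% of the weights in the last layer.
prune.random_unstructured(model[2], name="weight", amount=0.1)

print(dict(model[0].named_buffers()).keys())  # contains 'weight_mask'
```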
The saliency (importance score) defines, per pruning step, which parameters / structural elements should be masked out (i.e. temporarily or permanently removed). Most often the saliency is simply sampled at random, e.g. with a rate of 10%, or it is based on the magnitude of the underlying parameter. But it can also be assigned based on the change in loss (the optimal brain damage method) or even the Hessian (second derivative; the optimal brain surgeon paper). Further, Han et al. (2015) showed that even plain magnitude-based pruning can be computed differently depending on whether the thresholds are computed over a whole module or only a single layer -- which changes not only the saliency but also what we mean by "prune 10% based on saliency xyz". The sketch below illustrates the layer-wise vs. global distinction.
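A rough sketch of that distinction, using plain magnitude saliency (the helper names are only illustrative, not an existing API):

```python
import torch
import torch.nn as nn

def magnitude_saliency(param: torch.Tensor) -> torch.Tensor:
    # Simplest saliency: absolute weight magnitude.
    return param.detach().abs()

def layerwise_masks(model: nn.Module, amount: float = 0.1) -> dict:
    # Threshold computed per layer: every layer loses `amount` of its weights.
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:  # skip biases, norm parameters, etc.
            continue
        saliency = magnitude_saliency(param)
        k = int(amount * saliency.numel())
        threshold = saliency.flatten().kthvalue(k).values if k > 0 else saliency.min() - 1
        masks[name] = (saliency > threshold).float()
    return masks

def global_masks(model: nn.Module, amount: float = 0.1) -> dict:
    # Threshold computed over the whole model: layers may lose very
    # different fractions, even though the overall rate is still `amount`.
    saliencies = {n: magnitude_saliency(p) for n, p in model.named_parameters() if p.dim() >= 2}
    all_scores = torch.cat([s.flatten() for s in saliencies.values()])
    k = int(amount * all_scores.numel())
    threshold = all_scores.kthvalue(k).values if k > 0 else all_scores.min() - 1
    return {n: (s > threshold).float() for n, s in saliencies.items()}
```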
PyTorch Ignite has a good engine design for decoupling the model definition from the configuration of a training scheme. The advantage is that a model can be developed/designed independently and the training engine can be wrapped around it without modifying the model class. This could be a good orientation for an additional pruning engine that works in conjunction with a training engine to form a training-pruning pipeline.
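A minimal sketch of what that coupling could look like with Ignite events (the pruning handler and its schedule are hypothetical, only the `Engine`/`Events` API is Ignite's):

```python
import torch
import torch.nn as nn
from ignite.engine import Engine, Events

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

def train_step(engine, batch):
    x, y = batch
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

trainer = Engine(train_step)

@trainer.on(Events.EPOCH_COMPLETED)
def prune_step(engine):
    # Hypothetical pruning step: mask the 10% smallest-magnitude weights
    # after every epoch, without touching the model class itself.
    with torch.no_grad():
        w = model.weight
        k = max(1, int(0.1 * w.numel()))
        threshold = w.abs().flatten().kthvalue(k).values
        w.mul_((w.abs() > threshold).float())
```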
It would also be good to consider a saliency measure and an independent mask. The mask carries the actually pruned structure and has linear memory cost w.r.t. the model parameters, since we simply double their number (and could reduce this further by only carrying masks for tensors that are actually pruned). The saliency carries different information: it is more like the importance score in `torch.nn.utils.prune` and could, for example, be based on the change of the gradient per parameter. If mask and saliency are carried separately, the design would allow resetting the model to its original structure or quickly extracting graphs/masks per step. That might be especially interesting for Lottery Ticket experiments, where the model is reset to a different initialization while keeping the obtained pruned/masked structure.
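A minimal sketch of such a separation (class and method names are purely illustrative):

```python
import copy
import torch
import torch.nn as nn

class PruningState:
    """Keeps masks and saliencies outside the model, so the original
    initialization can be restored while the masked structure is kept,
    as in Lottery Ticket experiments."""

    def __init__(self, model: nn.Module):
        self.initial_state = copy.deepcopy(model.state_dict())
        self.masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
        self.saliency = {}  # e.g. per-parameter gradient-change scores

    def apply_masks(self, model: nn.Module):
        with torch.no_grad():
            for name, param in model.named_parameters():
                param.mul_(self.masks[name])

    def reset_to_initial(self, model: nn.Module):
        # Restore the original initialization but keep the pruned structure.
        model.load_state_dict(self.initial_state)
        self.apply_masks(model)
```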