This repo:
(1) covers the implementation of the following MLKD paper:
Multi-level Knowledge Distillation via Knowledge Alignment and Correlation
(2) benchmarks 12 state-of-the-art knowledge distillation methods in PyTorch, including:
(KD) - Distilling the Knowledge in a Neural Network
(FitNet) - Fitnets: hints for thin deep nets
(AT) - Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer
(SP) - Similarity-Preserving Knowledge Distillation
(CC) - Correlation Congruence for Knowledge Distillation
(VID) - Variational Information Distillation for Knowledge Transfer
(RKD) - Relational Knowledge Distillation
(PKT) - Probabilistic Knowledge Transfer for deep representation learning
(AB) - Knowledge Transfer via Distillation of Activation Boundaries Formed by Hidden Neurons
(FT) - Paraphrasing Complex Network: Network Compression via Factor Transfer
(FSP) - A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning
(NST) - Like what you like: knowledge distill via neuron selectivity transfer
This repo was tested with Python 3.6.10 and PyTorch 1.6.0. It should also run with other PyTorch versions >= 1.0.0.
- Fetch the pretrained teacher models by running:
  sh scripts/fetch_pretrained_teachers.sh
  which will download and save the models to save/models.
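For reference, here is a minimal Python sketch of loading one of the fetched teacher checkpoints. The model_dict registry in models/__init__.py, the constructor signature, and the 'model' key inside the checkpoint follow the CRD/RepDistiller convention and are assumptions here, not a guaranteed description of this repo's API.

# Hedged sketch: load a fetched teacher checkpoint (CIFAR-100 example).
import torch
from models import model_dict  # assumed name of the model registry in models/__init__.py

ckpt_path = './save/models/resnet32x4_vanilla/ckpt_epoch_240.pth'
teacher = model_dict['resnet32x4'](num_classes=100)   # assumed constructor signature
checkpoint = torch.load(ckpt_path, map_location='cpu')
teacher.load_state_dict(checkpoint['model'])          # 'model' key is an assumption
teacher.eval()                                        # the teacher stays frozen during distillation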
- Run distillation by following the commands in scripts/run_cifar_distill.sh. An example of running Geoffrey Hinton's original Knowledge Distillation (KD):
  python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill kd --model_s resnet8x4 -r 0.1 -a 0.9 -b 0 --trial 1
where the flags are explained as:
  --path_t: specify the path of the teacher model
  --model_s: specify the student model; see models/__init__.py for the available model types
  --distill: specify the distillation method
  -r: the weight of the cross-entropy loss between logit and ground truth, default: 1
  -a: the weight of the KD loss, default: None
  -b: the weight of other distillation losses, default: None
  --trial: specify the experiment id to differentiate between multiple runs
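To make the role of -r, -a and -b concrete, here is a minimal sketch of how the three weights typically combine the losses for vanilla KD. The temperature T, the batchmean reduction, and the function name are assumptions; see the repo's loss implementation for the authoritative version.

import torch.nn.functional as F

def total_loss(logits_s, logits_t, target, r=0.1, a=0.9, b=0.0, other_loss=0.0, T=4.0):
    # Cross-entropy with the ground-truth labels, weighted by -r.
    loss_ce = F.cross_entropy(logits_s, target)
    # Hinton-style KD term: KL divergence between softened student/teacher logits, weighted by -a.
    loss_kd = F.kl_div(F.log_softmax(logits_s / T, dim=1),
                       F.softmax(logits_t / T, dim=1),
                       reduction='batchmean') * (T * T)
    # Any method-specific distillation loss is weighted by -b (0 for plain KD).
    return r * loss_ce + a * loss_kd + b * other_loss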
- Run MLKD on CIFAR-100 and ImageNet by:
  CUDA_VISIBLE_DEVICES=0 sh scripts/run_cifar_mlkd.sh
  CUDA_VISIBLE_DEVICES=9,10,11,12,13,14,15 sh scripts/run_imagenet_mlkd.sh
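Note that CUDA_VISIBLE_DEVICES only controls which physical GPUs PyTorch can see. The sketch below shows one common way a training script uses several visible GPUs; whether this repo relies on nn.DataParallel or DistributedDataParallel is an assumption, and the placeholder model is purely illustrative.

import torch
import torch.nn as nn

model_s = nn.Linear(32, 100)   # placeholder student; the real scripts build models from models/__init__.py
if torch.cuda.is_available():
    model_s = model_s.cuda()
    if torch.cuda.device_count() > 1:
        model_s = nn.DataParallel(model_s)   # each mini-batch is split across the visible GPUs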
If you find this repo useful for your research, please consider citing the paper:
@misc{ding2021multilevel,
  title={Multi-level Knowledge Distillation via Knowledge Alignment and Correlation},
  author={Fei Ding and Yin Yang and Hongxin Hu and Venkat Krovi and Feng Luo},
  year={2021},
  eprint={2012.00573},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
For any questions, please contact Fei Ding.
Most code is borrowed from CRD. Please also find the pretrained teacher models in the following repos: