
Neural Networks Compression

Python Linting · Code style: black · Checked with mypy · Imports: isort

Implementation of neural network compression from Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding and Learning both Weights and Connections for Efficient Neural Networks.

Quick presentation.

Detailed report.

Table of Contents

  • Abstract
  • Pruning
  • Quantization
  • Experiments
  • References

Abstract

Neural networks are both computationally and memory intensive, which makes them difficult to deploy on embedded systems with limited hardware resources.

The aim of this project is to compress a neural network with pruning and quantization without accuracy degradation.

The experiments are executed on the MNIST classification problem, with the following neural networks: LeNet300-100 and LeNet5.

Pruning

The pruning phase consists of three steps:

  • train connectivity: train the network as usual

  • prune connections: remove all the weights below a threshold

  • train the weights: re-train the pruned network and repeat from step 2

The three-step pruning pipeline: train, prune, retrain

It is important to note that:

  • the first step is conceptually different from the way a neural network is normally trained, because here we are interested in finding the important connections rather than the final weight values
  • retraining the pruned network is necessary because, after removing some connections, the accuracy inevitably drops
  • pruning works under the hypothesis that the network is over-parametrized, so it not only reduces the memory footprint but can also lower the risk of overfitting

The regularization term used in the loss function tends to lower the magnitude of the weight matrices, so that more weights close to zero become good candidates for pruning.

For my experiments, I used L2 regularization.

The threshold value for pruning is obtained as a quality parameter multiplied by the standard deviation of the layer's weights. This choice is justified by the fact that, as observed in my experiments, the weights of a dense or convolutional layer are distributed as a zero-mean Gaussian, so the weights lying within one standard deviation of zero account for about 68% of the total.

Weight distribution for a dense layer of LeNet300-100 before pruning
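As an illustration, here is a minimal NumPy sketch of the thresholding step; the names pruning_mask and quality are mine and not necessarily the repository's.

import numpy as np

def pruning_mask(weights: np.ndarray, quality: float) -> np.ndarray:
    """Binary mask keeping only the weights whose magnitude exceeds quality * std."""
    threshold = quality * np.std(weights)
    return (np.abs(weights) > threshold).astype(weights.dtype)

# Toy example: a zero-mean Gaussian dense layer pruned with quality = 1.0,
# i.e. the ~68% of weights lying within one standard deviation are removed.
rng = np.random.default_rng(0)
dense_weights = rng.normal(0.0, 0.1, size=(300, 100))
mask = pruning_mask(dense_weights, quality=1.0)
pruned_weights = dense_weights * mask
print("kept fraction:", mask.mean())  # roughly 0.32
# In the full pipeline the masked network is re-trained and pruning is repeated (steps 2-3 above).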

Quantization

After pruning, the network is further compressed by reducing the number of bits used to represent each weight.

In particular, I applied k-means to the weights of each layer to cluster them into representative centroids.

If we want to quantize the weights with n bits, we can use up to 2^n centroids: more bits mean better precision, but also a larger memory footprint.
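For instance, with n = 2 bits there are at most 2^2 = 4 shared values per layer. The sketch below shows the basic weight-sharing idea in NumPy, using a simple linear grid of centroids for illustration; in the actual method only the surviving non-zero weights are clustered, and the names here are illustrative rather than the repository's API.

import numpy as np

def quantize(weights: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Replace each weight with the index of its nearest centroid."""
    distances = np.abs(weights[..., None] - centroids)
    return distances.argmin(axis=-1).astype(np.uint8)  # small integer codes instead of 32-bit floats

def dequantize(codes: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Rebuild an approximate weight matrix from the codes and the shared codebook."""
    return centroids[codes]

bits = 2
weights = np.random.default_rng(0).normal(0.0, 0.1, size=(4, 4))
centroids = np.linspace(weights.min(), weights.max(), 2 ** bits)  # at most 2**bits centroids
codes = quantize(weights, centroids)
approx_weights = dequantize(codes, centroids)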

There are three ways to initialize the centroids:

  • forgy: random choice among the weights

  • density based: consider the cumulative distribution function (CDF) of the weights and take the x-values at fixed, equally spaced y-values

  • linear: place the centroids at equally spaced intervals between the minimum and maximum weight

To fully differentiate the initialization methods, it is important to note that, after pruning, the weights of a single layer follow a bimodal distribution.

Weight distribution after pruning for a dense layer of LeNet300-100

Cumulative weight distribution for a dense layer of LeNet300-100

This means that:

  1. forgy and density-based initialization will place the centroids around the two peaks, because that is where the weights are concentrated and the CDF varies the most.

    The drawback is that very few centroids have a large absolute value, which results in a poor representation of the few large weights.

  2. linear initialization spaces the centroids equally across the whole range of weights, so it does not suffer from this problem.
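A rough NumPy sketch of the three initialization strategies, applied to the non-zero weights of a pruned layer; the function names are illustrative and not taken from the repository.

import numpy as np

def forgy_init(weights: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Forgy: pick k centroids at random among the observed weights."""
    rng = np.random.default_rng(seed)
    return rng.choice(weights, size=k, replace=False)

def density_init(weights: np.ndarray, k: int) -> np.ndarray:
    """Density-based: x-values of the empirical CDF at k equally spaced y-values."""
    sorted_weights = np.sort(weights)
    cdf_levels = np.linspace(0.0, 1.0, k + 2)[1:-1]  # skip the 0 and 1 extremes
    positions = (cdf_levels * (len(sorted_weights) - 1)).round().astype(int)
    return sorted_weights[positions]

def linear_init(weights: np.ndarray, k: int) -> np.ndarray:
    """Linear: k equally spaced values between the minimum and maximum weight."""
    return np.linspace(weights.min(), weights.max(), k)

# Usage on a (hypothetical) pruned weight matrix:
# centroids = linear_init(pruned_weights[pruned_weights != 0], k=2 ** bits)

Only linear initialization is guaranteed to place some centroids near the extreme weights, which is consistent with the observation above.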

Experiments

To run the experiments, first install and configure poetry:

curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
poetry config virtualenvs.in-project true

Then clone the repository:

git clone git@github.com:angelocatalani/neural-network-compression.git

and change directory:

cd neural-network-compression

Then, install dependencies:

poetry install

The file main.py contains some experiments with LeNet300100, multiple threshold values, and different k-means initialization modes.

The following code generates the experiment results in the folder: neural_network_compression/LeNet300100_2BitsDensityQuantization/

if __name__ == "__main__":
    run_experiment_with_lenet300100(
        train_epochs=2,
        prune_train_epochs=2,
        semi_prune_train_epochs=2,
        maximum_centroid_bits=2,
        k_means_initialization_mode="density",
        with_cumulative_weight_distribution=True,
        experiment_name="2BitsDensityQuantization",
    )

To run the experiments:

poetry run python neural_network_compression/main.py

TODO: add experiments with LeNet5

References

  • Song Han, Huizi Mao, William J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR 2016.

  • Song Han, Jeff Pool, John Tran, William J. Dally. Learning both Weights and Connections for Efficient Neural Networks. NIPS 2015.
