Distily

In one command, distill an existing LLM into a smaller or different architecture.

Install

pip install -U "git+https://github.com/lapp0/distily.git#egg=distily[full]"

Features

Distily allows you to distill a model with

Quantized weights: e.g. TriLM, bitnet
Distinct architecture: State-Space models such as Mamba, Mixture-of-Experts (MoE)
Modified architecture: Decrease (or increase) the
- number of layers
- width and depth of attention heads and dense layers
- number of attention and KV heads

Usage

Minimal Example: distily_gpt2

Command to create a distilled gpt2 with only 6 layers:

python3 -m distily.run \
    --teacher_model_name_or_path gpt2 \
    --output_dir distily_gpt2 \
    --hub_model_id "distily/distily_gpt2" \
    --push_to_hub True \
    --student_model_config '{"n_layers": 6}' \
    --student_model_as_bitnet True

The resulting distily_gpt2 model has (TODO: explain metrics).

For more examples, review the Examples documentation.
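The minimal example above can also be assembled from Python, which avoids shell quoting problems with the JSON passed to --student_model_config. The sketch below only builds the argument list. The helper name build_distily_command is hypothetical (it is not part of Distily), and the flags are copied from the command above rather than from a full list of supported options.

```python
import json
import sys


def build_distily_command(teacher, output_dir, student_config, as_bitnet=False):
    """Build an argv list for `python -m distily.run`.

    Hypothetical helper: only flags shown in the minimal example are used.
    """
    return [
        sys.executable, "-m", "distily.run",
        "--teacher_model_name_or_path", teacher,
        "--output_dir", output_dir,
        # json.dumps yields valid JSON; passing it as one argv entry
        # means no shell is involved and no quoting is needed.
        "--student_model_config", json.dumps(student_config),
        "--student_model_as_bitnet", str(as_bitnet),
    ]


if __name__ == "__main__":
    cmd = build_distily_command("gpt2", "distily_gpt2", {"n_layers": 6}, as_bitnet=True)
    print(cmd)
    # To actually launch the run: subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run without shell=True hands each entry to the process unchanged, so the student config arrives as a single JSON string.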

Note on Hub Credentials

To push to the hub, you must first save your hub token:

HF_WRITE=<your hub token> python3 -c "import os; from huggingface_hub.hf_api import HfFolder; HfFolder.save_token(os.environ['HF_WRITE'])"

Roadmap

Improved performance / sampling efficiency:

Standard knowledge distillation using logits.
Distill using intermediate features including hidden states and attentions.
Implement Value-Transfer (simply a distillation loss on v of q, k, v)
Improve sampling efficiency through synthetic data generation.
Implement cross-entropy classification loss (traditional LLM loss function)
Apply projector to logits (https://arxiv.org/pdf/2310.17183)
Apply "teacher recording": run teacher inference once, then reuse the recorded features dataset any number of times.
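To make "standard knowledge distillation using logits" concrete, here is a minimal sketch of the common temperature-scaled KL-divergence loss between teacher and student token distributions. It is plain Python for a single position, intended only to illustrate the math. The function names and the temperature default are assumptions, and Distily's own loss may differ.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over one vocabulary distribution.

    Scaled by temperature**2, as is conventional, so gradient magnitudes
    stay comparable across temperatures.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    print(kl_distillation_loss(teacher, teacher))          # matching logits
    print(kl_distillation_loss(teacher, [0.0, 0.0, 0.0]))  # uniform student
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In training, it would be averaged over all positions in a batch.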

Distill to a different model shape / size:

Distill to model with fewer num_hidden_layers by implementing layer mappers.
Distill to a model with modified module dimensions and behaviors (e.g., intermediate_size, hidden_act) by employing projectors.
Distill to a model with modified num_attention_heads and num_key_value_heads.
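A layer mapper decides which teacher layer each student layer is compared against when the student has fewer layers. The sketch below shows one simple possibility, an evenly spaced mapping that always pairs the last student layer with the last teacher layer. The function name uniform_layer_map is hypothetical, and this strategy is an illustrative assumption rather than Distily's implementation.

```python
def uniform_layer_map(num_teacher_layers, num_student_layers):
    """Map each student layer index to an evenly spaced teacher layer index.

    Student layer i is paired with the teacher layer that ends the i-th of
    num_student_layers equal blocks of teacher layers.
    """
    if not 0 < num_student_layers <= num_teacher_layers:
        raise ValueError("student must have between 1 and num_teacher_layers layers")
    return [
        (i + 1) * num_teacher_layers // num_student_layers - 1
        for i in range(num_student_layers)
    ]


if __name__ == "__main__":
    # 12-layer teacher (e.g. gpt2) distilled into a 6-layer student
    print(uniform_layer_map(12, 6))
```

With a mapping like this, a hidden-state or attention loss would compare student layer i with teacher layer mapping[i]. Other strategies (first-k layers, last-k layers, learned mappings) are equally plausible.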

Distill to a different architecture:

Distill to Bitnet (b1.58)
Distill to State-Space / Mamba
Distill to MoE
Distill to Parameter Sharing (ALBERT-style) Model
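For context on the Bitnet (b1.58) item above: b1.58 models constrain weights to the ternary set {-1, 0, 1}. The sketch below illustrates the absmean quantization scheme described in the BitNet b1.58 paper, applied to a flat list of weights. The function name and the eps default are assumptions, and this is not taken from Distily's code.

```python
def absmean_ternary_quantize(weights, eps=1e-8):
    """Quantize weights to {-1, 0, 1} using absmean scaling.

    Returns (quantized_weights, scale); quantized_weights[i] * scale is the
    dequantized approximation of weights[i].
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    # Scale by the mean absolute value so typical weights land near +/-1.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


if __name__ == "__main__":
    q, scale = absmean_ternary_quantize([0.9, -0.05, -1.2, 0.3])
    print(q, scale)
```

In quantization-aware training, the full-precision weights are usually kept for the optimizer and a quantization step like this is applied in the forward pass. Whether Distily does the same is not stated here.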
