Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
- NVIDIA GPU with Ampere architecture or newer
- CUDA 12.6
- Python 3.12
- CMake 3.22.1 or above
- PyTorch 2.7.1
- Triton 3.3.1
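Before installing, it can help to confirm the interpreter matches these pins. A minimal sketch (checking only the Python version; extend the same pattern to the other tools as needed):

```bash
# minimal sanity check: compare the active Python against the pinned 3.12
need="3.12"
have=$(python -c 'import sys; print(f"{sys.version_info[0]}.{sys.version_info[1]}")')
if [ "$have" = "$need" ]; then
  echo "python ok ($have)"
else
  echo "warning: python $have found, $need expected"
fi
```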
- To clone the repository with all submodules:

```bash
git clone --recurse-submodules git@github.com:enyac-group/UniQL.git
cd UniQL
# or
git clone git@github.com:enyac-group/UniQL.git
cd UniQL
git submodule update --init --recursive
```

- Run in Docker (optional)

Because our implementation includes customized CUDA kernels and depends on a specific CUDA version, you may optionally run our code in Docker. To build the Docker image, run:
```bash
cd docker
./build_docker.sh
```

After building the Docker image, run the container with:

```bash
./run.sh
```

- Create the UniQL conda environment:
```bash
cd UniQL
conda create -n UniQL python=3.12
conda activate UniQL
pip install -r requirements.txt
```

- Install lm-evaluation-harness:

```bash
pip install 3rdparty/lm-evaluation-harness
```

- Install fast-hadamard-transform:

```bash
# set force build to include 12N, 40N from the newer commit
export FAST_HADAMARD_TRANSFORM_FORCE_BUILD=TRUE
pip install 3rdparty/fast-hadamard-transform --no-build-isolation
```

- Install peft from our commit:

```bash
# we fix peft for mamba blocks
pip install 3rdparty/peft --no-build-isolation
```

- (Optional) Build causal_conv1d from source (or install from the prebuilt wheels):

```bash
# build from the local clone
pip install 3rdparty/causal-conv1d --no-build-isolation
```

- (Optional) Build mamba_ssm from source (or install from the prebuilt wheels):

```bash
# build from the local clone
export MAMBA_FORCE_BUILD=TRUE
pip install 3rdparty/mamba --no-build-isolation
```

- (Optional) Build flash_attn from source (or install from the prebuilt wheels):

```bash
# build from the local clone
cd 3rdparty/flash-attention
MAX_JOBS=16 python setup.py install
```

- Install UniQL in editable mode:

```bash
pip install -e . --no-build-isolation
```
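After installation, a quick smoke test can confirm the extensions import cleanly. The module names below are the standard import names of the packages installed above; this loop is a sketch, not a repo script, and a "missing" line only matters if you skipped that optional build on purpose:

```bash
# report which of the extensions are importable
for m in fast_hadamard_transform peft causal_conv1d mamba_ssm flash_attn; do
  python -c "import $m" 2>/dev/null && echo "$m: ok" || echo "$m: missing"
done
```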
The quantized models are available on Hugging Face 🤗.
- Qwen/Qwen2.5-7B
- Qwen/Qwen2.5-7B-Instruct
- meta-llama/Llama-3.1-8B
- meta-llama/Llama-3.1-8B-Instruct
- meta-llama/Llama-2-7b-hf
- ibm-ai-platform/Bamba-9B-v2
- nvidia/Nemotron-H-8B-Base-8K
- mamba2-8b (converted from nvidia/mamba2-8b-3t-4k)
```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/run_structured_sorting.sh meta-llama/Llama-3.1-8B
```

For this example, the sorted model will be stored at `pretrained_models/ut-enyac/Llama-3.1-8B-uniql-1.0`. The suffix `1.0` means we do not prune the weights and only sort them. Please see `compress/compress_models.py` for more details.
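To make the ratio semantics concrete, here is a hypothetical illustration (4096 is just an example channel count, not a repo constant): a keep-ratio of 1.0 keeps every channel, so the pass only reorders weights, while smaller ratios would drop the lowest-ranked tail.

```bash
channels=4096
for ratio in 1.0 0.85; do
  # number of channels retained at this keep-ratio
  keep=$(python -c "print(int($channels * $ratio))")
  echo "ratio $ratio -> keep $keep of $channels channels"
done
```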
```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/run_masked_finetuning.sh ut-enyac/Llama-3.1-8B-uniql-1.0
```

If the pruning ratios are not found under `compress/outputs/{model name}`, the script will run `compress/get_layer_ratios.py` first. For example, the layerwise pruning ratios will be stored at `compress/outputs/llama-3.1-8b-uniql-1.0/`.

The fine-tuned model will be stored at `pretrained_models/ut-enyac/Llama-3.1-8B-uniql-1.0-masked-lora-rft`.
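If you want to inspect a ratio file, the sketch below assumes it is a JSON mapping of layer names to per-layer keep ratios; the key names and structure here are illustrative, not the repo's actual schema:

```bash
# build a toy ratio file and print its mean keep ratio
cat > /tmp/toy_ratios.json <<'EOF'
{"model.layers.0": 0.85, "model.layers.1": 0.90}
EOF
python -c "
import json
r = json.load(open('/tmp/toy_ratios.json'))
print('mean keep ratio:', sum(r.values()) / len(r))
"
```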
```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/run_quantization.sh ut-enyac/Llama-3.1-8B-uniql-1.0-masked-lora-rft
```

For this example, the quantized model will be stored at `pretrained_models/ut-enyac/Llama-3.1-8B-uniql-1.0-masked-lora-rft-w4a16`.
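For intuition about the `w4a16` tag (in the common notation, 4-bit weights with 16-bit activations), a back-of-envelope memory estimate; the parameter count below is a rough illustration, not the exact model size:

```bash
# weight memory for ~8B parameters: 16-bit vs 4-bit storage
params=8000000000
fp16_gb=$(python -c "print(round($params * 2 / 1e9, 1))")     # 2 bytes/param
w4_gb=$(python -c "print(round($params * 0.5 / 1e9, 1))")     # 0.5 bytes/param
echo "fp16 weights: ${fp16_gb} GB, 4-bit weights: ${w4_gb} GB (plus scale/zero-point overhead)"
```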
```bash
# standard zero-shot
CUDA_VISIBLE_DEVICES=0 python main.py ut-enyac/Llama-3.1-8B-uniql-1.0-masked-lora-rft-w4a16 --batch_size 16 --eval --fewshot 0 --task_list hellaswag,arc_easy,arc_challenge,piqa,winogrande --pretrained_dir ./pretrained_models/ --layer_ratio_config ./compress/outputs/llama-3.1-8b-uniql-1.0/layerwise_eps-0.1_ratio-0.85.json

# we use batch size 8 to save memory
CUDA_VISIBLE_DEVICES=0 python main.py ut-enyac/Llama-3.1-8B-uniql-1.0-masked-lora-rft-w4a16 --batch_size 8 --eval --fewshot 5 --task_list mmlu --pretrained_dir ./pretrained_models/ --layer_ratio_config ./compress/outputs/llama-3.1-8b-uniql-1.0/layerwise_eps-0.1_ratio-0.85.json
```

- To profile model size, use `--size`:
```bash
python profile_model.py ut-enyac/Llama-3.1-8B-uniql-1.0-masked-lora-rft-w4a16 --size --batch_size 1 --pretrained_dir pretrained_models/
```

- To profile prefilling latency (i.e., time-to-first-token), use `--ttft`:
```bash
# ttft does not support --cache_graph
python profile_model.py ut-enyac/Llama-3.1-8B-uniql-1.0-masked-lora-rft-w4a16 --ttft --batch_size 1 --prompt_len 1024 --pretrained_dir pretrained_models/
```

- To profile generation latency (i.e., decoding-phase time-per-output-token), use `--tpot`:
```bash
python profile_model.py ut-enyac/Llama-3.1-8B-uniql-1.0-masked-lora-rft-w4a16 --tpot --batch_size 1 --cache_graph --prompt_len 1024 --pretrained_dir pretrained_models/
```

- To profile end-to-end latency (i.e., prefilling + decoding, time-to-last-token), use `--ttlt`:
```bash
# profiling ttlt may take longer, so we report the average latency over 20 runs
python profile_model.py ut-enyac/Llama-3.1-8B-uniql-1.0-masked-lora-rft-w4a16 --ttlt --batch_size 1 --cache_graph --prompt_len 1024 --gen_len 1024 --repeats 20 --pretrained_dir pretrained_models/
```

Please use `--help` to see more profiling configurations.
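`--repeats 20` averages multiple timed runs; conceptually it reduces run-to-run noise like this (the numbers below are made up, purely for illustration):

```bash
# averaging repeated latency measurements, as --repeats does conceptually
python -c "
lat = [12.1, 11.9, 12.3]  # milliseconds (made-up measurements)
print(f'avg latency: {sum(lat)/len(lat):.2f} ms')
"
```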
- Download the checkpoint using huggingface-cli:

```bash
huggingface-cli download nvidia/mamba2-8b-3t-4k --local-dir ./pretrained_models/mamba2-8b-3t-4k
```

After downloading, the directory `./pretrained_models/mamba2-8b-3t-4k` will have a structure like this:

```
├── latest_checkpointed_iteration.txt
├── mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model  (this is the tokenizer)
├── README.md
└── release
    └── mp_rank_00
        └── model_optim_rng.pt  (these are the weights)
```

- Run the conversion script to get the model directory:
```bash
python convert_mamba2_8b_to_hf.py \
    ./pretrained_models/mamba2-8b-3t-4k/release/mp_rank_00/model_optim_rng.pt \
    ./pretrained_models/mamba2-8b-3t-4k/mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model \
    --model_save_path ./pretrained_models/ut-enyac/mamba2-8b-converted
```

```bibtex
@inproceedings{chiang2026uniql,
  title     = {UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs},
  author    = {Chiang, Hung-Yueh and Chang, Chi-Chih and Lu, Yu-Chen and Lin, Chien-Yu and Wu, Kai-Chiang and Abdelfattah, Mohamed S. and Marculescu, Diana},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
}
```
