13 changes: 10 additions & 3 deletions README.md
@@ -22,7 +22,7 @@

AutoRound is an advanced quantization toolkit designed for Large Language Models (LLMs) and Vision-Language Models (VLMs).
It achieves high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging **sign-gradient descent** and providing broad hardware compatibility.
See our [paper](https://arxiv.org/pdf/2309.05516) for more details. For usage instructions, please refer to the [User Guide](./docs/step_by_step.md).
See our papers [SignRoundV1](https://arxiv.org/pdf/2309.05516) and [SignRoundV2](http://arxiv.org/abs/2512.04746) for more details. For usage instructions, please refer to the [User Guide](./docs/step_by_step.md).

<p align="center">
<img src="docs/imgs/autoround_overview.png" alt="AutoRound Overview" width="80%">
@@ -39,7 +39,7 @@ See our [paper](https://arxiv.org/pdf/2309.05516) for more details. For usage in

* [2025/10] AutoRound has been integrated into **SGLang**: [*Usage*](https://docs.sglang.io/advanced_features/quantization.html#using-auto-round), [*LMSYS Blog*](https://lmsys.org/blog/2025-11-13-AutoRound/), [*X post*](https://x.com/lmsysorg/status/1991977019220148650?s=20), [*Intel blog*](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/AutoRound-Meets-SGLang-Enabling-Quantized-Model-Inference-with/post/1727196), [*Linkedin*](https://www.linkedin.com/feed/update/urn:li:activity:7397742859354857472).

* [2025/10] A **mix precision** algorithm is available to generate schemes in minutes: [*Usage*](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme), [*Accuracy*](./docs/auto_scheme_acc.md).
* [2025/10] A **mixed precision** algorithm is available to generate schemes in minutes: [*Usage*](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme), [*Accuracy*](./docs/auto_scheme_acc.md).

* [2025/09] **MXFP4** and **NVFP4** dtypes are available: [*Accuracy*](./docs/mxnv_acc.md).

@@ -339,8 +339,15 @@ print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))


## Publications & Events
[SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs](https://arxiv.org/abs/2512.04746) (202512 paper)

[Publication List](./docs/publication_list.md).
[Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs](https://aclanthology.org/2024.findings-emnlp.662/) (202309 paper)

[TEQ: Trainable Equivalent Transformation for Quantization of LLMs](https://arxiv.org/abs/2310.10944) (202310 paper)

[Effective Post-Training Quantization for Large Language Models](https://medium.com/intel-analytics-software/effective-post-training-quantization-for-large-language-models-with-enhanced-smoothquant-approach-93e9d104fb98) (202304 blog)

Check out [Full Publication List](./docs/publication_list.md).

## Acknowledgement
Special thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.
3 changes: 2 additions & 1 deletion auto_round/__main__.py
@@ -303,6 +303,7 @@ def __init__(self, *args, **kwargs):
help="Quantize the lm_head. " "Usually kept in higher precision for better output quality.",
)
scheme.add_argument(
"--ignore_layers",
"--fp_layers",
default="",
type=str,
@@ -599,7 +600,7 @@ def tune(args):
super_bits=args.super_bits,
super_group_size=args.super_group_size,
quant_lm_head=args.quant_lm_head,
fp_layers=args.fp_layers,
ignore_layers=args.ignore_layers,
static_kv_dtype=args.static_kv_dtype,
static_attention_dtype=args.static_attention_dtype,
)
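For context, here is a self-contained sketch of the argparse aliasing used above: because `--ignore_layers` is listed first, both spellings store into `args.ignore_layers`, so scripts that still pass `--fp_layers` keep working. The help text below is illustrative, not the project's actual wording.

```python
import argparse

# Standalone illustration of the option aliasing added in auto_round/__main__.py:
# the first long option ("--ignore_layers") determines the destination attribute,
# and "--fp_layers" remains accepted as an alias.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--ignore_layers",
    "--fp_layers",
    default="",
    type=str,
    help="Comma-separated names of layers to keep un-quantized.",
)

print(parser.parse_args(["--ignore_layers", "lm_head"]).ignore_layers)  # lm_head
print(parser.parse_args(["--fp_layers", "lm_head"]).ignore_layers)      # lm_head
```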
56 changes: 32 additions & 24 deletions auto_round/autoround.py
@@ -82,9 +82,7 @@ def __new__(
device_map: Union[str, torch.device, int, dict] = 0,
enable_torch_compile: bool = False,
seed: int = 42,
# for adam
enable_adam: bool = False,
# for MLLM and Diffusion
extra_config: ExtraConfig = None,
enable_alg_ext: bool = None,
disable_opt_rtn: bool = None,
@@ -96,36 +94,42 @@
Args:
model (torch.nn.Module | str): Model object or model name to load.
tokenizer: Tokenizer for text processing. Required if `model` is not a string and `iters > 0`.
platform: The platform to download the pretrained model from; options: ["hf", "model_scope"].
scheme (str | dict | QuantizationScheme): A preset scheme that defines the quantization configurations.
bits (int, optional): Weight quantization bits. Defaults to 4.
group_size (int, optional): Weight quantization group size. Defaults to 128.
sym (bool, optional): Symmetric weight quantization. Defaults to True.
layer_config (dict, optional): Layer-wise quantization config. Defaults to None.
batch_size (int, optional): Calibration batch size. Defaults to 8.
amp (bool, optional): Use AMP for tuning. Defaults to True.
device (str | torch.device | int, optional): Compute device. Defaults to 0.
dataset (str | list | tuple | DataLoader, optional): Calibration data. Defaults to "NeelNanda/pile-10k".
enable_minmax_tuning (bool, optional): Enable weight min-max tuning. Defaults to True.
lr (float, optional): Learning rate; if None, set to 1.0 / iters except when iters==0.
minmax_lr (float, optional): Learning rate for min-max tuning; defaults to `lr`.
low_gpu_mem_usage (bool, optional): Lower GPU memory mode. Defaults to False.
low_cpu_mem_usage (bool, optional): Lower CPU memory mode. Defaults to False.
iters (int, optional): Optimization iterations. Defaults to 200.
seqlen (int, optional): Calibration sequence length. Defaults to 2048.
nsamples (int, optional): Number of calibration samples. Defaults to 128.
seed (int, optional): Random seed. Defaults to 42.
batch_size (int, optional): Calibration batch size. Defaults to 8.
gradient_accumulate_steps (int, optional): Gradient accumulation steps. Defaults to 1.
low_gpu_mem_usage (bool, optional): Lower GPU memory mode. Defaults to False.
device_map (str | dict, optional): Device map for each module. Defaults to 0.
enable_torch_compile (bool, optional): Enable torch.compile to reduce quantization cost. Defaults to False.
seed (int, optional): Random seed. Defaults to 42.
enable_adam (bool, optional): Enable Adam-based optimizer. Defaults to False.
extra_config (ExtraConfig, optional): Bundle of additional configuration options. Defaults to None.
enable_alg_ext (bool, optional): Enable algorithm extension (primarily for INT2)
for better accuracy. Defaults to False.
disable_opt_rtn (bool, optional): Disable RTN-mode optimization (iters=0) for fast quantization
with lower accuracy. Defaults to False.
low_cpu_mem_usage (bool, optional): Lower CPU memory mode. Defaults to False.

bits (int, optional): Weight quantization bits. Defaults to 4.
group_size (int, optional): Weight quantization group size. Defaults to 128.
sym (bool, optional): Symmetric weight quantization. Defaults to True.
data_type (str, optional): Weight data type string, e.g., "int". Defaults to "int".
act_bits (int, optional): Activation quantization bits. Defaults to 16.
act_group_size (int, optional): Activation group size. Defaults to None.
act_sym (bool, optional): Symmetric activation quantization. Defaults to None.
act_data_type (str, optional): Activation data type; inherits weight dtype if None and act_bits < 16.
act_dynamic (bool, optional): Dynamic activation quantization. Defaults to True.
enable_torch_compile (bool, optional): Enable torch.compile for quant blocks/layers. Defaults to False.
device_map (str | dict, optional): Device placement map. Defaults to None.
disable_opt_rtn (bool, optional): Disable RTN-mode optimization (iters=0). Defaults to False.
enable_alg_ext (bool, optional): Enable algorithm extension (primarily for INT2). Defaults to False.
model_dtype (str): Model dtype used to load the pre-trained model.
amp (bool, optional): Use AMP for tuning. Defaults to True.
enable_minmax_tuning (bool, optional): Enable weight min-max tuning. Defaults to True.
lr (float, optional): Learning rate; if None, set to 1.0 / iters except when iters==0.
minmax_lr (float, optional): Learning rate for min-max tuning; defaults to `lr`.

**kwargs: Backward compatible options:
- enable_alg_ext, quant_lm_head, lr, lr_scheduler, sampler, not_use_best_mse, dynamic_max_gap,
super_group_size, super_bits, scale_dtype ("fp16" etc.),
@@ -140,18 +144,17 @@ def __new__(

>>> layer_config = {
... "layer1": {
... "data_type": "int",
... "bits": 4,
... "bits": 3,
... "group_size": 128,
... "sym": True,
... "act_data_type": None,
... "act_bits": 16,
... "act_group_size": None,
... "act_sym": None,
... },
... "layer2": {
... "W8A16"
... }
... # ...
... }
"""

model_cls = []

if (extra_config and not extra_config.mllm_config.is_default()) or is_mllm_model(model, platform=platform):
@@ -179,6 +182,11 @@ def __new__(
kwargs["enable_alg_ext"] = enable_alg_ext
if disable_opt_rtn is not None:
kwargs["disable_opt_rtn"] = disable_opt_rtn
if "fp_layers" in kwargs:
logger.warning_once(
"'fp_layers' is deprecated, please use 'ignore_layers' to set layers not to be quantized."
)
kwargs["ignore"] = kwargs.pop("fp_layers")
ar = dynamic_compressor(
model=model,
tokenizer=tokenizer,
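At the API level, the rename shows up as below — a minimal usage sketch based on the test updates later in this PR. The model name, scheme string, and output directory are placeholders; adjust them to your setup.

```python
# Minimal sketch of the renamed keyword, grounded in the test changes in this PR.
from transformers import AutoModelForCausalLM, AutoTokenizer

from auto_round import AutoRound

model_name = "facebook/opt-125m"  # placeholder tiny model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

ar = AutoRound(
    model,
    tokenizer=tokenizer,
    scheme="W4A16",           # assumed preset scheme string
    iters=0,                  # RTN mode for a quick run
    nsamples=2,
    seqlen=32,
    ignore_layers="lm_head",  # new name; a deprecated fp_layers=... is remapped with a warning
)
quantized_model, save_folder = ar.quantize_and_save(format="auto_round", output_dir="./tmp_autoround")
```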
6 changes: 3 additions & 3 deletions auto_round/compressors/base.py
@@ -312,7 +312,7 @@ def __init__(
self.platform = platform
self.quant_lm_head = kwargs.pop("quant_lm_head", False)

self.fp_layers = kwargs.pop("fp_layers", "")
self.ignore_layers = kwargs.pop("ignore_layers", "")
self.supported_types = SUPPORTED_LAYER_TYPES
self.inner_supported_types = INNER_SUPPORTED_LAYER_TYPES
self.scale_dtype = convert_dtype_str2torch(scale_dtype)
@@ -504,7 +504,7 @@ def _gen_auto_scheme(
self.supported_types,
self.inner_supported_types,
self.quant_block_list,
self.fp_layers,
self.ignore_layers,
self.quant_lm_head,
enable_gguf_official_mixed=False,
is_mllm=self.mllm,
@@ -1398,7 +1398,7 @@ def configure_layer_config(self, enable_gguf_official_mixed: None | bool = True)
self.supported_types,
self.inner_supported_types,
self.quant_block_list,
self.fp_layers,
self.ignore_layers,
self.quant_lm_head,
enable_gguf_official_mixed=enable_gguf_official_mixed,
is_mllm=self.mllm,
6 changes: 3 additions & 3 deletions auto_round/compressors/config.py
@@ -58,7 +58,7 @@ def __init__(
super_group_size: int = None,
static_kv_dtype: Union[str, torch.dtype] = None,
quant_lm_head: bool = False,
fp_layers: str = None,
ignore_layers: str = None,
# mllm
processor: Callable = None,
image_processor: Callable = None,
@@ -139,7 +139,7 @@ def __init__(
super_group_size=super_group_size,
static_kv_dtype=static_kv_dtype,
quant_lm_head=quant_lm_head,
fp_layers=fp_layers,
ignore_layers=ignore_layers,
)
self.mllm_config = MLLMExtraConfig(
processor=processor,
@@ -277,7 +277,7 @@ class SchemeExtraConfig(BaseExtraConfig):
static_kv_dtype: Union[str, torch.dtype] = None
static_attention_dtype: Union[str, torch.dtype] = None
quant_lm_head: bool = False
fp_layers: str = None
ignore_layers: str = None


@dataclass
16 changes: 8 additions & 8 deletions auto_round/compressors/utils.py
@@ -211,7 +211,7 @@ def set_layer_config(
supported_types: tuple,
inner_supported_types: tuple,
quant_block_list=None,
fp_layers: str = "",
ignore_layers: str = "",
quant_lm_head: bool = False,
enable_gguf_official_mixed: bool = True,
is_mllm: bool = False,
@@ -261,8 +261,8 @@ def normalize_item(item: Union[str, dict, "QuantizationScheme"], layer_name: str
scheme_keys = tuple(f.name for f in fields(QuantizationScheme)) + ("scale_dtype",)
layer_config = copy.deepcopy(layer_config) or {}

# 1. fp_layers -> force 16
for name in get_fp_layer_names(model, fp_layers):
# 1. ignore_layers -> force 16
for name in get_fp_layer_names(model, ignore_layers):
layer_config[name] = {
"bits": 16,
"act_bits": 16,
@@ -852,7 +852,7 @@ def _set_config(config, target_config):
return layer_config, gguf_format_config


def get_fp_layer_names(model: torch.nn.Module, fp_layers: str):
def get_fp_layer_names(model: torch.nn.Module, ignore_layers: str):
"""Identifies and returns layers in the model to exclude from quantization.

This function processes a comma-separated list of full-precision (FP) layers,
@@ -861,7 +861,7 @@ def get_fp_layer_names(model: torch.nn.Module, fp_layers: str):

Args:
model (torch.nn.Module): The model whose layers will be inspected.
fp_layers (str): A comma-separated string of layer names to be excluded
ignore_layers (str): A comma-separated string of layer names to be excluded
from quantization. Whitespace is ignored in this string.

Returns:
@@ -870,16 +870,16 @@
"""
from auto_round.utils import SUPPORTED_LAYER_TYPES

if not fp_layers:
if not ignore_layers:
return []
fp_layers = fp_layers.replace(" ", "").split(",")
ignore_layers = ignore_layers.replace(" ", "").split(",")
all_layer_names = []
for n, m in model.named_modules():
if type(m) in SUPPORTED_LAYER_TYPES:
all_layer_names.append(n)
not_to_quantized_layers = []

for fp_layer in fp_layers:
for fp_layer in ignore_layers:
if fp_layer == "":
continue
if fp_layer in all_layer_names:
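The hunk above is truncated by the diff view. As a toy illustration (not the library code), this is how a comma-separated `ignore_layers` string can be expanded against dotted module names: exact names are taken directly, and other entries act as substring patterns, matching the behavior the docstring above describes. The helper name and sample layer names are made up.

```python
# Toy illustration only: expand a comma-separated ignore string against dotted
# module names. Exact matches are kept as-is; other entries act as substring
# patterns (e.g. "self_attn" matches every attention projection).
def expand_ignore_layers(ignore_layers: str, all_layer_names: list) -> list:
    ignored = []
    for entry in ignore_layers.replace(" ", "").split(","):
        if not entry:
            continue
        if entry in all_layer_names:
            ignored.append(entry)
        else:
            ignored.extend(name for name in all_layer_names if entry in name)
    return ignored


names = [
    "model.decoder.layers.0.self_attn.k_proj",
    "model.decoder.layers.0.mlp.fc1",
    "lm_head",
]
print(expand_ignore_layers("self_attn,lm_head", names))
# ['model.decoder.layers.0.self_attn.k_proj', 'lm_head']
```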
2 changes: 1 addition & 1 deletion auto_round/export/export_to_autoround/export.py
@@ -288,7 +288,7 @@ def save_quantized_as_autoround(
neq_keys = check_neq_config(cfg, **{k: quantization_config[k] for k in scheme_keys})
if len(neq_keys) > 0:
extra_config[layer_name] = {}
for key in scheme_keys:
for key in neq_keys:
if cfg.get(key) is not None:
extra_config[layer_name][key] = cfg[key]
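The one-line fix above means only keys whose per-layer value differs from the global quantization config are written into `extra_config`, instead of all scheme keys. A toy sketch of that intent, with made-up values:

```python
# Toy values illustrating the fix: only differing keys are exported per layer.
global_cfg = {"bits": 4, "group_size": 128, "sym": True}
layer_cfg = {"bits": 8, "group_size": 128, "sym": True}

neq_keys = [k for k in global_cfg if layer_cfg.get(k) != global_cfg[k]]
extra_config = {k: layer_cfg[k] for k in neq_keys if layer_cfg.get(k) is not None}
print(extra_config)  # {'bits': 8} -- instead of repeating every scheme key
```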

2 changes: 1 addition & 1 deletion docs/tips_and_tricks.md
@@ -135,7 +135,7 @@ cause the outputs to become excessively large. Even when a fake QDQ model is fin
with the INT4 kernel.

**Suggestion: It is recommended not to quantize layers with large output values**. In AutoRound (>=0.4), you can
use `--fp_layers "xxx,xxx"` to exclude these layers.
use `--ignore_layers "xxx,xxx"` to exclude these layers.

**Reasoning**: While adjusting the quantization configuration (symmetric/asymmetric) or using `clamp_to_range` may
provide some benefit, configuration tuning can be tedious, and `clamp_to_range` is not always effective. Therefore, it
14 changes: 0 additions & 14 deletions test/test_cpu/test_act_quantization.py
@@ -173,15 +173,7 @@ def test_WOQ_config_INT_saving(self, tiny_opt_model_path, dataloader):

# check inblock layer config values
kproj_config = extra_config["model.decoder.layers.1.self_attn.k_proj"]
assert "act_data_type" in kproj_config.keys() and kproj_config["act_data_type"] == "float"
assert "act_bits" in kproj_config.keys() and kproj_config["act_bits"] == 16
assert "act_group_size" in kproj_config.keys() and kproj_config["act_group_size"] == 128
assert "act_sym" in kproj_config.keys() and not kproj_config["act_sym"]
assert "data_type" in kproj_config.keys() and kproj_config["data_type"] == "int"
assert "bits" in kproj_config.keys() and kproj_config["bits"] == 8
assert "group_size" in kproj_config.keys() and kproj_config["group_size"] == 128
assert "sym" in kproj_config.keys() and not kproj_config["sym"]
assert "act_dynamic" in kproj_config.keys() and kproj_config["act_dynamic"]
shutil.rmtree(quantized_model_path, ignore_errors=True)

def test_act_config_FP8_saving(self, tiny_opt_model_path, dataloader):
@@ -214,12 +206,6 @@ def test_act_config_FP8_saving(self, tiny_opt_model_path, dataloader):

# check inblock layer config values
kproj_config = extra_config["model.decoder.layers.0.self_attn.k_proj"]
assert "act_data_type" in kproj_config.keys() and kproj_config["act_data_type"] == "fp"
assert "act_bits" in kproj_config.keys() and kproj_config["act_bits"] == 16
assert "act_group_size" in kproj_config.keys() and kproj_config["act_group_size"] == 0
assert "act_sym" in kproj_config.keys() and kproj_config["act_sym"]
assert "data_type" in kproj_config.keys() and kproj_config["data_type"] == "fp"
assert "bits" in kproj_config.keys() and kproj_config["bits"] == 8
assert "group_size" in kproj_config.keys() and kproj_config["group_size"] == 0
assert "sym" in kproj_config.keys() and kproj_config["sym"]
shutil.rmtree(quantized_model_path, ignore_errors=True)
6 changes: 3 additions & 3 deletions test/test_cpu/test_gguf_format.py
@@ -230,7 +230,7 @@ def test_qtype_setting(self):
ar.supported_types,
ar.inner_supported_types,
ar.quant_block_list,
ar.fp_layers,
ar.ignore_layers,
ar.quant_lm_head,
enable_gguf_official_mixed=True,
is_mllm=ar.mllm,
@@ -249,7 +249,7 @@ def test_qtype_setting(self):
ar.supported_types,
ar.inner_supported_types,
ar.quant_block_list,
ar.fp_layers,
ar.ignore_layers,
ar.quant_lm_head,
enable_gguf_official_mixed=True,
is_mllm=ar.mllm,
@@ -271,7 +271,7 @@ def test_qtype_setting(self):
ar.supported_types,
ar.inner_supported_types,
ar.quant_block_list,
ar.fp_layers,
ar.ignore_layers,
ar.quant_lm_head,
enable_gguf_official_mixed=True,
is_mllm=ar.mllm,
2 changes: 1 addition & 1 deletion test/test_cpu/test_moe_alignment.py
@@ -41,7 +41,7 @@ def test_moe_scale_alignment_fp8_static(setup_deepseek_v2_lite):
nsamples=4,
iters=0, # RTN for faster testing
seqlen=32,
fp_layers="self_attn,lm_head",
ignore_layers="self_attn,lm_head",
)
quantized_model, save_folder = autoround.quantize_and_save(format="auto_round", output_dir=output_dir)

4 changes: 2 additions & 2 deletions test/test_cpu/test_moe_model.py
@@ -65,7 +65,7 @@ def quantize_model(model, tokenizer, output_dir, scheme, iters=0):
nsamples=2,
iters=iters,
seqlen=32,
fp_layers="self_attn,router,lm_head,mlp.gate",
ignore_layers="self_attn,router,lm_head,mlp.gate",
)
quantized_model, save_folder = autoround.quantize_and_save(format="auto_round", output_dir=output_dir)
return quantized_model
@@ -154,7 +154,7 @@ def test_qwen3_vl_moe_mxfp(setup_qwen3_vl_moe):
nsamples=2,
seqlen=32,
iters=1,
fp_layers="self_attn,lm_head,mlp.gate",
ignore_layers="self_attn,lm_head,mlp.gate",
)
quantized_model, _ = autoround.quantize_and_save(format="auto_round", output_dir=output_dir)
assert quantized_model is not None, "Quantized model should not be None."