Releases: intel/auto-round

v0.4.6

24 Feb 09:23

Highlights:

1. Set torch compile to False by default in #447
2. Fix a packing hang and force FP16 at export in #430
3. Align auto_quantizer with Transformers 4.49 in #437

What's Changed

Full Changelog: v0.4.5...v0.4.6

v0.4.5

27 Jan 12:12

Highlights:
We have enhanced support for extremely large models with the following updates:

Multi-Card Tuning Support: Added basic (naive) support for multi-GPU tuning (#415).

Accelerated Packing Stage: Improved packing speed (2X-4X) for the AutoGPTQ and AutoAWQ formats by leveraging CUDA (#407).

Deepseek V3 GGUF Export: Introduced support for exporting models to the Deepseek V3 GGUF format (#416).
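
For context, here is a minimal Python sketch of what a GGUF export could look like with the AutoRound API. The model name is hypothetical and the "gguf:q4_0" format string is an assumption (the release notes only state that GGUF export is supported), so check the project README for the exact supported names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Hypothetical model name, used only for illustration.
model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weight-only tuning; quantize() runs the AutoRound optimization.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()

# "gguf:q4_0" is an assumed format string; the release notes only say that
# GGUF export (q4_0/q4_1, and Deepseek V3) is supported.
autoround.save_quantized("./opt-125m-gguf", format="gguf:q4_0")
```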

What's Changed

Full Changelog: v0.4.4...v0.4.5

v0.4.4 release

10 Jan 01:47

Highlights:
1. Fix install issue in #387
2. Support exporting GGUF q4_0 and q4_1 formats in #393
3. Fix LLM command-line seqlen issue in #399

What's Changed

Full Changelog: v0.4.3...v0.4.4

v0.4.3: bug fix release

16 Dec 03:24

Highlights:
Fix incorrect device setting in AutoRound format inference by @WeiweiZhang1 in #383
Remove the dependency on AutoGPTQ by @XuehaoSun in #380

What's Changed

Full Changelog: v0.4.2...v0.4.3

v0.4.2: bug fix release

09 Dec 09:44

Highlights

1. Fix AutoAWQ exporting issue
2. Remove bias exporting when possible in the AutoGPTQ format

What's Changed

Full Changelog: v0.4.1...v0.4.2

v0.4.1: bug fix release

27 Nov 09:53

Highlights:

• Fixed vLLM calibration infinite loop issue
• Corrected the default value for the sym argument in the API configuration

What's Changed

Full Changelog: v0.4...v0.4.1

v0.4

22 Nov 13:32

Highlights

[Experimental Feature] We provide API support for VLM models (a sketch follows this list)
[Kernel] We add IPEX support for Intel CPU
[Bug fix] We fix a tuning bug for the glm4 model
[Enhancement] Better align gradient_accumulate_steps behavior for varied-length inputs
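
Below is a rough sketch of what the experimental VLM API could look like, assuming an AutoRoundMLLM entry point that mirrors the text-only AutoRound class; the class signature and the Qwen2-VL checkpoint are assumptions, so consult the repository examples for the actual usage.

```python
from transformers import AutoProcessor, AutoTokenizer, Qwen2VLForConditionalGeneration
from auto_round import AutoRoundMLLM  # assumed entry point for VLM tuning

# Hypothetical multimodal model, used only for illustration.
model_name = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

# Assumed signature: mirrors AutoRound(model, tokenizer, ...) with the
# multimodal processor passed as an extra argument.
autoround = AutoRoundMLLM(model, tokenizer, processor, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./qwen2-vl-w4", format="auto_round")
```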

What's Changed

Full Changelog: v0.3.1...v0.4

Intel® auto-round v0.3.1 Release

21 Oct 04:12

Release Highlights:

New Features:

Full-Range Symmetric Quantization: We’ve introduced full-range symmetric quantization, which often matches or even exceeds the performance of asymmetric quantization, especially at lower bit widths, such as 2.

Command-Line Support: You can now quantize models from the command line with auto-round --model xxx --format xxx.

Default Exporting Format Change: The default format has been updated to auto_round instead of auto_gptq.

Multi-threaded Packing: Up to 2X speedup in the packing phase.
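
To illustrate the full-range symmetric quantization and the new auto_round default export format together, here is a minimal Python sketch; the model name is hypothetical and the keyword arguments are assumed to match the README of this era rather than being quoted from it.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Hypothetical 2-bit example; full-range symmetric quantization is reported
# to hold up well at very low bit widths.
model_name = "facebook/opt-125m"  # illustration only
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=2, group_size=64, sym=True)
autoround.quantize()

# "auto_round" is now the default export format (previously "auto_gptq").
autoround.save_quantized("./opt-125m-w2-sym", format="auto_round")
```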

Bug Fixes:

Resolved Missing Cached Position Embeddings: Fixed an issue with missing cached position embeddings in Transformers version 4.45.2.

Mutable Default Values Issue: Addressed problems related to mutable default values.

3-Bit Packing Bug: Fixed a 3-bit packing bug for the AutoGPTQ format.

What's Changed

New Contributors

Full Changelog: v0.3...v0.3.1

Intel® auto-round v0.3 Release

14 Aug 11:33

Highlights:

• Broader Device Support: Expanded support for CPU, HPU, and CUDA inference in the AutoRound format, resolving the 2-bit accuracy issue (see the inference sketch after the Lowlights section).
• New Recipes and Model Releases
• Experimental Features: Introduced several experimental features, including activation quantization and mx_fp, with promising early results in AutoRound.
• Multimodal Model Support: Extended tuning and inference capabilities to several multimodal models.

Lowlights:

• Implemented support for low_cpu_mem_usage, the auto_awq format, calibration dataset concatenation, and calibration datasets with chat templates.
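
As a rough sketch of cross-device inference in the AutoRound format via the Transformers integration, referenced in the Broader Device Support bullet above: the AutoRoundConfig import, the backend argument, and the quantized model path are assumptions based on later documentation, not the documented v0.3 API.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig  # assumed import for inference backend selection

# Hypothetical quantized checkpoint stored in the AutoRound format.
quantized_path = "./llama3-8b-w4g128-autoround"

# Backend selection ("cpu", "hpu", or "cuda") is assumed to be controlled
# via the quantization config; check the repository docs for the exact knob.
quantization_config = AutoRoundConfig(backend="cpu")
model = AutoModelForCausalLM.from_pretrained(
    quantized_path, device_map="cpu", quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained(quantized_path)

inputs = tokenizer("There is a girl who likes adventure,", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```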

Intel® auto-round v0.2 Release

30 May 02:13

Overview

We supported the Intel XPU format and implemented lm-head quantization and inference, reducing the model size from 5.4GB to 4.7GB for LLAMA3 at W4G128. Additionally, we supported both local and mixed online datasets for calibration. We also optimized memory usage and tuning cost: with disable_low_gpu_mem_usage set, the calibration process now takes approximately 20 minutes for 7B models and 2.5 hours for 70B models with 512 samples.
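
As a rough illustration of the calibration setup described above, the sketch below assumes the keyword names nsamples, seqlen, and low_gpu_mem_usage (the release text refers to a disable_low_gpu_mem_usage switch in the example scripts), as well as the export format string; treat all of them as assumptions against the v0.2 API.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Hypothetical 7B model; the release notes quote roughly 20 minutes for 7B models
# with 512 calibration samples when the low-GPU-memory mode is disabled.
model_name = "meta-llama/Llama-2-7b-hf"  # illustration only
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    nsamples=512,             # assumed name for the calibration sample count
    seqlen=2048,              # assumed calibration sequence length
    low_gpu_mem_usage=False,  # assumed API equivalent of disable_low_gpu_mem_usage
)
autoround.quantize()

# Export format string is an assumption; AutoGPTQ-style export was available at v0.2.
autoround.save_quantized("./llama2-7b-w4g128", format="auto_gptq")
```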

Others:

More accuracy data as presented in [paper](https://arxiv.org/pdf/2309.05516) and [low_bit_open_llm_leaderboard](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard)

More technical details as presented in [paper](https://arxiv.org/pdf/2309.05516)

Known issues:

There is a large discrepancy between the GPTQ model and the QDQ model for asymmetric quantization in some scenarios. We are working on it.