Releases: intel/auto-round

v0.4.6

24 Feb 09:23

Highlights:

1. Set torch compile to False by default in #447
2. Fix a packing hang and force FP16 at export in #430
3. Align auto_quantizer with Transformers 4.49 in #437

What's Changed

Full Changelog: v0.4.5...v0.4.6

v0.4.5

27 Jan 12:12

Highlights:
We have enhanced support for extremely large models with the following updates:

Multi-Card Tuning Support: Added basic (naive) support for multi-GPU tuning (#415).

Accelerated Packing Stage: Improved packing speed (2X-4X) for the AutoGPTQ and AutoAWQ formats by leveraging CUDA (#407).

Deepseek V3 GGUF Export: Introduced support for exporting models to the Deepseek V3 GGUF format (#416).
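
For context, here is a minimal Python sketch of what a GGUF export could look like with the AutoRound API. The model name is hypothetical and the "gguf:q4_0" format string is an assumption (the release notes only state that GGUF export is supported), so check the project README for the exact supported names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Hypothetical model name, used only for illustration.
model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weight-only tuning; quantize() runs the AutoRound optimization.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()

# "gguf:q4_0" is an assumed format string; the release notes only say that
# GGUF export (q4_0/q4_1, and Deepseek V3) is supported.
autoround.save_quantized("./opt-125m-gguf", format="gguf:q4_0")
```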

What's Changed

Full Changelog: v0.4.4...v0.4.5

v0.4.4 release

10 Jan 01:47

Highlights:
1. Fix install issue in #387
2. Support exporting GGUF q4_0 and q4_1 formats in #393
3. Fix LLM command-line seqlen issue in #399

What's Changed

Full Changelog: v0.4.3...v0.4.4

v0.4.3: bug fix release

16 Dec 03:24

Highlights:
Fix incorrect device setting in AutoRound format inference by @WeiweiZhang1 in #383
Remove the dependency on AutoGPTQ by @XuehaoSun in #380

What's Changed

Full Changelog: v0.4.2...v0.4.3

v0.4.2: bug fix release

09 Dec 09:44

Highlights

1. Fix AutoAWQ exporting issue
2. Remove bias exporting when possible in the AutoGPTQ format

What's Changed

Full Changelog: v0.4.1...v0.4.2

v0.4.1: bug fix release

27 Nov 09:53

Highlights:

• Fixed vLLM calibration infinite loop issue
• Corrected the default value for the sym argument in the API configuration

What's Changed

Full Changelog: v0.4...v0.4.1

v0.4

22 Nov 13:32

Highlights

[Experimental Feature] We provide API support for VLM models (a sketch follows this list)
[Kernel] We add IPEX support for Intel CPU
[Bug fix] We fix a tuning bug for the glm4 model
[Enhancement] Better align gradient_accumulate_steps behavior for varied-length inputs
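
Below is a rough sketch of what the experimental VLM API could look like, assuming an AutoRoundMLLM entry point that mirrors the text-only AutoRound class; the class signature and the Qwen2-VL checkpoint are assumptions, so consult the repository examples for the actual usage.

```python
from transformers import AutoProcessor, AutoTokenizer, Qwen2VLForConditionalGeneration
from auto_round import AutoRoundMLLM  # assumed entry point for VLM tuning

# Hypothetical multimodal model, used only for illustration.
model_name = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

# Assumed signature: mirrors AutoRound(model, tokenizer, ...) with the
# multimodal processor passed as an extra argument.
autoround = AutoRoundMLLM(model, tokenizer, processor, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./qwen2-vl-w4", format="auto_round")
```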

What's Changed

Full Changelog: v0.3.1...v0.4

Intel® auto-round v0.3.1 Release

21 Oct 04:12

Release Highlights:

New Features:

Full-Range Symmetric Quantization: We’ve introduced full-range symmetric quantization, which often matches or even exceeds the performance of asymmetric quantization, especially at lower bit widths, such as 2.

Command-Line Support: You can now quantize models from the command line with auto-round --model xxx --format xxx.

Default Exporting Format Change: The default format has been updated to auto_round instead of auto_gptq.

Multi-threaded Packing: Up to 2X speedup in the packing phase.
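
To illustrate the full-range symmetric quantization and the new auto_round default export format together, here is a minimal Python sketch; the model name is hypothetical and the keyword arguments are assumed to match the README of this era rather than being quoted from it.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Hypothetical 2-bit example; full-range symmetric quantization is reported
# to hold up well at very low bit widths.
model_name = "facebook/opt-125m"  # illustration only
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=2, group_size=64, sym=True)
autoround.quantize()

# "auto_round" is now the default export format (previously "auto_gptq").
autoround.save_quantized("./opt-125m-w2-sym", format="auto_round")
```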

Bug Fixes:

Resolved Missing Cached Position Embeddings: Fixed an issue with missing cached position embeddings in Transformers version 4.45.2.

Mutable Default Values Issue: Addressed problems related to mutable default values.

3-Bit Packing Bug: Fixed a 3-bit packing bug for the AutoGPTQ format.

What's Changed

New Contributors

Full Changelog: v0.3...v0.3.1

Intel® auto-round v0.3 Release

14 Aug 11:33

Highlights:

• Broader Device Support: Expanded support for CPU, HPU, and CUDA inference in the AutoRound format, resolving the 2-bit accuracy issue (see the inference sketch after the Lowlights section).
• New Recipes and Model Releases
• Experimental Features: Introduced several experimental features, including activation quantization and mx_fp, with promising early results in AutoRound.
• Multimodal Model Support: Extended tuning and inference capabilities to several multimodal models.

Lowlights:

• Implemented support for low_cpu_mem_usage, the auto_awq format, calibration dataset concatenation, and calibration datasets with chat templates.
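
As a rough sketch of cross-device inference in the AutoRound format via the Transformers integration, referenced in the Broader Device Support bullet above: the AutoRoundConfig import, the backend argument, and the quantized model path are assumptions based on later documentation, not the documented v0.3 API.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig  # assumed import for inference backend selection

# Hypothetical quantized checkpoint stored in the AutoRound format.
quantized_path = "./llama3-8b-w4g128-autoround"

# Backend selection ("cpu", "hpu", or "cuda") is assumed to be controlled
# via the quantization config; check the repository docs for the exact knob.
quantization_config = AutoRoundConfig(backend="cpu")
model = AutoModelForCausalLM.from_pretrained(
    quantized_path, device_map="cpu", quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained(quantized_path)

inputs = tokenizer("There is a girl who likes adventure,", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```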

Intel® auto-round v0.2 Release

30 May 02:13

Overview

We supported the Intel XPU format and implemented lm-head quantization and inference, reducing the model size from 5.4GB to 4.7GB for LLAMA3 at W4G128. Additionally, we supported both local and mixed online datasets for calibration. We also optimized memory usage and tuning cost: with disable_low_gpu_mem_usage set, the calibration process now takes approximately 20 minutes for 7B models and 2.5 hours for 70B models with 512 samples.
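
As a rough illustration of the calibration setup described above, the sketch below assumes the keyword names nsamples, seqlen, and low_gpu_mem_usage (the release text refers to a disable_low_gpu_mem_usage switch in the example scripts), as well as the export format string; treat all of them as assumptions against the v0.2 API.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Hypothetical 7B model; the release notes quote roughly 20 minutes for 7B models
# with 512 calibration samples when the low-GPU-memory mode is disabled.
model_name = "meta-llama/Llama-2-7b-hf"  # illustration only
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    nsamples=512,             # assumed name for the calibration sample count
    seqlen=2048,              # assumed calibration sequence length
    low_gpu_mem_usage=False,  # assumed API equivalent of disable_low_gpu_mem_usage
)
autoround.quantize()

# Export format string is an assumption; AutoGPTQ-style export was available at v0.2.
autoround.save_quantized("./llama2-7b-w4g128", format="auto_gptq")
```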

Others:

More accuracy data as presented in [paper](https://arxiv.org/pdf/2309.05516) and [low_bit_open_llm_leaderboard](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard)

More technical details as presented in [paper](https://arxiv.org/pdf/2309.05516)

Known issues:

There is a large discrepancy between the GPTQ model and the QDQ model for asymmetric quantization in some scenarios. We are working on it.