39 changes: 36 additions & 3 deletions README.md
@@ -17,7 +17,7 @@ A more accessible, comprehensive, and efficient toolkit for large model compress
</p>

## 📣Latest News
- [25/11/03] We have released v0.2, adding quantization support for new models such as `GLM-4.6` and `Qwen3-VL`, open-sourcing the Eagle3 speculative decoding training framework, and updating the Diffusion model quantization tools.
- [25/11/05] We have released v0.2, adding quantization support for new models such as `GLM-4.6`, `Qwen3-VL`, and `Qwen3-Omni`, open-sourcing the Eagle3 speculative decoding training framework, and updating the Diffusion model quantization tools.
- [25/09/30] We have released **SpecExit**, the reasoning early-exit algorithm: [[Paper]](http://arxiv.org/abs/2509.24248) | [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/spec_exit.html) | [[vLLM Code]](https://github.com/vllm-project/vllm/pull/27192)🔥🔥🔥
- [25/09/26] We have released **TEQUILA**, the ternary quantization algorithm: [[Paper]](https://arxiv.org/abs/2509.23809) | [[Code]](https://github.com/Tencent/AngelSlim/tree/tequila/TernaryQuant)🔥🔥🔥
- [25/09/24] We now support NVFP4 PTQ quantization for the Qwen3 series models. We also open-source the [Qwen3-32B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-32B_nvfp4) and [Qwen3-235B-A22B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-235B-A22B_nvfp4) weights.
@@ -171,7 +171,7 @@ A more accessible, comprehensive, and efficient toolkit for large model compress
</td>
<td>
<ul style="padding-left: 0; list-style-position: inside;">
<li>Under Development</li>
<li><a href="https://github.com/Tencent/AngelSlim/blob/main/docs/source/models/qwen3_omni/qwen3_omni_quant.md">FP8-Static/Dynamic</a></li>
</ul>
</td>
<td>
@@ -510,7 +510,40 @@ Benchmark results for Qwen2.5VL series models with `BF16`、`FP8-Static`、`FP8-

</details>

#### 1.5 Other Models
#### 1.5 Qwen-Omni Series Models

**Qwen3-Omni Text to Text Benchmark**

Benchmark results for Qwen3-Omni series models with `BF16`, `FP8-Static`, and `FP8-Dynamic` on `aime25`, `gpqa_diamond`, and `mmlu_redux` are as follows:

<table>
<thead>
<tr><th>Model</th><th>Quantization</th><th>aime25</th><th>gpqa_diamond</th><th>mmlu_redux</th></tr>
</thead>
<tbody>
<tr><td rowspan="3">Qwen3-Omni-30B-A3B-Instruct</td><td>BF16</td><td>73.32</td><td>56.77</td><td>88.09</td></tr>
<tr><td>FP8-Static</td><td>71.33</td><td>56.57</td><td>87.91</td></tr>
<tr><td>FP8-Dynamic</td><td>73.33</td><td>55.15</td><td>88.07</td></tr>
</tbody>
</table>

<details>
<summary>Note</summary>

> - The above evaluation results were obtained by deploying with the vLLM framework and averaging over 5 runs (vLLM only supports the thinker component).
> - The hyperparameters used during evaluation are as follows:
> ```json
> {
>     "top_p": 0.95,
>     "temperature": 0.6,
>     "do_sample": true,
>     "max_model_len": 65536
> }
> ```

</details>
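For reference, the deployment settings above map onto vLLM's offline API roughly as follows — a minimal sketch, assuming the `Qwen/Qwen3-Omni-30B-A3B-Instruct` checkpoint id and a placeholder prompt; the actual `aime25`/`gpqa_diamond`/`mmlu_redux` harness is not shown here:

```python
# Minimal reproduction sketch of the evaluation-time settings; the model id
# and prompt are illustrative, not the exact benchmark runner.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-Omni-30B-A3B-Instruct", max_model_len=65536)
sampling = SamplingParams(temperature=0.6, top_p=0.95)  # do_sample=true -> stochastic decoding
outputs = llm.generate(["What is the sum of the first 100 positive integers?"], sampling)
print(outputs[0].outputs[0].text)
```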

#### 1.6 Other Models

Other models such as GLM-4.6, Qwen2.5, and Seed-OSS have been evaluated on benchmarks like `CEVAL`, `MMLU`, and `GSM8K` using quantization strategies including `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ`.

39 changes: 36 additions & 3 deletions README_cn.md
@@ -17,7 +17,7 @@
</p>

## 📣Latest News
- [25/11/03] We released v0.2, adding quantization support for more models including GLM-4.6 and Qwen3-VL, open-sourcing the Eagle3 speculative decoding training framework, and updating the Diffusion model quantization tools.
- [25/11/05] We released v0.2, adding quantization support for more models including GLM-4.6, Qwen3-VL, and Qwen3-Omni, open-sourcing the Eagle3 speculative decoding training framework, and updating the Diffusion model quantization tools.
- [25/09/30] We open-sourced **SpecExit**, a new reasoning early-exit algorithm: [[Paper]](http://arxiv.org/abs/2509.24248) | [[Docs]](https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/spec_exit.html) | [[vLLM Code]](https://github.com/vllm-project/vllm/pull/27192)🔥🔥🔥
- [25/09/30] We released **Tequila**, a new ternary quantization algorithm: [[Paper]](https://arxiv.org/abs/2509.23809) | [[Code]](https://github.com/Tencent/AngelSlim/tree/tequila/TernaryQuant)🔥🔥🔥
- [25/09/24] We added NVFP4 PTQ quantization for the Qwen3 series models and open-sourced the [Qwen3-32B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-32B_nvfp4) and [Qwen3-235B-A22B-NVFP4](https://huggingface.co/AngelSlim/Qwen3-235B-A22B_nvfp4) weights.
@@ -172,7 +172,7 @@
</td>
<td>
<ul style="padding-left: 0; list-style-position: inside;">
<li>Under Development</li>
<li><a href="https://github.com/Tencent/AngelSlim/blob/main/docs/source/models/qwen3_omni/qwen3_omni_quant.md">FP8-Static/Dynamic</a></li>
</ul>
</td>
<td>
@@ -517,7 +517,40 @@ Benchmark results for Qwen2.5VL series models with `BF16`, `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, `I

</details>

#### 1.5 Other Models
#### 1.5 Qwen-Omni Series Models

**Qwen3-Omni Text to Text Benchmark**

Benchmark results for Qwen3-Omni series models with `BF16`, `FP8-Static`, and `FP8-Dynamic` on `aime25`, `gpqa_diamond`, and `mmlu_redux` are as follows:

<table>
<thead>
<tr><th>Model</th><th>Quantization</th><th>aime25</th><th>gpqa_diamond</th><th>mmlu_redux</th></tr>
</thead>
<tbody>
<tr><td rowspan="3">Qwen3-Omni-30B-A3B-Instruct</td><td>BF16</td><td>73.32</td><td>56.77</td><td>88.09</td></tr>
<tr><td>FP8-Static</td><td>71.33</td><td>56.57</td><td>87.91</td></tr>
<tr><td>FP8-Dynamic</td><td>73.33</td><td>55.15</td><td>88.07</td></tr>
</tbody>
</table>

<details>
<summary>Note</summary>

> - The above evaluation results were obtained by deploying with the vLLM framework and averaging over 5 runs (vLLM only supports the thinker component).
> - The hyperparameters used during evaluation are as follows:
> ```json
> {
>     "top_p": 0.95,
>     "temperature": 0.6,
>     "do_sample": true,
>     "max_model_len": 65536
> }
> ```

</details>

#### 1.6 Other Models

Other models such as GLM, Qwen2.5, and Seed-OSS have been evaluated on `CEVAL`, `MMLU`, and `GSM8K` using quantization strategies including `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ`.

17 changes: 14 additions & 3 deletions angelslim/compressor/quant/ptq.py
@@ -14,6 +14,7 @@

import json
import os
import warnings

import torch
from safetensors.torch import load_file
@@ -193,9 +194,19 @@ def _convert(self):
                )
                is not None
            ):
                self.quant_model.act_scales_dict[name] = self.ptq_hook.observer_dict[
                    sub_layer
                ].act_observer.scales()
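                # scales() raises ValueError when this layer received no
                # activations during calibration; fall back to 1.0 below.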
                try:
                    self.quant_model.act_scales_dict[name] = (
                        self.ptq_hook.observer_dict[sub_layer].act_observer.scales()
                    )
                except ValueError:
                    self.quant_model.act_scales_dict[name] = torch.tensor(
                        1.0, device=torch.cuda.current_device()
                    )
                    warnings.warn(
                        f"Not calibrated for {name}. Using default act scale 1.0.",
                        RuntimeWarning,
                        stacklevel=2,
                    )
            if (
                getattr(  # noqa: B009
                    self.ptq_hook.observer_dict[sub_layer], "kv_cache_observer"
1 change: 1 addition & 0 deletions angelslim/data/__init__.py
@@ -6,5 +6,6 @@

from .dataloader import DataLoaderFactory # noqa: F401
from .multimodal_dataset import MultiModalDataset # noqa: F401
from .omni_dataset import OmniDataset # noqa: F401
from .text2image_dataset import Text2ImageDataset # noqa: F401
from .text_dataset import TextDataset # noqa: F401
12 changes: 12 additions & 0 deletions angelslim/data/dataloader.py
@@ -20,6 +20,7 @@

from .base_dataset import BaseDataset
from .multimodal_dataset import MultiModalDataset
from .omni_dataset import OmniDataset
from .text2image_dataset import Text2ImageDataset
from .text_dataset import TextDataset

@@ -39,6 +40,7 @@ def create_data_loader(
    data_type: str = "auto",
    num_workers: int = 0,
    inference_settings: Dict = None,
    use_audio_in_video: bool = False,
    model_name: str = None,
) -> DataLoader:
    """
@@ -98,6 +100,16 @@
            num_samples=num_samples,
            inference_settings=inference_settings,
        )
    elif data_type == "OmniDataset":
        dataset = OmniDataset(
            processor=processor,
            device=device,
            max_length=max_length,
            num_samples=num_samples,
            data_source=data_source,
            is_hf_dataset=not os.path.isfile(data_source),
            use_audio_in_video=use_audio_in_video,
        )
    else:
        raise ValueError(f"Unsupported data type: {data_type}")

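A usage sketch for the new branch (assuming `create_data_loader` is importable as shown and that its leading parameters match the names used in the body above; the model id and file path are placeholders):

```python
from transformers import AutoProcessor

# Hypothetical import path; the factory may instead be reached through
# DataLoaderFactory in angelslim.data.
from angelslim.data.dataloader import create_data_loader

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-Omni-30B-A3B-Instruct")
loader = create_data_loader(
    processor=processor,
    data_source="calib/omni_calib.jsonl",  # local file, so is_hf_dataset=False
    data_type="OmniDataset",
    max_length=4096,
    num_samples=128,
    use_audio_in_video=False,
)
for batch in loader:
    ...  # feed calibration batches to the PTQ observers
```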
2 changes: 1 addition & 1 deletion angelslim/data/multimodal_dataset.py
@@ -16,12 +16,12 @@
import os
from typing import Dict, List, Union

import qwen_vl_utils
from datasets import load_dataset
from PIL import Image
from tqdm import tqdm
from transformers import ProcessorMixin

from ..utils.lazy_imports import qwen_vl_utils
from .base_dataset import BaseDataset


127 changes: 127 additions & 0 deletions angelslim/data/omni_dataset.py
@@ -0,0 +1,127 @@
# Copyright 2025 Tencent Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import os
from pathlib import Path
from typing import Dict, List, Union

from transformers import ProcessorMixin

from ..utils.lazy_imports import qwen_omni_utils
from .base_dataset import BaseDataset


class OmniDataset(BaseDataset):
    """Dataset for omni-modal (text, image, audio, and video) data"""

    def __init__(
        self,
        processor: ProcessorMixin,
        device: str = "cpu",
        max_length: int = 4096,
        num_samples: int = -1,
        data_source: Union[str, Dict] = None,
        is_hf_dataset: bool = False,
        use_audio_in_video: bool = False,
    ):
        super().__init__(processor, device, max_length)
        self.is_hf_dataset = is_hf_dataset
        self.use_audio_in_video = use_audio_in_video

        self._load_file_based_dataset(data_source, num_samples)

    def _load_file_based_dataset(self, data_path: str, num_samples: int):
        """Load dataset from local file system"""
        path_obj = Path(data_path)
        data_dir = path_obj.parent

        line_count = 0
        with open(data_path, "r") as f:
            for line in f:
                if num_samples > 0 and line_count >= num_samples:
                    break
                data = json.loads(line.strip())
                video_path = None
                audio_path = None
                image_path = None

                if "video_path" in data:
                    video_path = os.path.normpath(
                        os.path.join(data_dir, data["video_path"])
                    )
                if "audio_path" in data:
                    audio_path = os.path.normpath(
                        os.path.join(data_dir, data["audio_path"])
                    )
                if "image_path" in data:
                    image_path = os.path.normpath(
                        os.path.join(data_dir, data["image_path"])
                    )

                ms = data.get("messages", [])

                conversation = []
                for m in ms:
                    if m["role"] == "system":
                        conversation.append(
                            {
                                "role": "system",
                                "content": [{"type": "text", "text": m["content"]}],
                            }
                        )
                    elif m["role"] == "user":
                        content = []
                        text_content = m["content"]
                        text_content = (
                            text_content.replace("<video>", "")
                            .replace("<audio>", "")
                            .replace("<image>", "")
                        )
                        content.append({"type": "text", "text": text_content})
                        if video_path:
                            content.append({"type": "video", "video": video_path})
                        if audio_path:
                            content.append({"type": "audio", "audio": audio_path})
                        if image_path:
                            content.append({"type": "image", "image": image_path})
                        conversation.append(
                            {
                                "role": "user",
                                "content": content,
                            }
                        )
                self._process_and_append(conversation)
                line_count += 1

    def _process_and_append(self, messages: List[Dict]):
        """Process messages and append to dataset"""
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audios, images, videos = qwen_omni_utils.process_mm_info(
            messages, use_audio_in_video=self.use_audio_in_video
        )

        # Process inputs
        inputs = self.processor(
            text=text,
            images=images,
            audios=audios,
            videos=videos,
            padding=True,
            return_tensors="pt",
            use_audio_in_video=self.use_audio_in_video,
        )
        self.data.append(inputs)
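The loader above consumes one JSON object per line, with optional `video_path`/`audio_path`/`image_path` fields resolved relative to the JSONL file. A sketch of writing a conforming calibration line (paths and wording are hypothetical):

```python
import json

# One calibration sample in the shape _load_file_based_dataset parses; the
# <video>/<audio> tags in the user text are stripped by the loader itself.
sample = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "<video><audio>Describe what happens in the clip."},
    ],
    "video_path": "clips/demo.mp4",  # resolved relative to the JSONL's directory
    "audio_path": "clips/demo.wav",
}
with open("omni_calib.jsonl", "w") as f:
    f.write(json.dumps(sample) + "\n")
```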
17 changes: 15 additions & 2 deletions angelslim/engine.py
@@ -73,6 +73,7 @@ def prepare_model(
        cache_dir=None,
        deploy_backend="vllm",
        using_multi_nodes=False,
        use_audio_in_video=False,
    ) -> Any:
        """Load pretrained model and tokenizer
        Args:
@@ -116,6 +117,16 @@
                using_multi_nodes=using_multi_nodes,
            )
            self.model_path = model_path
        elif self.series in ["Omni"]:
            if not model:
                self.slim_model.from_pretrained(
                    model_path,
                    torch_dtype=torch_dtype,
                    device_map=device_map,
                    trust_remote_code=trust_remote_code,
                    use_audio_in_video=use_audio_in_video,
                )
            self.model_path = model_path
        else:
            raise ValueError(f"Unsupported series: {self.series}")

@@ -131,6 +142,7 @@ def prepare_data(
        num_samples=128,
        shuffle=True,
        inference_settings=None,
        use_audio_in_video=False,
        model_name=None,
    ) -> Optional[Any]:
        """Prepare compression dataset"""
@@ -145,7 +157,7 @@
            data_type=data_type,
            processor=(
                self.slim_model.processor
                if self.series == "VLM"
                if self.series == "VLM" or self.series == "Omni"
                else self.slim_model.tokenizer
            ),
            device=self.slim_model.model.device,
@@ -155,6 +167,7 @@
            num_samples=num_samples,
            data_source=data_path,
            inference_settings=inference_settings,
            use_audio_in_video=use_audio_in_video,
            model_name=model_name,
        )
        self.max_seq_length = max_length
@@ -187,7 +200,7 @@ def prepare_compressor(
f"Compression method '{method_name}' not registered. "
f"Available methods: {CompressorFactory.get_available_compressor()}"
)
if self.series in ["LLM", "VLM"]:
if self.series in ["LLM", "VLM", "Omni"]:
global_config.update(self.model_path, self.max_seq_length)

if default_method:
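Taken together, the new Omni plumbing can be driven end to end roughly like this — a hedged sketch in which only `prepare_model`, `prepare_data`, and `prepare_compressor` appear in this diff; the constructor arguments, the compressor name string, and any subsequent run/save calls are assumptions:

```python
from angelslim.engine import Engine

engine = Engine()  # how the "Omni" series is selected is not visible in this diff
engine.prepare_model(
    model_path="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    use_audio_in_video=False,  # new flag threaded through for the Omni series
)
engine.prepare_data(
    data_path="calib/omni_calib.jsonl",
    data_type="OmniDataset",
    max_length=4096,
    num_samples=128,
    use_audio_in_video=False,
)
engine.prepare_compressor("fp8_dynamic")  # assumed method name
```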
1 change: 1 addition & 0 deletions angelslim/models/__init__.py
@@ -15,4 +15,5 @@
from .diffusion import * # noqa: F401 F403
from .llm import * # noqa: F401 F403
from .model_factory import SlimModelFactory # noqa: F401
from .omni import * # noqa: F401 F403
from .vlm import * # noqa: F401 F403