Conversation

@RuBing-Yang (Collaborator) commented Nov 3, 2025

This pull request removes the custom chat template management system from the data pipeline and simplifies how chat templates are applied during dataset preparation. The changes streamline the code by relying on the tokenizer's built-in chat template handling and update the data processing logic to compute the loss mask per message based on role, rather than by string pattern matching on the formatted text. The related command-line argument and imports for chat template selection are also removed.

Data Pipeline Simplification:

  • Removed the entire chat_templates.py module, including all classes and functions for managing chat templates.
  • Refactored DatasetBuilder and DatasetManager in online_dataset.py to eliminate dependencies on chat template types and external template management. Chat templates are now applied directly via the tokenizer, and loss masks are computed per message based on role, as sketched below. [1] [2] [3] [4] [5]
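
A minimal sketch of the new flow (the loop body mirrors the diff excerpt quoted in the review below; the wrapper function name and the final concatenation step are assumptions, since this summary does not show the surrounding code):

import torch

def build_tokens_and_loss_mask(tokenizer, messages):
    # Tokenize each message independently via the tokenizer's built-in
    # chat template; train only on assistant tokens.
    token_chunks, mask_chunks = [], []
    for message in messages:
        tokens = tokenizer.apply_chat_template(
            [message],
            tokenize=True,
            add_generation_prompt=False,
            return_tensors="pt",
        ).squeeze(0)
        if message["role"] in ["system", "user"]:
            mask_chunks.append(torch.zeros_like(tokens))  # no loss
        else:  # assistant
            mask_chunks.append(torch.ones_like(tokens))  # loss on all tokens
        token_chunks.append(tokens)
    return torch.cat(token_chunks), torch.cat(mask_chunks)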

Training Script Adjustments:

  • Removed the --chat_template_type argument and all related logic from train_eagle3_online.py, as chat template selection is no longer necessary. [1] [2] [3]

Model Code Minor Cleanup:

  • Removed the unused TransformersKwargs import and relaxed the type of **kwargs in the forward method of modeling_qwen3_kv.py. [1] [2]

@RuBing-Yang requested a review from Copilot November 3, 2025 11:11

Copilot AI left a comment

Pull Request Overview

This PR removes the custom chat template system and simplifies the dataset processing by relying on the tokenizer's built-in apply_chat_template functionality. The change eliminates the need for maintaining custom chat template configurations and regex-based loss mask creation.

  • Removed the entire chat_templates.py module and its associated CLI argument
  • Refactored dataset processing to use tokenizer's apply_chat_template on a per-message basis
  • Simplified loss mask creation by applying masks based on message roles instead of using regex pattern matching on the formatted text
  • Removed unused TransformersKwargs import

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File | Description
tools/train_eagle3_online.py | Removed the --chat_template_type CLI argument and its usage in dataset creation
angelslim/compressor/speculative/train/data/online_dataset.py | Refactored message processing to use per-message tokenization and role-based loss masking
angelslim/compressor/speculative/train/data/chat_templates.py | Deleted the entire custom chat template module
angelslim/compressor/speculative/inference/models/eagle3/target/modeling_qwen3_kv.py | Removed the unused TransformersKwargs import from type hints
Comments suppressed due to low confidence (1)

angelslim/compressor/speculative/train/data/online_dataset.py:1

  • Corrected the print message by removing the extra space before 'datasets' and simplifying the message structure.
# Copyright 2025 Tencent Inc. All Rights Reserved.


Comment on lines +100 to +112
for message in messages:
    message_tokens = self.tokenizer.apply_chat_template(
        [message],
        tokenize=True,
        add_generation_prompt=False,
        return_tensors="pt",
    ).squeeze(0)

    # Determine the loss mask based on the role
    if message["role"] in ["system", "user"]:
        mask = torch.zeros_like(message_tokens)
    else:  # assistant
        mask = torch.ones_like(message_tokens)

Copilot AI Nov 3, 2025

The per-message tokenization approach may produce incorrect results. Calling apply_chat_template on individual messages wrapped in a list does not guarantee the same tokenization as applying it to the full conversation at once. Chat templates often have context-dependent formatting (e.g., different handling for first vs. subsequent messages, or special tokens between messages) that will be lost when processing messages independently. This could lead to incorrect token boundaries and potentially malformed chat formatting.
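
One quick way to probe this concern is to compare full-conversation tokenization against concatenated per-message tokenization (a hedged sketch; the checkpoint name is illustrative, and any tokenizer with a chat template works):

from transformers import AutoTokenizer

# Illustrative checkpoint; substitute the tokenizer actually used in training.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]

full = tok.apply_chat_template(messages, tokenize=True)
per_message = sum((tok.apply_chat_template([m], tokenize=True) for m in messages), [])
# If the template injects a default system prompt, BOS tokens, or
# first-message-specific formatting on every call, these will differ.
print(full == per_message)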

Comment on lines +101 to +112
message_tokens = self.tokenizer.apply_chat_template(
    [message],
    tokenize=True,
    add_generation_prompt=False,
    return_tensors="pt",
).squeeze(0)

# Determine the loss mask based on the role
if message["role"] in ["system", "user"]:
    mask = torch.zeros_like(message_tokens)
else:  # assistant
    mask = torch.ones_like(message_tokens)

Copilot AI Nov 3, 2025

The loss mask now includes special tokens and chat template formatting tokens for assistant messages, which is likely incorrect. The old implementation used offset mapping to identify only the actual response content, excluding the assistant header (e.g., <|im_start|>assistant\n). The new approach will apply loss to all tokens in the assistant message including template-specific tokens, which may degrade training quality by forcing the model to learn to predict formatting tokens rather than just content.

Suggested change
message_tokens = self.tokenizer.apply_chat_template(
    [message],
    tokenize=True,
    add_generation_prompt=False,
    return_tensors="pt",
).squeeze(0)

# Determine the loss mask based on the role
if message["role"] in ["system", "user"]:
    mask = torch.zeros_like(message_tokens)
else:  # assistant
    mask = torch.ones_like(message_tokens)

# Get the formatted string for the message
formatted_str = self.tokenizer.apply_chat_template(
    [message],
    tokenize=False,
    add_generation_prompt=False,
)
# Tokenize with offsets
encoding = self.tokenizer(
    formatted_str,
    return_tensors="pt",
    return_offsets_mapping=True,
)
message_tokens = encoding["input_ids"].squeeze(0)
offsets = encoding["offset_mapping"].squeeze(0)
# Determine the loss mask based on the role
if message["role"] in ["system", "user"]:
    mask = torch.zeros_like(message_tokens)
else:  # assistant
    # Find the span of the assistant's response in the formatted string.
    # Heuristic: locate the first occurrence of message["content"] in formatted_str.
    content = message.get("content", "")
    start_idx = formatted_str.find(content) if content else -1
    if start_idx == -1:
        # Empty content, or content not found verbatim in the formatted string
        mask = torch.zeros_like(message_tokens)
    else:
        end_idx = start_idx + len(content)
        # Set mask to 1 for tokens whose offsets fall within the content span
        mask = torch.tensor(
            [
                1
                if (offset[0] >= start_idx and offset[1] <= end_idx and offset[0] < offset[1])
                else 0
                for offset in offsets.tolist()
            ],
            dtype=torch.long,
        )
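
For reference, a common alternative that avoids the per-message boundary problem is to render growing conversation prefixes and diff their token lengths, so the chat template always sees the full context (a hedged sketch, not part of this PR; it assumes the rendered prefix of a conversation is a literal prefix of the full rendering, which holds for most templates, and it still places loss on assistant header tokens unless those are trimmed separately):

import torch

def prefix_diff_loss_mask(tokenizer, messages):
    # Tokenize prefixes [m0], [m0, m1], ... and attribute the newly
    # appended tokens to the message that produced them.
    ids = tokenizer.apply_chat_template(messages, tokenize=True)
    mask = [0] * len(ids)
    prev_len = 0
    for i, message in enumerate(messages):
        cur_len = len(
            tokenizer.apply_chat_template(messages[: i + 1], tokenize=True)
        )
        if message["role"] == "assistant":
            mask[prev_len:cur_len] = [1] * (cur_len - prev_len)
        prev_len = cur_len
    return torch.tensor(ids), torch.tensor(mask)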

@RuBing-Yang closed this Nov 4, 2025