diff --git a/examples/training/data_augment/README.md b/examples/training/data_augment/README.md
new file mode 100644
index 0000000000..19480b7315
--- /dev/null
+++ b/examples/training/data_augment/README.md
@@ -0,0 +1,158 @@
+# Training a Math Agent with Data-Augmentation Strategies
+
+This example demonstrates how to use **AgentScope-Tuner** to enhance a math problem-solving agent. We focus on leveraging **data-centric** features, such as the `difficulty_based` task selector, to improve data utility and training efficiency.
+
+## Task Setting
+
+We use the foundational [math-agent example](../react_agent/main.py) as our baseline to demonstrate the data-enhancement capabilities. Notably, these data-centric techniques are generic and customizable, making them adaptable to other agent workflows.
+
+### Agent Goal and Type
+The agent's objective is to solve mathematical reasoning problems, learning to produce a correct final answer through a step-by-step thought process. The agent is implemented as a **`ReActAgent`**, which follows a reasoning-acting loop to solve tasks iteratively.
+
+### Objective of the Data-Centric Approach
+
+Training can be inefficient if tasks are too easy or too hard. This example addresses the issue by providing **selectors** that dynamically select tasks based on **data feedback**. This empowers users to explore and implement their own data-centric strategies, such as focusing on "productively challenging" samples, to maximize training efficiency.
+
+## Dataset Preparation
+
+To enable difficulty-based sampling, the training data needs to include features that represent the "difficulty" of each task.
+
+1. **Base Dataset**: You can use any standard math problem dataset. A good example is the math data in [LLM360/guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k), which comes pre-annotated with pass rates from different LLMs, serving as direct difficulty features.
+2. **Build Your Own Features**: If you use your own dataset, you can generate these features by pre-running several models of varying capabilities and recording their pass rates. This can be done within the [**Trinity-RFT**](https://github.com/modelscope/Trinity-RFT/pull/440) framework.
+3. **Data Format**: The final dataset should be in HuggingFace format. In this example, the data is converted to the *GSM8K format* expected by the [workflow](../react_agent/main.py). Besides the task content, it must include the difficulty feature columns you have defined (e.g., `qwen2.5_7b_pass_rate`, `qwen3_30b_pass_rate`).
+4. **Example Data Preparation**: We provide a preparation script for this example. Simply execute `python prepare_data.py` to generate the required dataset.
+
+## Code Implementation
+
+### Agent Workflow & Judge Function
+
+This example follows the foundational [math-agent example](../react_agent/main.py), adopting its `run_react_agent` and `gsm8k_judge` as the `workflow_func` and `judge_func`, respectively. This highlights a key benefit: you can apply training strategies without altering your core agent logic.
+
+### Design of Data-Centric Features
+
+Leveraging the powerful data processing capabilities of **Trinity-RFT**, **AgentScope-Tuner** provides interfaces for advanced operations such as task selection and experience processing.
+
+#### Task Selector
+
+The `Task Selector` determines how samples are selected from a dataset. It can be configured directly in the YAML config, or within the `Dataset` object in a Python script.
+
+- **Built-in Selectors**:
+  - `sequential`: Samples are selected in a fixed order.
+  - `shuffle`: The dataset is shuffled at the beginning of each epoch.
+  - `random`: Samples are randomly chosen with replacement for each batch.
+  - `offline_easy2hard`: Samples are sorted by a predefined feature for curriculum learning.
+  - `difficulty_based` (customized): An adaptive selector based on task difficulty.
+
+> For more details on `Task Selector`, including how to implement a custom selector based on feedback signals, please refer to **Trinity-RFT**'s **[Selector Development Guide](https://github.com/modelscope/Trinity-RFT/blob/main/docs/sphinx_doc/source/tutorial/develop_selector.md)**.
+
+#### Data Processor
+
+The `Data Processor` allows for real-time processing of **Task** and **Experience** during training, enabling operations such as calculating feedback metrics, augmenting data, or filtering.
+
+For example, the `difficulty_based` selector requires a `pass_rate_calculator` operator to compute the agent's success rate for each task. This feedback is then used to adjust the sampling strategy.
+
+> For more details on `Data Processor`, please refer to **Trinity-RFT**'s **[Operator Development Guide](https://github.com/modelscope/Trinity-RFT/blob/main/docs/sphinx_doc/source/tutorial/develop_operator.md)**.
+
+
+### Configuring the Experiments
+
+To maintain clarity and simplicity, we recommend defining all experiment-specific parameters, including dataset paths and task selectors, within YAML configuration files.
+
+We provide two configuration files to compare the baseline `random` selector against the `difficulty_based` selector.
+
+**Experiment 1: Baseline with Random Selector (`config_random.yaml`)**
+
+In `config_random.yaml`, we configure the `task_selector` for random sampling under `buffer.explorer_input.taskset`.
+
+```yaml
+# In config_random.yaml
+buffer:
+  # ...
+  explorer_input:
+    taskset: # Training data
+      path: "path/to/your/augmented/math_data"
+      split: "train"
+      task_selector:
+        selector_type: random # Strategy of task selection
+```
+
+**Experiment 2: Advanced Training with Difficulty-Based Selector (`config_difficulty.yaml`)**
+
+In `config_difficulty.yaml`, we switch the `task_selector` to `difficulty_based` and provide its specific parameters. Note that this config also enables the `pass_rate_calculator` needed for feedback.
+
+```yaml
+# In config_difficulty.yaml
+
+# Enable the calculator to provide feedback for the selector
+data_processor:
+  experience_pipeline:
+    operators:
+      - name: pass_rate_calculator
+
+buffer:
+  # ...
+  explorer_input:
+    taskset: # Training data
+      path: "path/to/your/augmented/math_data"
+      split: "train"
+      task_selector:
+        selector_type: difficulty_based # Strategy of task selection
+        feature_keys: [ "qwen2.5_7b_pass_rate", "qwen3_30b_pass_rate" ]
+        kwargs: # Hyper-parameters for the selection algorithm
+          m: 8
+          # ...
+```
+
+> The `difficulty_based` selector in this example is an implementation of the ***BOTS*** algorithm. For details on its inner workings, please refer to the [***BOTS paper***](https://arxiv.org/abs/2510.26374) and its [***tutorials***](https://github.com/modelscope/Trinity-RFT/blob/main/examples/bots/README.md).
+
+## How to Run
+
+### Step 1: Prerequisites
+
+Ensure you have installed **AgentScope** and **Trinity-RFT** by following [the guidance](../react_agent/README.md).
+
+### Step 2: Prepare the Dataset
+
+Run the data preparation script. Make sure to update the dataset paths in `config_random.yaml` and `config_difficulty.yaml` afterward.
+ +```bash +python prepare_data.py +``` + +### Step 3: Start Ray Cluster + +For distributed training, start a Ray cluster. + +```bash +# For single node +ray start --head +``` + +### Step 4: Run Training + +You can now run either the baseline or the difficulty-based training experiment. + +- **To run the baseline experiment with a random selector:** + +```bash +python main.py --config config_random.yaml +``` + +- **To run the experiment with the difficulty-based selector:** +```bash +python main.py --config config_difficulty.yaml +``` + +## Experimental Results + +The following results compare the performance of the `difficulty-based` selection strategy (red line, bots) against a standard `random` selection strategy (black line, random). + +![Training Result Image](./training_result.jpg) + +### Training Reward Curve + +The chart on the left shows the rollout accuracy during training. As can be seen, the tasks sampled by the random strategy appear to be difficult for the model, with the accuracy remaining below 0.2. In contrast, using the difficulty selector results in a higher mean accuracy, indicating that the agent is engaging with more tasks that it can successfully solve. + +### Evaluation on AIME-24 + +For comparison, we evaluated both selection strategies on the AIME-24 benchmark. The chart on the right shows that the difficulty-based method demonstrates a better upward trend in performance over time. diff --git a/examples/training/data_augment/config_difficulty.yaml b/examples/training/data_augment/config_difficulty.yaml new file mode 100644 index 0000000000..422dca8a0c --- /dev/null +++ b/examples/training/data_augment/config_difficulty.yaml @@ -0,0 +1,74 @@ +project: "Data-Augmentation" # Project name +name: "Difficulty-Based-Selector" # Experiment name +checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} # Directory to save model checkpoints + +data_processor: + experience_pipeline: + operators: + - name: pass_rate_calculator # Calculate average reward and pass it back to selector + +buffer: + total_epochs: 1 # Total training epochs + explorer_input: + taskset: + path: "path/to/your/augmented/math_data" # Training data path + split: "train" # Training data split + task_selector: + selector_type: difficulty_based # Strategy of task selection + feature_keys: [ "qwen2.5_7b_pass_rate", "qwen3_30b_pass_rate" ] # Utilized pass_rate key + kwargs: # Hyperparameter from [BOTS](https://github.com/modelscope/Trinity-RFT/blob/main/examples/bots/README.md) + m: 8 + lamb: 0.1 + rho: 0.1 + target_reward: 0.8 + tau: 0 + do_sample: true + eval_tasksets: + - name: "eval-aime24" # Evaluation data name + path: "path/to/aime24_data" # Evaluation data path + split: "test" # Evaluation data split + +synchronizer: + sync_style: dynamic_by_explorer # Sync triggered dynamically by explorer + sync_method: 'nccl' + sync_interval: 4 # Sync every N steps + sync_timeout: 7200 # Timeout for synchronization (seconds) + +monitor: + monitor_type: tensorboard # Can also use wandb, mlflow or swanlab + +# The config below has been set in python file + +algorithm: + algorithm_type: multi_step_grpo # GRPO series for multi-step scenario + repeat_times: 8 # Number of rollouts per prompt for advantage estimation + optimizer: + lr: 1e-6 # Learning rate + +model: + model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen3-0.6B} # Base model path + max_model_len: 24576 # Max context length + max_response_tokens: 16384 # Max tokens per response + temperature: 1.0 # Temperature of model's generation + +cluster: + 
node_num: 1 # Number of used nodes + gpu_per_node: 8 # Number of GPUs every node + +explorer: + eval_interval: 20 # Evaluation every N steps + runner_per_model: 16 # Runners per infer engine + max_timeout: 1200 # Max timeout for each rollout (seconds) + rollout_model: + engine_num: 4 # Number of vLLM engines for rollout model + tensor_parallel_size: 1 # TP size per engine for rollout model + enable_openai_api: true # Enable OpenAI-compatible API + enable_history: true # Enable conversation history + enable_auto_tool_choice: true # Enable automatic tool selection + tool_call_parser: hermes # Parser for tool calls + reasoning_parser: deepseek_r1 # Parser for reasoning type + +trainer: + save_interval: 100 # Save checkpoint every N steps + use_dynamic_bsz: true # Use dynamic batch size + ulysses_sequence_parallel_size: 1 # Sequence parallel size for Ulysses \ No newline at end of file diff --git a/examples/training/data_augment/config_random.yaml b/examples/training/data_augment/config_random.yaml new file mode 100644 index 0000000000..f80d524e62 --- /dev/null +++ b/examples/training/data_augment/config_random.yaml @@ -0,0 +1,62 @@ +project: "Data-Augmentation" # Project name +name: "Random-Selector" # Experiment name +checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints} # Directory to save model checkpoints + +# Config of data-centric experiments +buffer: + total_epochs: 1 # Total training epochs + explorer_input: + taskset: + path: "path/to/your/augmented/math_data" # Training data path + split: "train" # Training data split + task_selector: + selector_type: random # Strategy of task selection + eval_tasksets: + - name: "eval-aime24" # Evaluation data name + path: "path/to/aime24_data" # Evaluation data path + split: "test" # Evaluation data split + +synchronizer: + sync_style: dynamic_by_explorer # Sync triggered dynamically by explorer + sync_method: 'nccl' + sync_interval: 4 # Sync every N steps + sync_timeout: 7200 # Timeout for synchronization (seconds) + +monitor: + monitor_type: tensorboard # Can also use wandb, mlflow or swanlab + +# The config below has been set in python file + +algorithm: + algorithm_type: multi_step_grpo # GRPO series for multi-step scenario + repeat_times: 8 # Number of rollouts per prompt for advantage estimation + optimizer: + lr: 1e-6 # Learning rate + +model: + model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen3-0.6B} # Base model path + max_model_len: 24576 # Max context length + max_response_tokens: 16384 # Max tokens per response + temperature: 1.0 # Temperature of model's generation + +cluster: + node_num: 1 # Number of used nodes + gpu_per_node: 8 # Number of GPUs every node + +explorer: + eval_interval: 20 # Evaluation every N steps + runner_per_model: 16 # Runners per infer engine + max_timeout: 1200 # Max timeout for each rollout (seconds) + rollout_model: + engine_num: 4 # Number of vLLM engines for rollout model + tensor_parallel_size: 1 # TP size per engine for rollout model + enable_openai_api: true # Enable OpenAI-compatible API + enable_history: true # Enable conversation history + enable_auto_tool_choice: true # Enable automatic tool selection + tool_call_parser: hermes # Parser for tool calls + reasoning_parser: deepseek_r1 # Parser for reasoning type + +trainer: + save_interval: 100 # Save checkpoint every N steps + use_dynamic_bsz: true # Use dynamic batch size + ulysses_sequence_parallel_size: 1 # Sequence parallel size for Ulysses \ No newline at end of file diff --git a/examples/training/data_augment/main.py 
b/examples/training/data_augment/main.py
new file mode 100644
index 0000000000..942de083c9
--- /dev/null
+++ b/examples/training/data_augment/main.py
@@ -0,0 +1,149 @@
+# -*- coding: utf-8 -*-
+"""Example of training a ReAct math-agent with a configurable task selector."""
+import argparse
+from typing import Dict
+
+from agentscope.tuner import (
+    tune,
+    Dataset,
+    WorkflowOutput,
+    JudgeOutput,
+    TunerChatModel,
+    Algorithm,
+)
+from agentscope.agent import ReActAgent
+from agentscope.formatter import OpenAIChatFormatter
+from agentscope.message import Msg
+
+
+async def run_react_agent(
+    task: Dict,
+    model: TunerChatModel,
+    auxiliary_models: Dict[str, TunerChatModel],
+) -> WorkflowOutput:
+    """A simple workflow function using the ReAct agent to solve tasks.
+
+    Args:
+        task (Dict): The task to be solved.
+        model (TunerChatModel): The language model to use.
+        auxiliary_models (Dict[str, TunerChatModel]):
+            A dictionary of additional chat models available for
+            LLM-as-a-Judge. Not used in this workflow.
+
+    Returns:
+        WorkflowOutput: The workflow output containing the agent's response.
+    """
+    assert (
+        len(auxiliary_models) == 0
+    ), "No auxiliary models are used in this workflow."
+
+    sys_prompt = (
+        "You are an agent specialized in solving math problems with tools. "
+        "Please solve the math problem given to you. You can write and "
+        "execute Python code to perform calculation or verify your answer. "
+        "You should return your final answer within \\boxed{}."
+    )
+    agent = ReActAgent(
+        name="react_agent",
+        sys_prompt=sys_prompt,
+        model=model,
+        enable_meta_tool=True,
+        formatter=OpenAIChatFormatter(),
+    )
+    response = await agent.reply(
+        msg=Msg("user", task["question"], role="user"),
+    )
+    return WorkflowOutput(
+        response=response,
+    )
+
+
+async def gsm8k_judge(
+    task: Dict,
+    response: Msg,
+    auxiliary_models: Dict[str, TunerChatModel],
+) -> JudgeOutput:
+    """A simple judge function to calculate reward based on the agent's response.
+
+    Args:
+        task (Dict): The task information for the corresponding workflow.
+        response (Msg): The response generated by the corresponding workflow.
+        auxiliary_models (Dict[str, TunerChatModel]):
+            A dictionary of additional chat models available for LLM-as-a-Judge
+            usage. The keys are model names, and the values are the
+            corresponding TunerChatModel instances.
+
+    Returns:
+        JudgeOutput: The judge output containing the reward and metrics.
+    """
+    from trinity.common.rewards.math_reward import MathBoxedRewardFn
+
+    assert (
+        len(auxiliary_models) == 0
+    ), "No auxiliary models are used in this workflow."
+
+    reward_fn = MathBoxedRewardFn()
+    # parse truth from gsm8k raw text
+    truth = task["answer"]
+    if isinstance(truth, str) and "####" in truth:
+        truth = truth.split("####")[1].strip()
+    else:
+        truth = str(truth)
+    # parse answer from response message
+    result = response.get_text_content()
+    reward_dict = reward_fn(
+        response=result,
+        truth=truth,
+    )
+    return JudgeOutput(
+        reward=sum(reward_dict.values()),
+        metrics=reward_dict,
+    )
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Train math-agent with different task selectors"
+    )
+    parser.add_argument(
+        "--config",
+        type=str,
+        default="config_random.yaml",
+        help="Path to the configuration YAML file",
+    )
+    args = parser.parse_args()
+
+    # You can optionally configure the dataset and task selector via Python
+    # Dataset objects, but we recommend using YAML for data-centric experiments.
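+    # As a reference, here is a commented-out sketch of the Python-side
+    # equivalent of config_difficulty.yaml, followed below by the simpler
+    # random selector. The task_selector keys are assumed to mirror the YAML
+    # fields (selector_type, feature_keys, kwargs); treat the exact values as
+    # illustrative only.
+    #
+    # train_dataset = Dataset(
+    #     path="path/to/your/augmented/math_data",
+    #     split="train",
+    #     task_selector={
+    #         "selector_type": "difficulty_based",
+    #         "feature_keys": ["qwen2.5_7b_pass_rate", "qwen3_30b_pass_rate"],
+    #         "kwargs": {"m": 8, "lamb": 0.1, "rho": 0.1,
+    #                    "target_reward": 0.8, "tau": 0},
+    #     },
+    # )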
+ + # train_dataset = Dataset( + # path="path/to/your/augmented/math_data", + # split="train", + # task_selector={ + # 'selector_type': 'random', + # }, + # ) + + tuner_model = TunerChatModel( + model_path="Qwen/Qwen3-0.6B", + max_model_len=24576, + max_tokens=16384, + temperature=1.0, + inference_engine_num=4, + tensor_parallel_size=1, + ) + + algorithm = Algorithm( + algorithm_type="multi_step_grpo", + group_size=8, + learning_rate=1e-6, + eval_interval_steps=20, + batch_size=16, + ) + + tune( + workflow_func=run_react_agent, + judge_func=gsm8k_judge, + config_path=args.config, + model=tuner_model, + algorithm=algorithm, + ) \ No newline at end of file diff --git a/examples/training/data_augment/prepare_data.py b/examples/training/data_augment/prepare_data.py new file mode 100644 index 0000000000..f2873a8597 --- /dev/null +++ b/examples/training/data_augment/prepare_data.py @@ -0,0 +1,136 @@ +# -*- coding: utf-8 -*- +""" +Prepare math data from LLM360/guru-RL-92k +Transfer to the GSM8K Format +""" + +import argparse +import sys +from pathlib import Path +import pandas as pd +from huggingface_hub import hf_hub_download + +# Define constants for the dataset +DATASET_REPO = "LLM360/guru-RL-92k" +DATASET_FILE = "train/math__combined_54.4k.parquet" + + +# Download the dataset from Hugging Face Hub. +# The dataset is from LLM360/guru-RL-92k. +def download_dataset(repo_id: str, filename_in_repo: str, local_dir: str) -> Path: + print(f"--- Downloading dataset: {repo_id} ---") + print(f"File: {filename_in_repo}") + + local_path = Path(local_dir) + local_path.mkdir(parents=True, exist_ok=True) + + try: + downloaded_file_path = hf_hub_download( + repo_id=repo_id, + filename=filename_in_repo, + repo_type="dataset", + local_dir=local_path, + ) + print(f"Successfully downloaded to: {downloaded_file_path}") + return Path(downloaded_file_path) + except Exception as e: + print(f"Error downloading dataset: {e}", file=sys.stderr) + sys.exit(1) + + +# Transform a single row from the original format to the target format. +def transform_row(row: pd.Series) -> pd.Series: + try: + original_question = row['prompt'][0]['content'] + sentence_to_remove = "Please output the final answer within \\boxed{}." + question = original_question.replace(sentence_to_remove, "").strip() + + ground_truth = row['reward_model']['ground_truth'] + answer = f"#### {ground_truth}" + + rate_7b = row.get('qwen2.5_7b_pass_rate') + rate_30b = row.get('qwen3_30b_pass_rate') + + return pd.Series({ + "question": question, + "answer": answer, + "qwen2.5_7b_pass_rate": rate_7b, + "qwen3_30b_pass_rate": rate_30b + }) + except (TypeError, IndexError, KeyError) as e: + print(f"Skipping row due to processing error: {e}. Row content: {row.to_dict()}", file=sys.stderr) + return pd.Series({ + "question": None, + "answer": None, + "qwen2.5_7b_pass_rate": None, + "qwen3_30b_pass_rate": None + }) + + +# Read, transform, and save the dataset to a new location. 
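+# The saved parquet keeps four columns: "question", "answer" (in the GSM8K
+# "#### <ground truth>" form), plus the two pass-rate features
+# ("qwen2.5_7b_pass_rate", "qwen3_30b_pass_rate") consumed by the
+# difficulty_based task selector.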
+def transform_and_save_dataset(input_file: Path, output_dir: str): + output_path = Path(output_dir) + output_path.mkdir(parents=True, exist_ok=True) + output_file_path = output_path / input_file.name + + print(f"--- Reading source file: {input_file} ---") + try: + df_original = pd.read_parquet(input_file) + print(f"Successfully read {len(df_original)} records.") + except Exception as e: + print(f"Fatal error reading file: {e}", file=sys.stderr) + sys.exit(1) + + print("--- Starting data transformation ---") + df_transformed = df_original.apply(transform_row, axis=1) + + original_count = len(df_transformed) + df_transformed.dropna(subset=['question', 'answer'], inplace=True) + dropped_count = original_count - len(df_transformed) + if dropped_count > 0: + print(f"Warning: Dropped {dropped_count} invalid records due to processing errors.") + + print(f"Transformation complete. {len(df_transformed)} valid records generated.") + + print(f"--- Saving processed file to: {output_file_path} ---") + try: + df_transformed.to_parquet(output_file_path, index=False) + print(f"Process complete! New file saved at: {output_file_path}") + except Exception as e: + print(f"Fatal error saving file: {e}", file=sys.stderr) + sys.exit(1) + + +def main(): + parser = argparse.ArgumentParser( + description="Download and transform the guru-RL-92k math dataset." + ) + parser.add_argument( + "--raw_data_dir", + type=str, + default="data/train/raw", + help="Directory to download the raw dataset file." + ) + parser.add_argument( + "--processed_data_dir", + type=str, + default="data/train/math", + help="Directory to save the transformed dataset file." + ) + + args = parser.parse_args() + + downloaded_file = download_dataset( + repo_id=DATASET_REPO, + filename_in_repo=DATASET_FILE, + local_dir=args.raw_data_dir + ) + + transform_and_save_dataset( + input_file=downloaded_file, + output_dir=args.processed_data_dir + ) + + +if __name__ == "__main__": + main() diff --git a/examples/training/data_augment/training_result.jpg b/examples/training/data_augment/training_result.jpg new file mode 100644 index 0000000000..8f17147da8 Binary files /dev/null and b/examples/training/data_augment/training_result.jpg differ diff --git a/src/agentscope/tuner/_config.py b/src/agentscope/tuner/_config.py index f60e36d4f0..fb954aef90 100644 --- a/src/agentscope/tuner/_config.py +++ b/src/agentscope/tuner/_config.py @@ -30,6 +30,7 @@ def to_trinity_config( TasksetConfig, load_config, InferenceModelConfig, + TaskSelectorConfig, ) auto_config = False @@ -56,6 +57,10 @@ def to_trinity_config( split=train_dataset.split, subset_name=train_dataset.name, ) + if train_dataset.task_selector is not None: + config.buffer.explorer_input.taskset.task_selector = TaskSelectorConfig( + **train_dataset.task_selector + ) else: config.buffer.explorer_input.taskset.path = train_dataset.path config.buffer.explorer_input.taskset.split = train_dataset.split @@ -102,6 +107,8 @@ def to_trinity_config( workflow_args=workflow_args, ), ) + for eval_taskset in config.buffer.explorer_input.eval_tasksets: + eval_taskset.workflow_args = workflow_args if algorithm is not None: config.algorithm.algorithm_type = algorithm.algorithm_type config.algorithm.repeat_times = algorithm.group_size diff --git a/src/agentscope/tuner/_dataset.py b/src/agentscope/tuner/_dataset.py index dd816c2d68..61911ff864 100644 --- a/src/agentscope/tuner/_dataset.py +++ b/src/agentscope/tuner/_dataset.py @@ -1,7 +1,7 @@ # -*- coding: utf-8 -*- """Dataset definition for tuner.""" from itertools import 
islice -from typing import Optional, List +from typing import Optional, List, Dict, Any from pydantic import BaseModel, Field @@ -34,6 +34,10 @@ class Dataset(BaseModel): ), default=None, ) + task_selector: Optional[Dict[str, Any]] = Field( + description=("Configuration for the task selector."), + default=None, + ) def preview(self, n: int = 5) -> List: """Preview the dataset information.