β MarcoPolo Team β
Alibaba International Digital Commerce
Github π€ Hugging Face π Paper π§βπ» Model ποΈ Data π½οΈ Demo
π― Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and codingβwhich are well-suited for reinforcement learning (RL)βbut also places greater emphasis on open-ended resolutions. We aim to address the question: "Can the o1 model effectively generalize to broader domains where clear standards are absent and rewards are challenging to quantify?"
Currently, Marco-o1 Large Language Model (LLM) is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and innovative reasoning strategiesβoptimized for complex real-world problem-solving tasks.
Figure 1: A classic 'strawberry' question reasoned by our Marco-o1 model: "How many 'r' are in strawberry". Although the answer is correct, the final letter 'y' is overlooked during CoT. This is an interesting finding, which is discussed in issue AIDC-AI#3.
Currently, our work is distinguished by the following highlights:
- π Fine-Tuning with CoT Data: We develop Marco-o1-CoT by performing full-parameter fine-tuning on the base model using open-source CoT dataset combined with our self-developed synthetic data.
- π Solution Space Expansion via MCTS: We integrate LLMs with MCTS (Marco-o1-MCTS), using the model's output confidence to guide the search and expand the solution space.
- π Reasoning Action Strategy: We implement novel reasoning action strategies and a reflection mechanism (Marco-o1-MCTS mini-step), including exploring different action granularities within the MCTS framework and prompting the model to self-reflect, thereby significantly enhancing the model's ability to solve complex problems.
- π Application in Translation Tasks: We are the first to apply Large Reasoning Models (LRM) to Machine Translation task, exploring inference time scaling laws in the multilingual and translation domain.
-
[Coming Soon] π Reward Models: We are working on training reward models, including Outcome Reward Modeling (ORM) and Process Reward Modeling (PRM), to provide a more accurate reward signal for MCTS. A more precise reward function will help reduce randomness in tree search results and improve overall performance.
-
[Coming Soon] π Reinforcement Learning: We are conducting reinforcement learning training to develop better reasoning models. By utilizing RL techniques, we aim to refine the model's decision-making processes and further enhance its problem-solving abilities.
- [2024/11/13] π₯ We released Marco-o1. This initial release includes our reasoning model, optimized for complex problem-solving and versatile applications across various domains.
OpenAI recently introduced the groundbreaking o1 model, renowned for its exceptional reasoning capabilities. This model has demonstrated outstanding performance on platforms such as AIME and CodeForces, surpassing other leading models. Inspired by this success, we aimed to push the boundaries of LLMs even further, enhancing their reasoning abilities to tackle complex, real-world challenges.
π Marco-o1 leverages advanced techniques like CoT fine-tuning, MCTS, and Reasoning Action Strategies to enhance its reasoning power. As shown in Figure 2, by fine-tuning Qwen2-7B-Instruct with a combination of the filtered Open-O1 CoT dataset, Marco-o1 CoT dataset, and Marco-o1 Instruction dataset, Marco-o1 improved its handling of complex tasks. MCTS allows exploration of multiple reasoning paths using confidence scores derived from softmax-applied log probabilities of the top-k alternative tokens, guiding the model to optimal solutions. Moreover, our reasoning action strategy involves varying the granularity of actions within steps and mini-steps to optimize search efficiency and accuracy.
π As shown in Figure 3, Marco-o1 achieved accuracy improvements of +6.17% on the MGSM (English) dataset and +5.60% on the MGSM (Chinese) dataset, showcasing enhanced reasoning capabilities.
π Additionally, in translation tasks, we demonstrate that Marco-o1 excels in translating slang expressions, such as translating "θΏδΈͺιζ₯ζθΈ©ε±ζ" (literal translation: "This shoe offers a stepping-on-poop sensation.") to "This shoe has a comfortable sole," demonstrating its superior grasp of colloquial nuances.
To enhance the reasoning capabilities of the Marco-o1 model, we employed a SFT strategy using a variety of datasets.
-
π Open-O1 CoT Dataset (Filtered): We refined the Open-O1 project's CoT Dataset by applying heuristic and quality filtering processes. This enhancement allowed the model to adopt structured reasoning patterns effectively.
-
π Marco-o1 CoT Dataset (Synthetic): We generated the Marco-o1 CoT Dataset using MCTS, which helped to formulate complex reasoning pathways, further bolstering the model's reasoning capabilities.
-
π Marco Instruction Dataset: Recognizing the critical role of robust instruction-following capabilities in executing complex tasks, we incorporated a set of instruction-following data. This integration ensures the model remains competent across a wide range of tasks, maintaining its general effectiveness while significantly boosting its reasoning flair.
Dataset | #Samples |
---|---|
Open-O1 CoT Dataset (Filtered) | 45,125 |
Marco-o1 CoT Dataset (Synthetic) | 10,000 |
Marco Instruction Dataset | 5,141 |
Total | 60,266 |
π₯ Marco Reasoning Dataset (Our Partial Dataset)
We integrated LLMs with MCTS to enhance the reasoning capabilities of our Marco-o1 model:
- π Nodes as Reasoning States: In the MCTS framework, each node represents a reasoning state of the problem-solving process.
- π Actions as LLM Outputs: The possible actions from a node are the outputs generated by the LLM. These outputs represent potential steps or mini-steps in the reasoning chain.
- π Rollout and Reward Calculation: During the rollout phase, the LLM continues the reasoning process to a terminal state.
- π Guiding MCTS: This reward score
$R$ is used to evaluate and select promising paths within the MCTS, effectively guiding the search towards more confident and reliable reasoning chains.
Furthermore, we obtain the value of each state by computing a confidence score using the following formulas:
-
Confidence Score (
$c_i$ ):For each token
$t_i$ generated during the rollout, we calculate its confidence score by applying the softmax function to its log probability and the log probabilities of the top 5 alternative tokens. This is given by:$$c_i = \frac{\exp(p(t_i))}{\sum_{k=1}^{5} \exp(p(t_k))}$$ where:
-
$c_i$ is the confidence score for the$i^{th}$ token in the rollout. -
$p(t_i)$ is the log probability of the$i^{th}$ token generated by the LLM. -
$p(t_k)$ for$k = 1$ to$5$ are the log probabilities of the top 5 predicted tokens at the$i^{th}$ step. -
$n$ is the total number of tokens in the rollout sequence.
This equation ensures that the confidence score reflects the relative probability of the chosen token compared to the top alternatives, effectively normalizing the scores between 0 and 1.
-
-
Reward Score (
$v$ ):After obtaining the confidence scores for all tokens in the rollout sequence, we compute the average confidence score across all tokens to derive the overall reward score:
$$v = \frac{1}{n} \sum_{i=1}^{n} c_i$$ where:
$v$ is the overall reward score for the rollout path.This average serves as the reward signal that evaluates the quality of the reasoning path taken during the rollout. A higher
$v$ indicates a more confident and likely accurate reasoning path.
By employing this method, we effectively expand the solution space, allowing the model to explore a vast array of reasoning paths and select the most probable ones based on calculated confidence scores.
We observed that using actions as the granularity for MCTS search is relatively coarse, often causing the model to overlook nuanced reasoning paths crucial for solving complex problems. To address this, we explored different levels of granularity in the MCTS search. Initially, we used steps as the unit of search. To further expand the model's search space and enhance its problem-solving capabilities, we experimented with dividing these steps into smaller units of 64 or 32 tokens, referred to as "mini-step." This finer granularity allowed the model to explore reasoning paths in greater detail. While token-level search offers theoretical maximum flexibility and granularity, it is currently impractical due to the significant computational resources required and the challenges associated with designing an effective reward model at this level.
In our experiments, we implemented the following strategies within the MCTS framework:
-
π step as Action: We allowed the model to generate complete reasoning steps as actions. Each MCTS node represents an entire thought or action label. This method enables efficient exploration but may miss finer-grained reasoning paths essential for complex problem-solving.
-
π mini-step as Action: We used mini-steps of 32 or 64 tokens as actions. This finer granularity expands the solution space and improves the model's ability to navigate complex reasoning tasks by considering more nuanced steps in the search process. By exploring the solution space at this level, the model is better equipped to find correct answers that might be overlooked with larger action units.
We introduced a reflection mechanism by adding the phrase "Wait! Maybe I made some mistakes! I need to rethink from scratch." at the end of each thought process. This prompts the model to self-reflect and reevaluate its reasoning steps. Implementing this reflection has yielded significant improvements, especially on difficult problems that the original model initially solved incorrectly. With the addition of reflection, approximately half of these challenging problems were answered correctly.
From the self-critic perspective, this approach allows the model to act as its own critic, identifying potential errors in its reasoning. By explicitly prompting the model to question its initial conclusions, we encourage it to re-express and refine its thought process. This self-critical mechanism leverages the model's capacity to detect inconsistencies or mistakes in its own output, leading to more accurate and reliable problem-solving. The reflection step serves as an internal feedback loop, enhancing the model's ability to self-correct without external intervention.
Based on π‘ Qwen2-7B-Instruct, we performed SFT using our training data to create π‘ Marco-o1-CoT. Besides, we employed Marco-o1-CoT within the framework of MCTS tree search, differentiating by actions:
- π‘ Marco-o1-MCTS (step): using each inference step as an action (step).
- π‘ Marco-o1-MCTS (mini-step of 64 tokens): using a 64-token mini-step as an action (64 tokens).
- π‘ Marco-o1-MCTS (mini-step of 64 tokens): using a 32-token mini-step as an action (32 tokens).
During testing, each model utilized a CoT prompt to ensure consistency in reasoning processes. We then tested these configurations on the English (En) and Chinese (Zh) subsets of the MGSM dataset, obtaining the following results:
Model | MGSM-En (Acc.) | MGSM-Zh (Acc.) |
---|---|---|
Qwen2-7B-Instruct | 84.00% | 76.80% |
Marco-o1-CoT | 85.60% | 71.20% |
Marco-o1-MCTS (step) | 90.40% | 80.00% |
Marco-o1-MCTS (mini-step of 64 tokens) | 88.40% | 80.40% |
Marco-o1-MCTS (mini-step of 32 tokens) | 87.60% | 82.40% |
π₯ Marco-o1-CoT (Our Lastest Model)
π¬ These results demonstrate that:
-
Performance of Marco-o1-CoT vs. Qwen2-7B-Instruct:
- In the MGSM-en dataset, Marco-o1-CoT shows an advantage over Qwen2-7B-Instruct, as shown in Figure 4, which is expected due to the fine-tuning with English CoT data.
- In the MGSM-zh dataset, however, Marco-o1-CoT exhibits a decrease in performance compared to Qwen2-7B-Instruct. This decline is attributed to the fact that the CoT data used for fine-tuning was in English, which may not transfer effectively to the Chinese dataset.
-
Impact of MCTS Integration:
- The three MCTS-enhanced models demonstrate improvements over Marco-o1-CoT, indicating that incorporating MCTS helps to expand the model's solution space and increase the probability of obtaining correct answers.
- However, since we use the Confidence Score as the reward, the tree search results exhibit significant randomness. In MGSM-en, the "step as Action" strategy performs the best, while in MGSM-zh, the "mini-step as Action (32)" strategy yields the highest accuracy.
- Currently, as shown in Figure 5-6, we cannot draw definitive conclusions about which action strategy is superior. We believe that as the reward becomes more accurate, the larger solution space provided by MCTS will demonstrate greater potential.
Figure 4: MCTS Expands the Solution Space for Correct Answers. Comparison between Marco-o1-CoT (left) and Marco-o1-MCTS (step) (right) on the MGSM dataset. While Marco-o1-CoT failed to provide the correct answer, integrating MCTS with step-level actions allowed the model to explore a broader solution space, increasing the likelihood of arriving at the correct solution.
Figure 5: Finer Granularity with mini-steps Enhances Problem-Solving. Comparison between Marco-o1-MCTS (step) (left) and Marco-o1-MCTS (mini-step of 32 tokens) (right) on the MGSM dataset. The step-level action strategy did not yield the correct answer, but by using a finer-grained mini-step of 32 tokens, the model successfully navigated the solution space to find the correct answer, demonstrating the effectiveness of increased action granularity.
Figure 6: Optimal Action Granularity Depends on Problem Complexity. Comparison between Marco-o1-MCTS (mini-step of 64 tokens) (left) and Marco-o1-MCTS (step) (right) on the MGSM dataset. The model with a mini-step of 64 tokens failed to find the correct answer, whereas using step-level actions enabled the model to correctly solve the problem. This highlights that we cannot draw definitive conclusions about which action strategy is superior. We believe that as the reward becomes more accurate, the larger solution space provided by MCTS will demonstrate greater potential.
Furthermore, we use Test@N to denote the percentage of problems solved correctly at least once when allowing the model to make N separate guesses for each problem. We evaluated solve rates at Test@1, Test@8, and Test@32. The results demonstrate that MCTS shows an advantage with a lower number of separate guesses (Test@1). This reveals the potential of MCTS. In future work, we plan to train the reward model (RM) in conjunction with MCTS to continue optimizing our approach.
These results demonstrate the effectiveness of our approach in enhancing the reasoning capabilities of the model across different languages and configurations.
We have also conducted some open-ended tests on our models, such as translation issues and achieved some positive results. In the future, we will continue to explore other areas and improve the model's related performance.
Figure 7: Translation comparison of a colloquial expression βItβs so beautiful that itβs captivating, the upper part has a distinctly Korean style, the soft and fluffy material is perfectly thick, and itβs complemented by a base layer, creating a unique and everyday-wear outfitβ.
Figure 8: Translation comparison of a colloquial expression βItβs so beautiful! And itβs so cheap, super straight and doesnβt curl. Buy it, buy it!β.
π₯ Marco-o1-CoT (Our Lastest Model)
π₯ Marco Reasoning Dataset (Our Partial Dataset)
To install Marco-o1, follow these steps:
# Clone the repository
git clone https://github.com/AIDC-AI/Marco-o1
# Change to the Macaw-LLM directory
cd Marco-o1
# Install required packages
pip install -r requirements.txt
-
Load Marco-o1-CoT model:
# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("AIDC-AI/Marco-o1") model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Marco-o1")
-
Inference:
Execute the inference script (you can give any customized inputs inside):
./src/talk_with_model.py # Use vLLM ./src/talk_with_model_vllm.py
-
Deploy using FastAPI:
Check the README.md file in examples folder.
From MarcoPolo Team, AI Business, Alibaba International Digital Commerce:
- Yu Zhao
- Huifeng Yin
- Hao Wang
- Longyue Wang
If you find Marco-o1 useful for your research and applications, please cite:
@misc{zhao2024marcoo1openreasoningmodels,
title={Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions},
author={Yu Zhao and Huifeng Yin and Bo Zeng and Hao Wang and Tianqi Shi and Chenyang Lyu and Longyue Wang and Weihua Luo and Kaifu Zhang},
year={2024},
eprint={2411.14405},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.14405},
}
This project is licensed under Apache License Version 2 (SPDX-License-identifier: Apache-2.0).
We used compliance checking algorithms during the training process, to ensure the compliance of the trained model and dataset to the best of our ability. Due to complex data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.