Marco Search Agent: Towards Real‑World and Challenging Agentic Search

🍓 Alibaba International Digital Commerce 🍓

📝 HSCodeComp Paper 📝 DeepWideSearch Paper 🤗 HSCodeComp Dataset 🤗 DeepWideSearch Dataset

🎯 Marco-Search-Agent introduces two challenging agent benchmarks that expose critical gaps in current AI systems across two fundamental dimensions:

HSCodeComp: Evaluates hierarchical rule application in e-commerce, e.g., "What is the correct 10-digit HSCode for this silicone medical bracelet?" This benchmark tests the ability to apply complex, ambiguous rules with embedded hierarchical decision logic (e.g., tariff, legal, and medical manuals) in domain-specific applications.
DeepWideSearch: Evaluates deep-and-wide information seeking, e.g., "List all second-tier suppliers of Apple's AirPods, with contact info, location, and certification status." This benchmark requires agents to discover a large volume of candidates through wide-scale exploration while also performing deep, multi-hop retrieval and reasoning for each candidate.

These benchmarks reveal fundamental gaps between current AI agents and human experts in critical yet underexplored dimensions of real-world applications.

🔥 News

[2025-10] 🔥 We released Marco-Search-Agent. This initial release includes two challenging benchmarks for cutting-edge agent systems: DeepWideSearch and HSCodeComp.

📦 Included Benchmarks

📑 HSCodeComp

Evaluating Advanced Agent Systems on Hierarchical Rule Application in E-Commerce Domain

Task: Predict the 10-digit Harmonized System (HS) code from noisy product listings using official tariff rules.
Size: 632 expert-annotated products
Domains: 27 HS chapters, 32 e-commerce categories
Key Challenge: Hierarchical rules contain vague language and implicit decision logic.
Human Performance: 95.0% (10-digit accuracy)
Best AI (SmolAgent + GPT-5 VLM): 46.8%

💡 Reveals that even top-performing agents fail at complex hierarchical rule application, a core skill in many important vertical domains such as law, medicine, customs, and taxation.
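
The 10-digit accuracy reported above can be read as an exact-match score over predicted codes. The snippet below is a minimal sketch of that idea, not the repository's scorer: the function names `normalize_hs_code` and `prefix_accuracy`, the separator handling, and the sample codes are all assumptions made for illustration. Because HS codes are hierarchical (2-digit chapter, 4-digit heading, 6-digit subheading, then national extensions), the same helper can also score shorter prefixes to show at which level a prediction goes wrong.

```python
import re


def normalize_hs_code(raw: str) -> str:
    """Keep only digits, so '3926.90.99.10' and '3926909910' compare equal."""
    return re.sub(r"\D", "", raw)


def prefix_accuracy(predictions, references, digits=10):
    """Share of items whose first `digits` digits match the reference code."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = 0
    for pred, ref in zip(predictions, references):
        pred_code = normalize_hs_code(pred)
        ref_code = normalize_hs_code(ref)
        # A code shorter than `digits` can never count as a match.
        if len(pred_code) < digits or len(ref_code) < digits:
            continue
        if pred_code[:digits] == ref_code[:digits]:
            hits += 1
    return hits / len(references)


# Illustrative values only; these are not taken from the dataset.
preds = ["3926.90.99.10", "6117809500", "9018390000"]
refs = ["3926909910", "6117801000", "9018390000"]

print("10-digit accuracy:", prefix_accuracy(preds, refs, digits=10))
print("6-digit accuracy:", prefix_accuracy(preds, refs, digits=6))
```

In this toy example the second prediction agrees with its reference through the 6-digit subheading but diverges afterwards, so it counts at the 6-digit level and not at the 10-digit level.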

🌐 DeepWideSearch

Evaluating Advanced Agent Systems on Deep-and-Wide Agentic Information Seeking

Task: Answer complex queries by producing structured tables (entities × attributes).
Size: 220 multi-hop, multi-attribute questions (English & Chinese)
Avg. Output: 414 information units per answer
Avg. Reasoning Depth: 4.21 steps
Best AI (WebSailor + Claude Sonnet 4): 2.39% Success Rate

💡 Shows that even the best-performing agent reaches only a 2.39% success rate, and at a high inference cost.
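
To make the entities × attributes output format and the strictness of a table-level success metric concrete, here is a small sketch. It assumes that Success Rate is all-or-nothing (a question counts only if the predicted table matches the reference exactly) and that an "information unit" is one table cell; neither definition is confirmed by this README, so check the DeepWideSearch README and eval code for the actual scoring. The `Table` type, the helper names, and the supplier rows are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: entity name -> {attribute name -> value}
Table = Dict[str, Dict[str, str]]


def _norm(text: str) -> str:
    """Case-fold and collapse whitespace before comparing cells."""
    return " ".join(text.split()).casefold()


def to_units(table: Table) -> Set[Tuple[str, str, str]]:
    """Flatten a table into (entity, attribute, value) information units."""
    return {
        (_norm(entity), _norm(attr), _norm(value))
        for entity, row in table.items()
        for attr, value in row.items()
    }


def is_success(predicted: Table, reference: Table) -> bool:
    """All-or-nothing check: every unit present and nothing extra."""
    return to_units(predicted) == to_units(reference)


def success_rate(predictions: List[Table], references: List[Table]) -> float:
    """Fraction of questions whose whole table matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    wins = sum(is_success(p, r) for p, r in zip(predictions, references))
    return wins / len(references)


# Hypothetical rows for illustration only.
reference = {
    "Supplier A": {"location": "Shenzhen", "certified": "yes"},
    "Supplier B": {"location": "Suzhou", "certified": "no"},
}
partial = {
    "Supplier A": {"location": "Shenzhen", "certified": "yes"},
    "Supplier B": {"location": "Suzhou"},  # one missing cell
}

print(len(to_units(reference)), "information units in the reference")
print("success rate:", success_rate([reference, partial], [reference, reference]))
```

Under this reading, a single missing or wrong cell fails the whole question, which is one way to see why answers averaging hundreds of information units are so hard to get fully right.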

⚡️ Released Resources

Dataset	Hugging Face	GitHub
HSCodeComp	🤗 AIDC-AI/HSCodeComp	HSCodeComp/data
DeepWideSearch	🤗 AIDC-AI/DeepWideSearch	DeepWideSearch/data
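
As a convenience, the Hugging Face copies can probably be pulled with the `datasets` library. The sketch below assumes that both repositories load with their default configuration through `datasets.load_dataset`, which this README does not confirm; if loading fails, fall back to the files under each project's `data/` folder. The `HF_REPOS` mapping and the `load_benchmark` helper are illustrative names, not part of the repository.

```python
# Repository IDs as listed in the table above.
HF_REPOS = {
    "HSCodeComp": "AIDC-AI/HSCodeComp",
    "DeepWideSearch": "AIDC-AI/DeepWideSearch",
}


def load_benchmark(name: str):
    """Download one benchmark from the Hugging Face Hub (needs network access)."""
    if name not in HF_REPOS:
        raise KeyError(f"unknown benchmark {name!r}; expected one of {sorted(HF_REPOS)}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the default configuration is sufficient for these repositories.
    return load_dataset(HF_REPOS[name])
```

Calling `load_benchmark("HSCodeComp")` should return a dataset object whose splits and columns you can inspect with `print` before wiring it into an agent.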

🚀 Quick Start

Repository Structure

Marco-Search-Agent/
├── HSCodeComp/
│   ├── data/
│   ├── assets/
│   ├── eval/
│   ├── LICENSE
│   ├── NOTICE
│   └── README.md
├── DeepWideSearch/
│   ├── data/
│   ├── assets/
│   ├── eval/
│   ├── scripts/
│   ├── LICENSE
│   ├── NOTICE
│   ├── requirements.txt
│   └── README.md
├── assets/
├── LICENSE
└── README.md

Please refer to the HSCodeComp and DeepWideSearch README files for installation and usage details.

Evaluate Your Agent

Each benchmark has its own evaluation entry point; see the README in the corresponding project for details.

For HSCodeComp: Use HSCodeComp/eval/test_llm.py to score 10-digit HSCode predictions.
For DeepWideSearch: Use DeepWideSearch/scripts/batch_eval.sh.

👨🏻‍💻 Acknowledgements

The main contributors are from AI Business, Alibaba International Digital Commerce. You can contact us through Tian Lan or Longyue Wang.

HSCodeComp thanks the human tariff experts for their meticulous annotation (paid more than $34 per hour).
DeepWideSearch builds upon the open-source WideSearch framework by ByteDance-Seed. We gratefully acknowledge their pioneering work and MIT-licensed codebase.

🛡️ License

This project is licensed under the Apache-2.0 License.

⚠️ DISCLAIMER

Our datasets are constructed from publicly accessible data sources. For instance, HSCodeComp uses product data from real e-commerce platforms, while DeepWideSearch is built upon the BrowseComp, BrowseComp-ZH, and WideSearch datasets. Due to the complexity of these tasks and the diverse nature of the underlying data, we cannot guarantee that our datasets are completely free of copyright issues or improper content. If you believe anything infringes on your rights or contains improper content, please contact us, and we will promptly address the matter.
