Conversation

@yayashuxue (Contributor) commented Sep 19, 2025

Summary

Integrates Tongyi's DeepResearch ReAct agent into rLLM for academic benchmarks (HLE). Provides universal model support with automatic adaptation for any OpenAI-compatible API.

Key Features

Agent Implementation

  • MultiTurnReactAgent: Full Tongyi ReAct loop with a hybrid approach
    • Native OpenAI function calling for models that support it (e.g., o3)
    • XML <tool_call> format fallback for other models (e.g., GPT-4o); see the parsing sketch after this list
    • Works with any OpenAI-compatible API (OpenAI, Together AI, vLLM, etc.)
  • Automatic parameter mapping: Handles model-specific requirements seamlessly
  • Accurate token counting: Uses API response tokens for precise context management
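
For models without native function calling, the fallback parses Tongyi's XML format out of the raw completion. A minimal sketch of that parser (function and regex names here are illustrative, not the exact implementation):

```python
import json
import re

# Sketch of the XML <tool_call> fallback: the model emits
# <tool_call>{"name": ..., "arguments": {...}}</tool_call> as plain text,
# and the call is recovered with a regex instead of the function-calling API.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_call(text: str):
    """Return (name, arguments) if the completion contains a tool call."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None
    call = json.loads(match.group(1))
    return call["name"], call.get("arguments", {})
```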

Production-Ready Tools

  • Search: Web search via Serper API with Google Custom Search fallback (see the sketch after this list)
  • Scholar: Google Scholar search through Serper API
  • Visit: Web page content extraction with BeautifulSoup
  • FileParser: Multi-format support (TXT, JSON, CSV, PDF, DOCX)
  • PythonInterpreter: Secure code execution with 50s timeout
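
To make the tool behavior concrete, here is roughly what the Serper-backed Search call looks like (a sketch assuming SERPER_API_KEY is set in the environment; the real tool adds the Google Custom Search fallback and result formatting):

```python
import os
import requests

# Minimal sketch of the Serper-backed web search.
def serper_search(query: str, num_results: int = 10) -> list[dict]:
    resp = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": os.environ["SERPER_API_KEY"]},
        json={"q": query, "num": num_results},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("organic", [])  # organic results: title/link/snippet
```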

Evaluation Pipeline

  • HLE benchmark support with parallel execution
  • Configurable judge model with binary yes/no scoring (judge sketch below)
  • Current accuracy: 26.67% with o3 on 15 HLE samples (consistent with reported HLE difficulty)
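
The judge step is roughly the following (a hedged sketch: prompt wording, judge model, and client setup are illustrative, not the exact pipeline):

```python
from openai import OpenAI

client = OpenAI()

# Sketch of the binary judge: ask a judge model whether the prediction
# matches the reference answer and map its reply to a boolean score.
def judge(question: str, reference: str, prediction: str) -> bool:
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Does the model answer match the reference? Reply with only yes or no."
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # the judge model is configurable in the real script
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")
```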

Technical Highlights

  • Universal compatibility: Works with any OpenAI-compatible model
  • Automatic adaptation: Detects and handles model-specific requirements
  • Parallel execution: Concurrent task processing via AgentWorkflowEngine
  • Episode format: Outputs rLLM Episodes for training pipeline integration

Usage

```bash
# Basic evaluation (auto-detects model capabilities)
python examples/deepresearch/evaluate_hle.py --max-samples 10

# With different models
python examples/deepresearch/evaluate_hle.py --model gpt-4o
python examples/deepresearch/evaluate_hle.py --model o3-mini
python examples/deepresearch/evaluate_hle.py --model gpt-3.5-turbo

# Using Together AI models
python examples/deepresearch/evaluate_hle.py \
    --model meta-llama/Llama-3-70b-chat-hf \
    --base-url https://api.together.xyz/v1

# Parallel evaluation
python examples/deepresearch/evaluate_hle.py --parallel-tasks 8 --max-samples 100
```

Files Added

  • examples/deepresearch/deepresearch_agent.py - Core ReAct agent with hybrid support
  • examples/deepresearch/deepresearch_tools.py - Full tool implementations
  • examples/deepresearch/deepresearch_workflow.py - rLLM workflow wrapper
  • examples/deepresearch/evaluate_hle.py - HLE evaluation pipeline
  • examples/deepresearch/README.md - Documentation
  • examples/deepresearch/ALIGNMENT_ANALYSIS.md - Tongyi alignment analysis

Enhanced Core Components

  • rllm/engine/rollout/openai_engine.py - Adaptive parameter compatibility
  • rllm/engine/agent_workflow_engine.py - Improved parallel execution support

- Port original DeepResearch ReAct agent to work with rLLM's OpenAI engine
- Implement workflow wrapper for AgentWorkflowEngine compatibility
- Add real web search via Serper API (same as original DeepResearch)
- Support multi-turn reasoning with tool calling and trajectory tracking
- Enable parallel execution and RL-ready episode generation
- Preserve 95% of original DeepResearch logic and reasoning patterns
- Support OpenAI, Together AI, and custom vLLM model endpoints

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@yayashuxue changed the base branch from main to v0.2 on September 19, 2025 05:06
@yayashuxue (Contributor, Author)

@jeffreysijuntan please review it

yayashuxue and others added 11 commits September 29, 2025 22:50
Key fixes:
- Replace GPT-2 tokenizer with API token consumption tracking to fix context limit errors
- Fix infinite loops caused by incorrect token counting (was using 1024 limit for 128k models)
- Use actual API response.prompt_tokens and response.completion_tokens for accurate tracking

Improvements:
- Add comprehensive HLE evaluation script with judge-based scoring
- Update README to accurately reflect tool implementation status (Scholar/Visit are placeholders)
- Apply ruff linting and formatting to all files
- Clean up verbose debug prints while keeping useful status indicators
- Add better error handling and timeout management

The token counting issue was causing false "context exceeded" errors at ~13k tokens when
models actually support 128k. This led to incorrect message truncation and infinite loops
where the model would repeat the same response.
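
In sketch form, the fix reads usage from the API response instead of re-tokenizing (client, model, and messages are assumed from the surrounding agent loop; truncate_history is a hypothetical helper):

```python
CONTEXT_LIMIT = 128_000   # the model's actual window, not a hardcoded 1024
SAFETY_MARGIN = 4_000     # illustrative headroom for the next turn

response = client.chat.completions.create(model=model, messages=messages)
# Track real usage from the API instead of a local GPT-2 tokenizer estimate.
used = response.usage.prompt_tokens + response.usage.completion_tokens

if used > CONTEXT_LIMIT - SAFETY_MARGIN:
    messages = truncate_history(messages)  # hypothetical truncation helper
```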

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
All tools are now fully functional with real implementations:
- Search & Scholar: Use Serper API for Google/Scholar search (ported from Tongyi)
- Visit: Fetches and parses webpages with requests/BeautifulSoup
- FileParser: Enhanced to support TXT, JSON, CSV, PDF (PyPDF2), DOCX (python-docx)
- PythonInterpreter: Safe execution environment with timeout (already working)

The tools were ported directly from the original Tongyi DeepResearch implementation
to provide production-ready functionality instead of placeholders. This enables
the agent to perform real research tasks with actual web search, paper lookup,
webpage analysis, and multi-format file parsing capabilities.
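
For reference, the Visit tool boils down to this pattern (a sketch; the ported implementation adds length limits and error handling):

```python
import requests
from bs4 import BeautifulSoup

# Minimal sketch of the Visit tool: fetch a page and strip it to text.
def visit(url: str, max_chars: int = 20_000) -> str:
    resp = requests.get(url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop non-content markup
    return soup.get_text(separator="\n", strip=True)[:max_chars]
```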

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…ng models

- Auto-detect and fix unsupported API parameters via error parsing
- Automatically remap max_tokens -> max_completion_tokens for o3/o1/gpt-5
- Remove unsupported sampling params (temperature, top_p, presence_penalty, etc.)
- Cache parameter fixes to avoid repeated warnings (log once per engine instance)
- Support future OpenAI models without code changes (try-catch-adapt pattern)
- Allow up to 10 parameter adjustments per request for reasoning models

This enables seamless usage of reasoning models (o3, o1, gpt-5, future models)
in rLLM workflows without manual parameter configuration.
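
The try-catch-adapt pattern is roughly the following (a sketch assuming the openai Python client; the error parsing in openai_engine.py is more thorough):

```python
from openai import BadRequestError

# Sketch of the try-catch-adapt loop: send the request and, if the API
# rejects a parameter, remap or drop it and retry (bounded attempts).
def create_with_adaptation(client, max_fixes: int = 10, **params):
    for _ in range(max_fixes):
        try:
            return client.chat.completions.create(**params)
        except BadRequestError as err:
            msg = str(err)
            if "max_tokens" in params and "max_completion_tokens" in msg:
                # e.g. o3/o1/gpt-5 expect max_completion_tokens instead
                params["max_completion_tokens"] = params.pop("max_tokens")
                continue
            for key in ("temperature", "top_p", "presence_penalty",
                        "frequency_penalty"):
                if key in msg and key in params:
                    params.pop(key)  # drop the unsupported sampling param
                    break
            else:
                raise  # error was not about a removable parameter
    raise RuntimeError("exceeded parameter adjustment budget")
```
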
- Fix token counter not resetting between tasks (caused early context limit)
- Fix Python tool missing exception classes in restricted environment
- Add scipy submodule support for scientific computing
- Fix o3 model handling when outputting both tool_call and answer
- Process tool calls before checking for answers to support o3 behavior
- Add better truncation for base64 images and long outputs
- Improve error handling in evaluation rating parsing

These fixes significantly improve evaluation quality and consistency.
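
The o3 ordering fix amounts to checking for tool calls before accepting an answer; illustratively (helper names are hypothetical, parse_tool_call as sketched earlier):

```python
# o3 can emit a <tool_call> and an <answer> in the same turn, so tool
# calls are handled first and the answer is only accepted afterwards.
def step(completion: str, messages: list) -> str | None:
    call = parse_tool_call(completion)
    if call is not None:
        observation = run_tool(*call)  # hypothetical tool dispatcher
        messages.append({"role": "user", "content": observation})
        return None  # keep looping; ignore any premature <answer>
    if "<answer>" in completion:
        return completion.split("<answer>")[-1].split("</answer>")[0].strip()
    return None
```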

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Major changes:
1. Vision Support (multimodal images):
   - Added image handling in evaluate_hle.py extract_qa function
   - Modified deepresearch_workflow.py to pass images to agent
   - Updated deepresearch_agent.py to construct multimodal messages with image_url
   - Images are sent as base64 data URLs to vision-capable models (e.g., gpt-4o)
   - No changes needed to OpenAIEngine (natively supports multimodal messages)

2. Alignment Documentation:
   - Added ALIGNMENT_ANALYSIS.md with detailed comparison to Tongyi's DeepResearch
   - Updated README.md with source alignment mapping table

3. Code Cleanup:
   - Removed original reference files (react_agent_original.py, tool_*_original.py)
   - These were kept for reference but are now documented in ALIGNMENT_ANALYSIS.md
   - Added hle_outputs/* and intermediate files to .gitignore

Vision support enables the agent to process HLE questions with images (e.g., chess boards)
without requiring external file parsing, directly leveraging GPT-4o's vision capabilities.
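
The multimodal messages follow the standard OpenAI content-list format; in sketch form (the base64 payload comes from the HLE dataset, and the media type may vary):

```python
import base64

# Sketch of the multimodal message construction: images are embedded as
# base64 data URLs alongside the question text.
def build_user_message(question: str, image_bytes: bytes | None) -> dict:
    content = [{"type": "text", "text": question}]
    if image_bytes is not None:
        b64 = base64.b64encode(image_bytes).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return {"role": "user", "content": content}
```
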
…ve unused run_deepresearch_eval.py; print context limit once; align judge output & metrics
…acks; keep aligned with agent/workflow changes
@yayashuxue changed the title from "Feature/deepresearch integration" to "Feat: deepresearch integration" on Oct 6, 2025
@@ -0,0 +1,260 @@
# DeepResearch Integration for rLLM
Contributor:

Do we have an official score running the model on HLE?

Contributor Author:

Do you mean the Tongyi model? I don't have the model spun up, but if we do we can run the full HLE and get the score. For GPT o3 on 15 samples we got 26.7% on HLE.

yayashuxue and others added 2 commits October 7, 2025 20:27
…t directory

- Simplified unsupported parameter handling in OpenAIEngine from 210 to 132 lines
- Removed complex parse_openai_error_for_unsupported_param function and duplicate code
- Extracted common logic into single _fix_unsupported_param helper method
- Fixed HLE evaluation script to always output to examples/deepresearch/hle_outputs/
- Ensures outputs go to gitignored location regardless of where script is run

This addresses reviewer feedback about overly complex error handling with code duplication.
Tested with GPT-4o and O3-mini models.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Remove .ruff.toml (not needed, project uses global config)
- Remove ALIGNMENT_ANALYSIS.md (internal development notes)

These were temporary files used during development to track
alignment with Tongyi's original implementation.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>