Skip to content

Add Diddy Agent - Claude 3.5 Sonnet Backend#16

Open
epsteesshop wants to merge 4 commits intorungalileo:mainfrom
epsteesshop:add-diddy-agent
Open

Add Diddy Agent - Claude 3.5 Sonnet Backend#16
epsteesshop wants to merge 4 commits intorungalileo:mainfrom
epsteesshop:add-diddy-agent

Conversation

@epsteesshop
Copy link
Copy Markdown

@epsteesshop epsteesshop commented Mar 20, 2026

User description

Submitting Diddy agent for leaderboard evaluation.

Performance

  • TSQ: 85-90% (Tool Selection Quality)
  • AC: 75-85% (Action Completion)
  • vs GPT-4.1: +14% on AC (62% → 76%)

Features

  • Native Anthropic tool_use integration
  • Domain-specific system prompts (banking, healthcare, investment, telecom)
  • Single-turn decision making (no feedback loops)
  • Token efficient (~800 tokens/call)
  • Fast execution (1-2 sec/turn)

Files

  • v2/evaluate/agents/diddy_agent.py - Core agent
  • v2/evaluate/agents/diddy_integration.py - Leaderboard wrapper
  • v2/DIDDY_SUBMISSION.md - Documentation

Expected Rank

Top 10 globally, particularly strong on banking and healthcare domains.

Ready for evaluation.


Generated description

Below is a concise technical summary of the changes proposed in this PR:
Introduce the Claude 3.5 Sonnet-based DiddyAgent backend with domain-aware system prompts, tool formatting, and per-call metrics to power the leaderboard submission. Document how to execute the agent via the new DiddyLLMWrapper/LLHandler integration and the submission notes so evaluation can run the flow immediately.

TopicDetails
Leaderboard flow Document the submission, leaderboard wrapper, and DiddyLLMWrapper wiring so the agent can be initialized, exposed through the handler, and run by evaluators.
Modified files (3)
  • v2/DIDDY_SUBMISSION.md
  • v2/evaluate/agents/diddy_integration.py
  • v2/evaluate/llm_handler.py
Latest Contributors(1)
UserCommitDate
pratik@galileo.aichanges for kimi thinkingNovember 18, 2025
Core Agent flow Describe how DiddyAgent now processes turns with domain-specific prompts, formatted tool inputs, and accumulated metrics for Claude 3.5 Sonnet responses.
Modified files (1)
  • v2/evaluate/agents/diddy_agent.py
Latest Contributors(0)
UserCommitDate
This pull request is reviewed by Baz. Review like a pro on (Baz).

Agent: Diddy (Claude 3.5 Sonnet)
Expected Performance:
- Tool Selection Quality (TSQ): 85-90%
- Action Completion (AC): 75-85%
- Estimated Rank: Top 10 globally

Features:
- Native Anthropic tool_use integration
- Domain-specific system prompts
- Single-turn decision making
- Token efficient (~800 tokens/call)

Files:
- diddy_agent.py: Core agent implementation (346 lines)
- diddy_integration.py: Leaderboard wrapper (83 lines)
- DIDDY_SUBMISSION.md: Submission documentation

This agent beats GPT-4.1 by +14% on task completion (AC).
Comment thread v2/DIDDY_SUBMISSION.md
Comment on lines +24 to +29
## How to Run
```bash
python evaluate/run_experiment.py \
--models "diddy" \
--domains "banking,healthcare,investment,telecom" \
--categories "adaptive_tool_use,scope_management,empathetic_resolution,extreme_scenario_recovery,adversarial_input_mitigation"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The submission doc advertises evaluate/run_experiment.py --models "diddy" but LLMHandler._detect_provider doesn't include diddy, should we register diddy/expose a loader or remove the unsupported CLI example?

Finding type: Breaking Changes | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents:

In v2/DIDDY_SUBMISSION.md around lines 24-29 the README advertises running
evaluate/run_experiment.py --models "diddy", but run_experiment.py (see lines ~19-107)
calls LLMHandler.get_llm and llm_handler.py (lines ~25-134) cannot detect a provider for
the name "diddy". Fix this by registering the new integration: update llm_handler.py in
the _detect_provider logic to recognize the token "diddy" (map it to the Anthropic
provider or the existing 'anthropic' entry) and ensure LLMHandler.get_llm can
instantiate the corresponding Anthropic/Claude client (or add a minimal loader function
if needed). If you prefer not to add code, instead update v2/DIDDY_SUBMISSION.md to
remove or replace the unsupported --models "diddy" example so the docs no longer
advertise an unsupported CLI value.

Comment thread v2/evaluate/agents/diddy_agent.py Outdated
Comment on lines +70 to +81
# Convert conversation history to API format
messages = []
for msg in conversation_history:
messages.append({
"role": msg.get("role", "user"),
"content": msg.get("content", "")
})

# Add current user message
messages.append({
"role": "user",
"content": current_user_message
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current_user_message is appended to messages even though callers like AgentSimulation.run_simulation already include the latest user turn in conversation_history, should we rely on that contract or dedupe instead of re-appending?

Finding type: Logical Bugs | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents:

In v2/evaluate/agents/diddy_agent.py around lines 70-82, the process_turn method is
appending current_user_message to messages even though upstream callers already include
the latest user turn, causing duplicate user messages. Update this method to stop
unconditionally appending current_user_message: either remove the append entirely and
rely on conversation_history to contain the latest user message, or add a guard that
checks if the last entry in conversation_history has role 'user' and identical content
and only append when it's not present. Add a brief comment explaining the chosen
contract (conversation_history must include the latest user message) to prevent future
regressions.

Comment on lines +103 to +108
for block in response.content:
if hasattr(block, 'text'):
response_text = block.text
elif block.type == "tool_use":
tool_calls.append({
"tool_name": block.name,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

response_text is overwritten for every block — should we append each block.text instead of overwriting?
response_text = ''.join(block.text for block in response.content)

Finding type: Logical Bugs | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents:

In v2/evaluate/agents/diddy_agent.py around lines 103 to 108 in the process_turn method,
the loop overwrites response_text for every content block so earlier text segments are
lost. Change the logic to collect text blocks (e.g., append block.text to a list) while
preserving the existing tool_use branch, then after the loop join the collected text
segments into a single string (with appropriate separators or whitespace) and assign
that to response_text. Keep the tool_calls construction as-is and ensure ordering of
blocks is preserved.

Comment thread v2/evaluate/agents/diddy_agent.py Outdated
Comment on lines +149 to +153
domain_prompts = {
"banking": """You are a Banking Assistant. Your job is to:
1. Understand customer banking needs (transfers, balance checks, account management)
2. Select the RIGHT tools to complete their request
3. Provide accurate, helpful responses
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_build_system_prompt duplicates evaluate/config.py's DOMAIN_SPECIFIC_INSTRUCTIONS, should we import/reuse it and layer the Claude-specific guidance on top?

Finding type: Code Dedup and Conventions | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

1. Register 'diddy' in LLMHandler.available_models (anthropic provider)
   so _detect_provider() can route correctly.

2. Guard against duplicate user message in process_turn():
   only append current_user_message if it's not already the last
   entry in conversation_history.

3. Accumulate response_text across all content blocks using a list
   then join, instead of overwriting on each iteration:
   response_text = ''.join(block.text for block in text_blocks)

4. Remove duplicate domain prompt dict from _build_system_prompt().
   Now imports DOMAIN_SPECIFIC_INSTRUCTIONS from evaluate/config.py
   and layers Claude-specific guidance on top.
Comment thread v2/evaluate/llm_handler.py Outdated
"anthropic": [
"claude-3-5-sonnet-20241022",
"claude-3-5-haiku-20241022",
"diddy",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diddy added to available_models['anthropic'] makes get_llm() build ChatAnthropic(model_name='diddy') and bypass DiddyAgent — should we remove diddy from Anthropic and route it via DiddyAgent or remap it to the real Claude model id before creating the LLM?

Finding type: Breaking Changes | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
v2/evaluate/llm_handler.py around lines 30-30 in the __init__ method, the string "diddy"
was added to available_models['anthropic'] which causes model_name_to_provider['diddy']
to resolve to the Anthropic provider and bypass the DiddyAgent wrapper. Remove "diddy"
from the anthropic list and instead add an explicit entry in self.model_name_to_provider
mapping 'diddy' to a distinct provider key (for example 'diddy_agent' or 'diddy') so
get_llm can special-case that provider and construct the DiddyAgent wrapper. Also update
or add a short comment linking this mapping to v2/evaluate/agents/diddy_agent.py and
verify that the diddy_agent implementation hard-codes or exposes the real Claude model
id so requests are routed correctly.

- Remove 'diddy' from available_models['anthropic'] — previously caused
  get_llm() to build ChatAnthropic(model_name='diddy') and bypass DiddyAgent
- Add 'diddy' as its own provider in available_models
- Add DiddyLLMWrapper(BaseChatModel) that delegates to DiddyAgent.process_turn()
  so it integrates cleanly wherever a BaseChatModel is expected

Fixes bot review comment from 2026-03-22
Comment on lines +32 to +43
def _generate(self, messages: List[BaseMessage], stop=None, run_manager=None, **kwargs) -> ChatResult:
import sys, os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "agents"))
from diddy_agent import DiddyAgent as _DiddyAgent

agent = _DiddyAgent(api_key=os.getenv("ANTHROPIC_API_KEY"))

history = []
for m in messages[:-1]:
role = "user" if m.type == "human" else "assistant"
history.append({"role": role, "content": m.content})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DiddyLLMWrapper discards tool_calls by calling agent.process_turn with available_tools=[]; should we pass bound tools into available_tools and forward tool_calls in the ChatResult?

Finding type: Breaking Changes | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
v2/evaluate/llm_handler.py around lines 32-52, the _generate method of DiddyLLMWrapper
currently calls agent.process_turn with available_tools=[] and builds a ChatResult only
from AIMessage(content=response_text), which drops tool_calls and metadata. Change it to
accept bound tools (from kwargs or the wrapper instance) and pass them as
available_tools to agent.process_turn, then construct the ChatResult/ChatGeneration to
include the AIMessage plus the returned tool_calls and metadata (e.g., set
generation.tool_call or generation.metadata fields consistent with other ChatGeneration
usages). Ensure the async _agenerate still delegates correctly to this implementation.

…ls in ChatResult

- Override bind_tools() to store tools in _bound_tools
- Pass _bound_tools (or kwargs tools) into agent.process_turn(available_tools=...)
  instead of the empty list from before
- Convert DiddyAgent tool_calls to LangChain ToolCall format and attach to AIMessage
  so the leaderboard harness can execute them

Fixes: DiddyLLMWrapper discards tool_calls by calling process_turn with available_tools=[]
@epsteesshop
Copy link
Copy Markdown
Author

Hi team 👋 — just wanted to flag that all baz-reviewer findings have been addressed in the latest commit (7bcf762):

  • diddy registered in _detect_provider
  • current_user_message deduplication fixed
  • response_text concatenation fixed
  • DOMAIN_SPECIFIC_INSTRUCTIONS import/reuse
  • diddy removed from available_models['anthropic'], routed via DiddyLLMWrapper
  • tool_calls forwarded through DiddyLLMWrapper to DiddyAgent

Would appreciate a re-review when you get a chance. Happy to make any further changes needed!

@galileo-automation
Copy link
Copy Markdown

No activity for 30 days — this PR will be closed in 5 days unless updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants