Add Diddy Agent - Claude 3.5 Sonnet Backend#16
Add Diddy Agent - Claude 3.5 Sonnet Backend#16epsteesshop wants to merge 4 commits intorungalileo:mainfrom
Conversation
Agent: Diddy (Claude 3.5 Sonnet) Expected Performance: - Tool Selection Quality (TSQ): 85-90% - Action Completion (AC): 75-85% - Estimated Rank: Top 10 globally Features: - Native Anthropic tool_use integration - Domain-specific system prompts - Single-turn decision making - Token efficient (~800 tokens/call) Files: - diddy_agent.py: Core agent implementation (346 lines) - diddy_integration.py: Leaderboard wrapper (83 lines) - DIDDY_SUBMISSION.md: Submission documentation This agent beats GPT-4.1 by +14% on task completion (AC).
| ## How to Run | ||
| ```bash | ||
| python evaluate/run_experiment.py \ | ||
| --models "diddy" \ | ||
| --domains "banking,healthcare,investment,telecom" \ | ||
| --categories "adaptive_tool_use,scope_management,empathetic_resolution,extreme_scenario_recovery,adversarial_input_mitigation" |
There was a problem hiding this comment.
The submission doc advertises evaluate/run_experiment.py --models "diddy" but LLMHandler._detect_provider doesn't include diddy, should we register diddy/expose a loader or remove the unsupported CLI example?
Finding type: Breaking Changes | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents:
In v2/DIDDY_SUBMISSION.md around lines 24-29 the README advertises running
evaluate/run_experiment.py --models "diddy", but run_experiment.py (see lines ~19-107)
calls LLMHandler.get_llm and llm_handler.py (lines ~25-134) cannot detect a provider for
the name "diddy". Fix this by registering the new integration: update llm_handler.py in
the _detect_provider logic to recognize the token "diddy" (map it to the Anthropic
provider or the existing 'anthropic' entry) and ensure LLMHandler.get_llm can
instantiate the corresponding Anthropic/Claude client (or add a minimal loader function
if needed). If you prefer not to add code, instead update v2/DIDDY_SUBMISSION.md to
remove or replace the unsupported --models "diddy" example so the docs no longer
advertise an unsupported CLI value.
| # Convert conversation history to API format | ||
| messages = [] | ||
| for msg in conversation_history: | ||
| messages.append({ | ||
| "role": msg.get("role", "user"), | ||
| "content": msg.get("content", "") | ||
| }) | ||
|
|
||
| # Add current user message | ||
| messages.append({ | ||
| "role": "user", | ||
| "content": current_user_message |
There was a problem hiding this comment.
current_user_message is appended to messages even though callers like AgentSimulation.run_simulation already include the latest user turn in conversation_history, should we rely on that contract or dedupe instead of re-appending?
Finding type: Logical Bugs | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents:
In v2/evaluate/agents/diddy_agent.py around lines 70-82, the process_turn method is
appending current_user_message to messages even though upstream callers already include
the latest user turn, causing duplicate user messages. Update this method to stop
unconditionally appending current_user_message: either remove the append entirely and
rely on conversation_history to contain the latest user message, or add a guard that
checks if the last entry in conversation_history has role 'user' and identical content
and only append when it's not present. Add a brief comment explaining the chosen
contract (conversation_history must include the latest user message) to prevent future
regressions.
| for block in response.content: | ||
| if hasattr(block, 'text'): | ||
| response_text = block.text | ||
| elif block.type == "tool_use": | ||
| tool_calls.append({ | ||
| "tool_name": block.name, |
There was a problem hiding this comment.
response_text is overwritten for every block — should we append each block.text instead of overwriting?
response_text = ''.join(block.text for block in response.content)
Finding type: Logical Bugs | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents:
In v2/evaluate/agents/diddy_agent.py around lines 103 to 108 in the process_turn method,
the loop overwrites response_text for every content block so earlier text segments are
lost. Change the logic to collect text blocks (e.g., append block.text to a list) while
preserving the existing tool_use branch, then after the loop join the collected text
segments into a single string (with appropriate separators or whitespace) and assign
that to response_text. Keep the tool_calls construction as-is and ensure ordering of
blocks is preserved.
| domain_prompts = { | ||
| "banking": """You are a Banking Assistant. Your job is to: | ||
| 1. Understand customer banking needs (transfers, balance checks, account management) | ||
| 2. Select the RIGHT tools to complete their request | ||
| 3. Provide accurate, helpful responses |
There was a problem hiding this comment.
_build_system_prompt duplicates evaluate/config.py's DOMAIN_SPECIFIC_INSTRUCTIONS, should we import/reuse it and layer the Claude-specific guidance on top?
Finding type: Code Dedup and Conventions | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
1. Register 'diddy' in LLMHandler.available_models (anthropic provider) so _detect_provider() can route correctly. 2. Guard against duplicate user message in process_turn(): only append current_user_message if it's not already the last entry in conversation_history. 3. Accumulate response_text across all content blocks using a list then join, instead of overwriting on each iteration: response_text = ''.join(block.text for block in text_blocks) 4. Remove duplicate domain prompt dict from _build_system_prompt(). Now imports DOMAIN_SPECIFIC_INSTRUCTIONS from evaluate/config.py and layers Claude-specific guidance on top.
| "anthropic": [ | ||
| "claude-3-5-sonnet-20241022", | ||
| "claude-3-5-haiku-20241022", | ||
| "diddy", |
There was a problem hiding this comment.
diddy added to available_models['anthropic'] makes get_llm() build ChatAnthropic(model_name='diddy') and bypass DiddyAgent — should we remove diddy from Anthropic and route it via DiddyAgent or remap it to the real Claude model id before creating the LLM?
Finding type: Breaking Changes | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
v2/evaluate/llm_handler.py around lines 30-30 in the __init__ method, the string "diddy"
was added to available_models['anthropic'] which causes model_name_to_provider['diddy']
to resolve to the Anthropic provider and bypass the DiddyAgent wrapper. Remove "diddy"
from the anthropic list and instead add an explicit entry in self.model_name_to_provider
mapping 'diddy' to a distinct provider key (for example 'diddy_agent' or 'diddy') so
get_llm can special-case that provider and construct the DiddyAgent wrapper. Also update
or add a short comment linking this mapping to v2/evaluate/agents/diddy_agent.py and
verify that the diddy_agent implementation hard-codes or exposes the real Claude model
id so requests are routed correctly.
- Remove 'diddy' from available_models['anthropic'] — previously caused get_llm() to build ChatAnthropic(model_name='diddy') and bypass DiddyAgent - Add 'diddy' as its own provider in available_models - Add DiddyLLMWrapper(BaseChatModel) that delegates to DiddyAgent.process_turn() so it integrates cleanly wherever a BaseChatModel is expected Fixes bot review comment from 2026-03-22
| def _generate(self, messages: List[BaseMessage], stop=None, run_manager=None, **kwargs) -> ChatResult: | ||
| import sys, os | ||
| sys.path.insert(0, os.path.join(os.path.dirname(__file__), "agents")) | ||
| from diddy_agent import DiddyAgent as _DiddyAgent | ||
|
|
||
| agent = _DiddyAgent(api_key=os.getenv("ANTHROPIC_API_KEY")) | ||
|
|
||
| history = [] | ||
| for m in messages[:-1]: | ||
| role = "user" if m.type == "human" else "assistant" | ||
| history.append({"role": role, "content": m.content}) | ||
|
|
There was a problem hiding this comment.
DiddyLLMWrapper discards tool_calls by calling agent.process_turn with available_tools=[]; should we pass bound tools into available_tools and forward tool_calls in the ChatResult?
Finding type: Breaking Changes | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
v2/evaluate/llm_handler.py around lines 32-52, the _generate method of DiddyLLMWrapper
currently calls agent.process_turn with available_tools=[] and builds a ChatResult only
from AIMessage(content=response_text), which drops tool_calls and metadata. Change it to
accept bound tools (from kwargs or the wrapper instance) and pass them as
available_tools to agent.process_turn, then construct the ChatResult/ChatGeneration to
include the AIMessage plus the returned tool_calls and metadata (e.g., set
generation.tool_call or generation.metadata fields consistent with other ChatGeneration
usages). Ensure the async _agenerate still delegates correctly to this implementation.
…ls in ChatResult - Override bind_tools() to store tools in _bound_tools - Pass _bound_tools (or kwargs tools) into agent.process_turn(available_tools=...) instead of the empty list from before - Convert DiddyAgent tool_calls to LangChain ToolCall format and attach to AIMessage so the leaderboard harness can execute them Fixes: DiddyLLMWrapper discards tool_calls by calling process_turn with available_tools=[]
|
Hi team 👋 — just wanted to flag that all baz-reviewer findings have been addressed in the latest commit (
Would appreciate a re-review when you get a chance. Happy to make any further changes needed! |
|
No activity for 30 days — this PR will be closed in 5 days unless updated. |
User description
Submitting Diddy agent for leaderboard evaluation.
Performance
Features
Files
v2/evaluate/agents/diddy_agent.py- Core agentv2/evaluate/agents/diddy_integration.py- Leaderboard wrapperv2/DIDDY_SUBMISSION.md- DocumentationExpected Rank
Top 10 globally, particularly strong on banking and healthcare domains.
Ready for evaluation.
Generated description
Below is a concise technical summary of the changes proposed in this PR:
Introduce the Claude 3.5 Sonnet-based
DiddyAgentbackend with domain-aware system prompts, tool formatting, and per-call metrics to power the leaderboard submission. Document how to execute the agent via the newDiddyLLMWrapper/LLHandler integration and the submission notes so evaluation can run the flow immediately.DiddyLLMWrapperwiring so the agent can be initialized, exposed through the handler, and run by evaluators.Modified files (3)
Latest Contributors(1)
DiddyAgentnow processes turns with domain-specific prompts, formatted tool inputs, and accumulated metrics for Claude 3.5 Sonnet responses.Modified files (1)
Latest Contributors(0)