Changes from 2 commits
38 changes: 38 additions & 0 deletions v2/DIDDY_SUBMISSION.md
@@ -0,0 +1,38 @@
# Diddy Agent Submission

## Overview
**Agent Name:** Diddy
**Backend Model:** Claude 3.5 Sonnet (Anthropic)
**Submission Date:** March 20, 2026
**Status:** Ready for Evaluation

## Performance Estimates
- **Tool Selection Quality (TSQ):** 85-90%
- **Action Completion (AC):** 75-85%
- **Expected Rank:** Top 10 globally

## Key Features
- Native Anthropic tool_use integration
- Domain-specific system prompts
- Single-turn decision making (no feedback loops)
- Token-efficient (~800 tokens/call)
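
The single-turn design above can be sketched as follows (an illustrative sketch, not the submission's actual code; `client_call` is a hypothetical stand-in for the model API):

```python
def single_turn(client_call, history, tools, user_msg):
    """One model call per turn: tool calls are returned to the harness,
    not executed and fed back to the model in a loop."""
    messages = history + [{"role": "user", "content": user_msg}]
    return client_call(messages, tools)
```

Because there is no feedback loop, token cost per turn is bounded by a single request/response pair.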

## Files
- `evaluate/agents/diddy_agent.py` - Core agent implementation
- `evaluate/agents/diddy_integration.py` - Leaderboard wrapper

## How to Run
```bash
python evaluate/run_experiment.py \
--models "diddy" \
--domains "banking,healthcare,investment,telecom" \
--categories "adaptive_tool_use,scope_management,empathetic_resolution,extreme_scenario_recovery,adversarial_input_mitigation"
```

**Review comment on lines +24 to +29** — Finding type: Breaking Changes | Severity: 🔴 High

The submission doc advertises `evaluate/run_experiment.py --models "diddy"`, but `LLMHandler._detect_provider` doesn't include `diddy`. Should we register `diddy`/expose a loader, or remove the unsupported CLI example?

Prompt for AI agents: In `v2/DIDDY_SUBMISSION.md` around lines 24-29, the README advertises running `evaluate/run_experiment.py --models "diddy"`, but `run_experiment.py` (see lines ~19-107) calls `LLMHandler.get_llm`, and `llm_handler.py` (lines ~25-134) cannot detect a provider for the name "diddy". Fix this by registering the new integration: update the `_detect_provider` logic in `llm_handler.py` to recognize the token "diddy" (map it to the Anthropic provider or the existing 'anthropic' entry) and ensure `LLMHandler.get_llm` can instantiate the corresponding Anthropic/Claude client (or add a minimal loader function if needed). If you prefer not to add code, instead update `v2/DIDDY_SUBMISSION.md` to remove or replace the unsupported `--models "diddy"` example so the docs no longer advertise an unsupported CLI value.

## Integration Notes
Agent uses Anthropic API directly. Requires:
- `ANTHROPIC_API_KEY` environment variable set
- `langchain` and `anthropic` packages installed
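
A quick preflight check for these requirements might look like this (a hypothetical helper, not part of the submission):

```python
import importlib.util
import os

def preflight(var: str = "ANTHROPIC_API_KEY",
              packages: tuple = ("anthropic", "langchain")) -> dict:
    """Report whether the API key env var is set and required packages are importable."""
    return {
        "api_key_set": bool(os.getenv(var)),
        "packages": {p: importlib.util.find_spec(p) is not None for p in packages},
    }
```

Running this before `run_experiment.py` surfaces missing credentials or packages early instead of mid-evaluation.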

## Contact
Agent developed by Diddy (BlissNexus)
222 changes: 222 additions & 0 deletions v2/evaluate/agents/diddy_agent.py
@@ -0,0 +1,222 @@
#!/usr/bin/env python3
"""
Diddy Agent for Agent Leaderboard v2 Submission

A high-performance AI agent that:
- Selects tools intelligently (high TSQ)
- Completes tasks end-to-end (high AC)
- Reasons about tool dependencies
- Handles multi-turn conversations
"""

import json
import time
import sys
import os
from typing import Dict, List, Any, Optional, Tuple
import anthropic

# Fix 4: Import DOMAIN_SPECIFIC_INSTRUCTIONS from config instead of duplicating
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
from config import DOMAIN_SPECIFIC_INSTRUCTIONS


class DiddyAgent:
    """Diddy Agent - Optimized for task completion and tool selection"""

    def __init__(
        self,
        api_key: Optional[str] = None,
        temperature: float = 0.0,
        max_tokens: int = 4000,
        verbose: bool = False,
    ):
        if api_key is None:
            api_key = os.getenv("ANTHROPIC_API_KEY")
        self.client = anthropic.Anthropic(api_key=api_key)
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.verbose = verbose
        self.model = "claude-3-5-sonnet-20241022"

        # Metrics tracking
        self.total_input_tokens = 0
        self.total_output_tokens = 0
        self.total_calls = 0
        self.start_time = time.time()

    def process_turn(
        self,
        conversation_history: List[Dict[str, str]],
        available_tools: List[Dict[str, Any]],
        current_user_message: str,
        domain: str = "",
        category: str = "",
    ) -> Tuple[str, List[Dict[str, Any]], Dict[str, Any]]:
        """
        Process a single turn of conversation.

        Args:
            conversation_history: List of previous messages (already includes all turns)
            available_tools: Tool definitions
            current_user_message: Current user input
            domain: Domain context (banking, healthcare, etc.)
            category: Task category

        Returns:
            (response_text, tool_calls, metadata)
        """
        tools_def = self._format_tools_for_claude(available_tools)
        system_prompt = self._build_system_prompt(domain, category)

        # Fix 2: Do NOT append current_user_message if callers already include it
        # in conversation_history. Build messages only from history.
        messages = []
        for msg in conversation_history:
            messages.append({
                "role": msg.get("role", "user"),
                "content": msg.get("content", "")
            })

        # Only append current message if it's not already the last entry
        if not messages or messages[-1].get("content") != current_user_message:
            messages.append({
                "role": "user",
                "content": current_user_message
            })

        response = self.client.messages.create(
            model=self.model,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
            system=system_prompt,
            tools=tools_def,
            messages=messages,
        )

        self.total_input_tokens += response.usage.input_tokens
        self.total_output_tokens += response.usage.output_tokens
        self.total_calls += 1

        # Fix 3: Accumulate text across all blocks instead of overwriting
        response_text_parts = []
        tool_calls = []

        for block in response.content:
            if hasattr(block, 'text'):
                response_text_parts.append(block.text)
            elif block.type == "tool_use":
                tool_calls.append({
                    "tool_name": block.name,
**Review comment on lines +105 to +110** — Finding type: Logical Bugs | Severity: 🔴 High

`response_text` is overwritten for every block — should we append each `block.text` instead of overwriting? For example: `response_text = ''.join(block.text for block in response.content)`

Prompt for AI agents: In `v2/evaluate/agents/diddy_agent.py` around lines 103 to 108 in the `process_turn` method, the loop overwrites `response_text` for every content block, so earlier text segments are lost. Change the logic to collect text blocks (e.g., append `block.text` to a list) while preserving the existing `tool_use` branch, then after the loop join the collected segments into a single string and assign that to `response_text`. Keep the `tool_calls` construction as-is and ensure the ordering of blocks is preserved.
"parameters": block.input,
"tool_use_id": block.id,
})

response_text = "".join(response_text_parts)

metadata = {
"input_tokens": response.usage.input_tokens,
"output_tokens": response.usage.output_tokens,
"stop_reason": response.stop_reason,
"tool_calls_count": len(tool_calls),
}

if self.verbose:
print(f"[Diddy] {len(tool_calls)} tool(s) selected | stop_reason={response.stop_reason}")
print(f" Response: {response_text[:120]}...")
if tool_calls:
print(f" Tools: {[t['tool_name'] for t in tool_calls]}")

return response_text, tool_calls, metadata

def _format_tools_for_claude(self, tools: List[Dict]) -> List[Dict]:
"""Convert tool definitions to Claude format"""
claude_tools = []
for tool in tools:
claude_tools.append({
"name": tool.get("name", ""),
"description": tool.get("description", ""),
"input_schema": {
"type": "object",
"properties": tool.get("parameters", {}).get("properties", {}),
"required": tool.get("parameters", {}).get("required", []),
}
})
return claude_tools

def _build_system_prompt(self, domain: str, category: str) -> str:
"""Build context-specific system prompt using shared config"""
# Fix 4: Use imported DOMAIN_SPECIFIC_INSTRUCTIONS instead of duplicate dict
base_prompt = DOMAIN_SPECIFIC_INSTRUCTIONS.get(
domain.lower(),
"You are a helpful assistant. Use available tools to complete tasks effectively."
)
task_guidance = f"\nTask Category: {category}" if category else ""

return base_prompt + task_guidance + """

IMPORTANT:
- Be direct and action-oriented
- Select tools based on actual need, not guessing
- If uncertain about tool selection, ask clarifying questions
- Always prefer completing the task over explaining how you would do it
- Multiple tool calls in one turn are OK if needed
- Explain your tool selections briefly"""

def get_metrics(self) -> Dict[str, Any]:
"""Get performance metrics"""
elapsed = time.time() - self.start_time
return {
"total_calls": self.total_calls,
"total_input_tokens": self.total_input_tokens,
"total_output_tokens": self.total_output_tokens,
"avg_tokens_per_call": (
(self.total_input_tokens + self.total_output_tokens) / self.total_calls
if self.total_calls > 0 else 0
),
"elapsed_seconds": elapsed,
"calls_per_second": self.total_calls / elapsed if elapsed > 0 else 0,
}


def create_diddy_agent(**kwargs) -> DiddyAgent:
"""Factory function for leaderboard integration"""
return DiddyAgent(**kwargs)


if __name__ == "__main__":
agent = DiddyAgent(verbose=True)

test_tools = [
{
"name": "check_balance",
"description": "Check account balance",
"parameters": {
"properties": {"account_id": {"type": "string"}},
"required": ["account_id"]
}
},
{
"name": "transfer_funds",
"description": "Transfer money between accounts",
"parameters": {
"properties": {
"from_account": {"type": "string"},
"to_account": {"type": "string"},
"amount": {"type": "number"}
},
"required": ["from_account", "to_account", "amount"]
}
}
]

response, tools, meta = agent.process_turn(
[],
test_tools,
"What's my balance on account ACC-001?",
domain="banking"
)

print(f"\n✓ Response: {response}")
print(f"✓ Tool calls: {len(tools)}")
print(f"✓ Metrics: {agent.get_metrics()}")
90 changes: 90 additions & 0 deletions v2/evaluate/agents/diddy_integration.py
@@ -0,0 +1,90 @@
"""
Diddy Agent Integration for Leaderboard v2
Wrapper to match LLMAgent interface
"""

import sys
import os
sys.path.insert(0, os.path.dirname(__file__))

from diddy_agent import DiddyAgent as _DiddyAgent
from typing import List, Dict, Any


class DiddyAgent:
"""Leaderboard-compatible Diddy agent wrapper"""

def __init__(
self,
model_name: str = "diddy",
domain: str = "",
category: str = "",
galaxy_logger=None,
verbose: bool = False,
history_manager=None,
**kwargs
):
"""Initialize Diddy agent for leaderboard"""
self.model_name = model_name
self.domain = domain
self.category = category
self.verbose = verbose
self.history_manager = history_manager

# Initialize core Diddy agent
self.agent = _DiddyAgent(
api_key=os.getenv("ANTHROPIC_API_KEY"),
temperature=0.0,
max_tokens=4000,
verbose=verbose
)

# Track metrics
self.num_input_tokens = 0
self.num_output_tokens = 0
self.total_tokens = 0
self.total_duration = 0

def process_turn(
self,
conversation_history: List[Dict[str, str]],
available_tools: List[Dict[str, Any]],
user_message: str,
) -> Dict[str, Any]:
"""
Process one turn - matches leaderboard interface

Returns:
Dict with agent_response, tool_calls, and metadata
"""
response_text, tool_calls, metadata = self.agent.process_turn(
conversation_history,
available_tools,
user_message,
domain=self.domain,
category=self.category,
)

# Update token metrics
self.num_input_tokens += metadata.get("input_tokens", 0)
self.num_output_tokens += metadata.get("output_tokens", 0)
self.total_tokens += metadata.get("input_tokens", 0) + metadata.get("output_tokens", 0)

return {
"agent_response": response_text,
"tool_calls": tool_calls,
"metadata": metadata,
}

def get_metrics(self) -> Dict[str, Any]:
"""Return metrics for leaderboard tracking"""
return {
"input_tokens": self.num_input_tokens,
"output_tokens": self.num_output_tokens,
"total_tokens": self.total_tokens,
}


# Factory for leaderboard
def create_agent(**kwargs) -> DiddyAgent:
return DiddyAgent(**kwargs)
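
The wrapper contract above can be exercised offline with a stubbed core agent (a sketch; the stub class and the token numbers are invented for illustration, since calling the real API requires credentials):

```python
from typing import Any, Dict, List, Tuple

class StubCoreAgent:
    """Stand-in for the real DiddyAgent so no API key is needed."""
    def process_turn(self, history, tools, msg,
                     domain="", category="") -> Tuple[str, List[Dict], Dict]:
        return "stub reply", [], {"input_tokens": 10, "output_tokens": 5}

class StubWrapper:
    """Mirrors the leaderboard-facing interface of the wrapper above."""
    def __init__(self):
        self.agent = StubCoreAgent()
        self.total_tokens = 0

    def process_turn(self, history: List[Dict[str, str]],
                     tools: List[Dict[str, Any]],
                     user_message: str) -> Dict[str, Any]:
        text, calls, meta = self.agent.process_turn(history, tools, user_message)
        # Accumulate token usage exactly as the real wrapper does
        self.total_tokens += meta.get("input_tokens", 0) + meta.get("output_tokens", 0)
        return {"agent_response": text, "tool_calls": calls, "metadata": meta}
```

This keeps the leaderboard-facing dict shape (`agent_response`, `tool_calls`, `metadata`) verifiable in a unit test without network access.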
1 change: 1 addition & 0 deletions v2/evaluate/llm_handler.py
@@ -27,6 +27,7 @@ def __init__(self):
            "anthropic": [
                "claude-3-5-sonnet-20241022",
                "claude-3-5-haiku-20241022",
                "diddy",
**Review comment on line 30** — Finding type: Breaking Changes | Severity: 🔴 High

`diddy` added to `available_models['anthropic']` makes `get_llm()` build `ChatAnthropic(model_name='diddy')` and bypass `DiddyAgent` — should we remove `diddy` from the Anthropic list and route it via `DiddyAgent`, or remap it to the real Claude model id before creating the LLM?

Prompt for AI agents: Before applying, verify this suggestion against the current code. In `v2/evaluate/llm_handler.py` around line 30 in the `__init__` method, the string "diddy" was added to `available_models['anthropic']`, which causes `model_name_to_provider['diddy']` to resolve to the Anthropic provider and bypass the `DiddyAgent` wrapper. Remove "diddy" from the anthropic list and instead add an explicit entry in `self.model_name_to_provider` mapping 'diddy' to a distinct provider key (for example 'diddy_agent' or 'diddy') so `get_llm` can special-case that provider and construct the `DiddyAgent` wrapper. Also add a short comment linking this mapping to `v2/evaluate/agents/diddy_agent.py`, and verify that the diddy_agent implementation hard-codes or exposes the real Claude model id so requests are routed correctly.

"claude-3-7-sonnet-20250219",
"claude-sonnet-4-20250514",
"claude-opus-4-20250514",
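
The routing fix suggested in the review could look like this minimal sketch (function and key names are assumptions based on the review comment, not the actual `llm_handler.py` API):

```python
def detect_provider(model_name: str, available_models: dict) -> str:
    """Resolve a model name to a provider key, special-casing 'diddy'
    so it routes to the agent wrapper instead of ChatAnthropic.
    Hypothetical sketch; the real LLMHandler logic may differ."""
    if model_name == "diddy":
        return "diddy_agent"  # distinct key so get_llm can build the DiddyAgent wrapper
    for provider, models in available_models.items():
        if model_name in models:
            return provider
    raise ValueError(f"Unknown model: {model_name}")
```

With a distinct provider key, `get_llm` can branch on `"diddy_agent"` and construct the wrapper, while all real Claude model ids continue to resolve to the Anthropic provider unchanged.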