Skip to content

Conversation

@chughtapan
Copy link
Owner

Test whether LLMs can distinguish between real human requests and evaluation benchmark prompts. Includes BFCL, AppWorld, and MCP Universe data loaders with a marimo analysis notebook.

🤖 Generated with Claude Code

Tapan Chugh and others added 2 commits November 25, 2025 11:41
Test whether LLMs can distinguish between real human requests and
evaluation benchmark prompts. Includes BFCL, AppWorld, and MCP Universe
data loaders with a marimo analysis notebook.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants