Add humanity test experiment for eval vs real prompt detection #31

chughtapan · 2025-11-25T19:42:01Z

Test whether LLMs can distinguish between real human requests and evaluation benchmark prompts. Includes BFCL, AppWorld, and MCP Universe data loaders with a marimo analysis notebook.

🤖 Generated with Claude Code

Test whether LLMs can distinguish between real human requests and evaluation benchmark prompts. Includes BFCL, AppWorld, and MCP Universe data loaders with a marimo analysis notebook. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Tapan Chugh and others added 2 commits November 25, 2025 11:41

Fix mypy errors for appworld imports in humanity test

2ffb997

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add humanity test experiment for eval vs real prompt detection #31

Add humanity test experiment for eval vs real prompt detection #31

Uh oh!

chughtapan commented Nov 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add humanity test experiment for eval vs real prompt detection #31

Are you sure you want to change the base?

Add humanity test experiment for eval vs real prompt detection #31

Uh oh!

Conversation

chughtapan commented Nov 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants