docs: judge and user simulator #206
Merged
Commits (11):
- `cd2a0fd` chore: add scenario metadata replacer (Aryansharma28)
- `f8a26d6` fix: filter metadata updates to scenario docs (Aryansharma28)
- `1a8122c` feat: map scenario URLs to frontmatter descriptions (Aryansharma28)
- `08e8227` Merge branch 'main' of https://github.com/langwatch/scenario (Aryansharma28)
- `35a0d2c` Merge branch 'main' of https://github.com/langwatch/scenario (Aryansharma28)
- `82c2a27` Merge branch 'main' of https://github.com/langwatch/scenario (Aryansharma28)
- `ab1cf22` Merge branch 'main' of https://github.com/langwatch/scenario (Aryansharma28)
- `2832ca2` judge agent (Aryansharma28)
- `dbfc038` user-simulator (Aryansharma28)
- `6dd38fc` fix: lint fix (Aryansharma28)
- `4133d35` docs: table fixes (Aryansharma28)
---
title: Judge Agent - Automated Test Evaluation
description: Master the Judge Agent for automated scenario evaluation. Comprehensive guide covering configuration options, evaluation criteria design, custom system prompts, and best practices for reliable AI agent testing across Python and TypeScript.
---

import { Callout } from "vocs/components";

# Judge Agent

## Overview

The **Judge Agent** is an LLM-powered evaluator that automatically determines whether your agent under test meets defined success criteria. Instead of writing complex assertion logic, you describe what success looks like in natural language, and the judge evaluates each conversation turn to decide whether to continue, succeed, or fail the test.

After each agent response, the judge:

1. **Reviews** the entire conversation history
2. **Evaluates** against your defined criteria
3. **Decides** whether to continue, succeed, or fail
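In its smallest form, the judge is just one more entry in the `agents` list; a minimal Python sketch using only `criteria`:

```python
import scenario

# Criteria are the only judge-specific input here; model and temperature
# fall back to global configuration defaults.
judge = scenario.JudgeAgent(criteria=["Agent answers the billing question accurately"])
```
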
---

## Use Case Example

Let's test a customer support agent handling billing inquiries:

:::code-group

```python [python]
import pytest
import scenario

@pytest.mark.asyncio
async def test_billing_inquiry_quality():
    result = await scenario.run(
        name="billing inquiry handling",
        description="""
            User received an unexpected charge on their credit card and is
            concerned but polite. They have their account information ready.
        """,
        agents=[
            CustomerSupportAgent(),  # your agent under test (a scenario.AgentAdapter)
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent asks for account information to investigate",
                "Agent explains the charge clearly",
                "Agent offers a solution or next steps",
                "Agent maintains a helpful and empathetic tone",
                "Agent should not make promises about refunds without verification",
            ]),
        ],
        max_turns=8,
    )

    assert result.success
```

```typescript [typescript]
import scenario from "@langwatch/scenario";
import { openai } from "@ai-sdk/openai";
import { expect } from "vitest";

const result = await scenario.run({
  name: "billing inquiry handling",
  description: `
    User received an unexpected charge on their credit card and is
    concerned but polite. They have their account information ready.
  `,
  agents: [
    customerSupportAgent, // your agent under test (an AgentAdapter)
    scenario.userSimulatorAgent({ model: openai("gpt-4o") }),
    scenario.judgeAgent({
      model: openai("gpt-4o"),
      criteria: [
        "Agent asks for account information to investigate",
        "Agent explains the charge clearly",
        "Agent offers a solution or next steps",
        "Agent maintains a helpful and empathetic tone",
        "Agent should not make promises about refunds without verification",
      ],
    }),
  ],
  maxTurns: 8,
});

expect(result.success).toBe(true);
```

:::
---

## Configuration Reference

<Callout type="info">
In TypeScript, `criteria` is required and every other parameter is optional; in Python, all parameters are optional. The judge falls back to global configuration defaults when a parameter isn't specified.
</Callout>

| Parameter | Type (Python / TypeScript) | Required | Default | Description |
|-----------|---------------------------|----------|---------|-------------|
| `criteria` | `List[str]` / `string[]` | TS: Yes<br/>PY: No | `[]` | Success criteria to evaluate. Include positive requirements and negative constraints. |
| `model` | `str` / `LanguageModel` | No | Global config | LLM model identifier (PY: `"openai/gpt-4o"`, TS: `openai("gpt-4o")`). |
| `temperature` | `float` / `number` | No | `0.0` | Sampling temperature (0.0-1.0). Use 0.0-0.2 for consistent evaluation. |
| `max_tokens` / `maxTokens` | `int` / `number` | No | Model default | Maximum tokens for judge reasoning and explanations. |
| `system_prompt` / `systemPrompt` | `str` / `string` | No | Built-in | Custom system prompt to override default judge behavior. |
| `api_base` | `str` | No | Global config | **Python only**: Base URL for custom API endpoints. |
| `api_key` | `str` | No | Environment | **Python only**: API key for the model provider. |
| `**extra_params` | `dict` | No | `{}` | **Python only**: Additional [LiteLLM parameters](https://docs.litellm.ai/docs/completion/input) such as `headers`, `timeout`, or `client`. |
| `name` | `string` | No | `"Judge"` | **TypeScript only**: Display name in logs and traces. |
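
As a rough sketch of how these options compose in Python (values are illustrative, not recommendations; the commented-out lines show the Python-only options from the table):

```python
import scenario

# A fully configured judge; every parameter below appears in the table above.
judge = scenario.JudgeAgent(
    criteria=[
        "Agent explains the charge clearly",
        "Agent should not promise refunds without verification",
    ],
    model="openai/gpt-4o",  # overrides the global default model
    temperature=0.1,        # low temperature for consistent verdicts
    max_tokens=512,         # cap the judge's reasoning length
    # api_base="https://llm-proxy.internal/v1",  # hypothetical custom endpoint
    # timeout=30,           # forwarded to LiteLLM via **extra_params
)
```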

---

## Writing Effective Criteria

Good criteria are **specific**, **measurable**, **relevant**, and **actionable**:

:::code-group

```python [python]
# Good - specific and measurable
scenario.JudgeAgent(criteria=[
    "Agent asks for the user's order number",
    "Agent provides a tracking link",
    "Agent offers to help with anything else",
    "Agent should not promise delivery dates without checking the system",
])

# Avoid vague criteria
scenario.JudgeAgent(criteria=[
    "Agent is helpful",  # Too vague
    "Agent does everything right",  # Not measurable
])
```

```typescript [typescript]
// Good - specific and measurable
scenario.judgeAgent({
  criteria: [
    "Agent asks for the user's order number",
    "Agent provides a tracking link",
    "Agent offers to help with anything else",
    "Agent should not promise delivery dates without checking the system",
  ],
});

// Avoid vague criteria
scenario.judgeAgent({
  criteria: [
    "Agent is helpful", // Too vague
    "Agent does everything right", // Not measurable
  ],
});
```

:::

## Next Steps

- [User Simulator Agent](/basics/user-simulator) - Configure realistic user behavior
- [Writing Scenarios](/basics/writing-scenarios) - Best practices for scenario design
- [Scripted Simulations](/basics/scripted-simulations) - Combine judges with precise flow control
- [Configuration](/basics/configuration) - Set global defaults for all judges
---
title: User Simulator Agent - Realistic Test Interactions
description: Master the User Simulator Agent for realistic agent testing. Complete guide covering configuration options, persona customization, behavior patterns, and best practices for simulating diverse user interactions in Python and TypeScript.
---

import { Callout } from "vocs/components";

# User Simulator Agent

## Overview

The **User Simulator Agent** is an LLM-powered agent that simulates realistic user behavior during scenario tests. Instead of writing scripted user messages, you describe the user's context and goals, and the simulator generates natural, contextually appropriate messages that drive the conversation forward.

### When to Use the User Simulator

The User Simulator is ideal for:

- **Automatic Testing**: Let the conversation unfold naturally without scripting every message
- **Diverse Scenarios**: Test how your agent handles different user personalities and communication styles
- **Edge Cases**: Explore unexpected user behaviors and responses
- **Multi-Turn Conversations**: Simulate realistic back-and-forth interactions

### How It Works

The user simulator:

1. **Reads** the scenario description and conversation history
2. **Generates** a natural user message based on context
3. **Adapts** its communication style to match the described persona
4. **Responds** realistically to the agent's messages
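Because the simulator draws everything it needs from the scenario `description`, the no-argument form is often enough; a minimal Python sketch:

```python
import scenario

# With no arguments, the simulator reads the scenario description
# and generates user messages with the globally configured model.
user = scenario.UserSimulatorAgent()
```
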
---

## Use Case Example: Testing a Frustrated Customer

Let's test how a support agent handles an increasingly frustrated customer:

:::code-group

```python [python]
import pytest
import scenario

class TechnicalSupportAgent(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
        # Your technical support agent implementation
        user_message = input.last_new_user_message_str()
        return await my_support_bot.process(user_message)

@pytest.mark.asyncio
async def test_frustrated_customer_handling():
    result = await scenario.run(
        name="frustrated customer with internet issues",
        description="""
            User is a non-technical person experiencing slow internet for 3 days.
            They've already tried calling support twice with no resolution.
            They're frustrated and tired of technical jargon. They just want
            their internet to work and are losing patience with troubleshooting steps.
        """,
        agents=[
            TechnicalSupportAgent(),
            scenario.UserSimulatorAgent(
                model="openai/gpt-4o",
                temperature=0.3,  # Some variability for realistic frustration
            ),
            scenario.JudgeAgent(criteria=[
                "Agent acknowledges the customer's frustration empathetically",
                "Agent avoids excessive technical jargon",
                "Agent provides simple, clear instructions",
                "Agent offers escalation if troubleshooting doesn't work",
                "Agent remains professional despite customer frustration",
            ]),
        ],
        max_turns=10,
    )

    assert result.success
    print(f"Test completed with {len(result.messages)} messages")
```

```typescript [typescript]
import scenario, { type AgentAdapter, AgentRole } from "@langwatch/scenario";
import { describe, it, expect } from "vitest";
import { openai } from "@ai-sdk/openai";

const technicalSupportAgent: AgentAdapter = {
  role: AgentRole.AGENT,
  async call(input) {
    // Your technical support agent implementation
    const userMessage = input.messages.filter((m) => m.role === "user").at(-1)?.content;
    return await mySupportBot.process(userMessage);
  },
};

describe("Frustrated Customer Handling", () => {
  it("should handle a frustrated non-technical customer with empathy", async () => {
    const result = await scenario.run({
      name: "frustrated customer with internet issues",
      description: `
        User is a non-technical person experiencing slow internet for 3 days.
        They've already tried calling support twice with no resolution.
        They're frustrated and tired of technical jargon. They just want
        their internet to work and are losing patience with troubleshooting steps.
      `,
      agents: [
        technicalSupportAgent,
        scenario.userSimulatorAgent({
          model: openai("gpt-4o"),
          temperature: 0.3, // Some variability for realistic frustration
        }),
        scenario.judgeAgent({
          model: openai("gpt-4o"),
          criteria: [
            "Agent acknowledges the customer's frustration empathetically",
            "Agent avoids excessive technical jargon",
            "Agent provides simple, clear instructions",
            "Agent offers escalation if troubleshooting doesn't work",
            "Agent remains professional despite customer frustration",
          ],
        }),
      ],
      maxTurns: 10,
    });

    expect(result.success).toBe(true);
    console.log(`Test completed with ${result.messages.length} messages`);
  }, 60_000);
});
```

:::
---

## Configuration Reference

<Callout type="info">
All parameters are optional. The user simulator uses global configuration defaults when parameters are not specified.
</Callout>

| Parameter | Type (Python / TypeScript) | Required | Default | Description |
|-----------|---------------------------|----------|---------|-------------|
| `model` | `str` / `LanguageModel` | No | Global config | LLM model identifier (PY: `"openai/gpt-4o"`, TS: `openai("gpt-4o")`). |
| `temperature` | `float` / `number` | No | `0.0` | Sampling temperature (0.0-1.0). Higher values (0.3-0.7) create more varied user messages. |
| `max_tokens` / `maxTokens` | `int` / `number` | No | Model default | Maximum tokens for user messages. Keep this modest so messages stay naturally brief. |
| `system_prompt` / `systemPrompt` | `str` / `string` | No | Built-in | Custom system prompt to override default user simulation behavior. |
| `api_base` | `str` | No | Global config | **Python only**: Base URL for custom API endpoints. |
| `api_key` | `str` | No | Environment | **Python only**: API key for the model provider. |
| `**extra_params` | `dict` | No | `{}` | **Python only**: Additional [LiteLLM parameters](https://docs.litellm.ai/docs/completion/input) such as `headers`, `timeout`, or `client`. |
| `name` | `string` | No | `"User"` | **TypeScript only**: Display name in logs and traces. |
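
For persona tweaks that go beyond the scenario description, `system_prompt` replaces the simulator's built-in instructions; a Python sketch (the prompt text is illustrative, and the parameters come from the table above):

```python
import scenario

# Persona-tuned simulator: system_prompt overrides the built-in behavior.
simulator = scenario.UserSimulatorAgent(
    model="openai/gpt-4o",
    temperature=0.5,  # extra variability for a less predictable user
    system_prompt="""
        You are simulating an impatient, non-technical user.
        Keep messages short, avoid jargon, and push back on
        multi-step troubleshooting instructions.
    """,
)
```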
---

## Next Steps

Explore related documentation:

- [Judge Agent](/basics/judge-agent) - Configure automated evaluation
- [Core Concepts](/basics/concepts) - Understand the simulation loop
- [Writing Scenarios](/basics/writing-scenarios) - Best practices for scenario design
- [Scripted Simulations](/basics/scripted-simulations) - Mix simulation with precise control
- [Configuration](/basics/configuration) - Set global defaults for all simulators