9 changes: 5 additions & 4 deletions docs.json
@@ -66,12 +66,13 @@
]
},
{
"group": "Behaviors",
"group": "Judges",
"icon": "gavel",
"pages": [
"behaviors/introduction",
"behaviors/setup",
"behaviors/pull-evaluations"
"judges/introduction",
"judges/setup",
"judges/submit-feedback",
"judges/pull-evaluations"
]
},
{
4 changes: 2 additions & 2 deletions index.mdx
@@ -19,9 +19,9 @@ description: "Start improving your AI applications with ZeroEval"
performance
</Card>
<Card
title="Behaviors"
title="Judges"
icon="gavel"
href="/behaviors/introduction"
href="/judges/introduction"
>
Get reliable AI evaluation with judges that are calibrated to human
preferences
6 changes: 3 additions & 3 deletions behaviors/introduction.mdx → judges/introduction.mdx
@@ -5,12 +5,12 @@ description: "Continuously evaluate your production traffic with judges that lea

<video src="/videos/calibrated-judge.mp4" controls muted playsInline loop preload="metadata" />

Calibrated LLM judges are AI evaluators that watch your traces, sessions, or spans and score behavior according to criteria you define. They get better over time the more you refine and correct their evaluations.
Calibrated LLM judges are AI evaluators that watch your traces, sessions, or spans and score outputs according to criteria you define. They get better over time the more you refine and correct their evaluations.

## When to use

Use a behavior when you want consistent, scalable evaluation of:
Use a judge when you want consistent, scalable evaluation of:

- Hallucinations, safety/policy violations
- Response quality (helpfulness, tone, structure)
- Latency, cost, and error patterns tied to behaviors
- Latency, cost, and error patterns tied to specific criteria
10 changes: 7 additions & 3 deletions behaviors/pull-evaluations.mdx → judges/pull-evaluations.mdx
@@ -26,7 +26,7 @@ import zeroeval as ze

ze.init(api_key="YOUR_API_KEY")

response = ze.get_behavior_evaluations(
response = ze.get_judge_evaluations(
project_id="your-project-id",
judge_id="your-judge-id",
limit=100,
@@ -44,7 +44,7 @@ for eval in response["evaluations"]:
**Optional filters:**

```python
response = ze.get_behavior_evaluations(
response = ze.get_judge_evaluations(
project_id="your-project-id",
judge_id="your-judge-id",
limit=100,
@@ -150,7 +150,7 @@ offset = 0
limit = 100

while True:
response = ze.get_behavior_evaluations(
response = ze.get_judge_evaluations(
project_id="your-project-id",
judge_id="your-judge-id",
limit=limit,
@@ -166,3 +166,7 @@

print(f"Fetched {len(all_evaluations)} total evaluations")
```

## Related

- [Submitting Feedback](/judges/submit-feedback) - Programmatically submit feedback for judge evaluations
4 changes: 2 additions & 2 deletions behaviors/setup.mdx → judges/setup.mdx
@@ -7,8 +7,8 @@ description: "Create and calibrate an AI judge in minutes"

## Creating a judge (&lt;5 mins)

1. Go to [Monitoring → Judges → New Judge](https://app.zeroeval.com/monitoring/signal-automations).
2. Sepcify the behaviour that you want to track from your production traffic.
1. Go to [Monitoring → Judges → New Judge](https://app.zeroeval.com/monitoring/judges).
2. Specify the criteria that you want to evaluate from your production traffic.
3. Tweak the prompt of the judge until it matches what you are looking for!

That's it! Historical and future traces will be scored automatically and shown in the dashboard.
119 changes: 119 additions & 0 deletions judges/submit-feedback.mdx
@@ -0,0 +1,119 @@
---
title: "Submitting Feedback"
description: "Programmatically submit feedback for judge evaluations via SDK"
---

## Overview

When calibrating judges, you can submit feedback programmatically using the SDK.
This is useful for:

- Bulk feedback submission from automated pipelines
- Integration with custom review workflows
- Syncing feedback from external labeling tools

## Important: Using the Correct IDs

Judge evaluations involve two related spans:

| ID | Description |
|---|---|
| **Source Span ID** | The original LLM call that was evaluated |
| **Judge Call Span ID** | The span created when the judge ran its evaluation |

When submitting feedback, always include the `judge_id` parameter to ensure
feedback is correctly associated with the judge evaluation.

## Python SDK

### From the UI (Recommended)

The easiest way to get the correct IDs is from the Judge Evaluation modal:

1. Open a judge evaluation in the dashboard
2. Expand the "SDK Integration" section
3. Click "Copy" to copy the pre-filled Python code
4. Paste and customize the generated code

### Manual Submission

```python
from zeroeval import ZeroEval

client = ZeroEval()

# Submit feedback for a judge evaluation
client.send_feedback(
prompt_slug="your-judge-task-slug", # The task/prompt associated with the judge
completion_id="span-id-here", # The span ID from the evaluation
thumbs_up=True, # True = correct, False = incorrect
reason="Optional explanation",
judge_id="automation-id-here", # Required for judge feedback
)
```

### Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `prompt_slug` | str | Yes | The task slug associated with the judge |
| `completion_id` | str | Yes | The span ID being evaluated |
| `thumbs_up` | bool | Yes | `True` if the judge was correct, `False` if it was wrong |
| `reason` | str | No | Explanation of the feedback |
| `judge_id` | str | Yes* | The judge automation ID (*required for judge feedback) |

## REST API

```bash
curl -X POST "https://api.zeroeval.com/v1/prompts/{task_slug}/completions/{span_id}/feedback" \
-H "Authorization: Bearer $ZEROEVAL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"thumbs_up": true,
"reason": "Judge correctly identified the issue",
"judge_id": "automation-uuid-here"
}'
```
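
If you prefer Python over curl, the same call can be made with the `requests` library. This is only a convenience sketch mirroring the curl command above; the task slug, span ID, and judge automation ID are placeholders.

```python
import os

import requests

# Placeholders: substitute your task slug, span ID, and judge automation ID
task_slug = "your-judge-task-slug"
span_id = "span-id-here"

response = requests.post(
    f"https://api.zeroeval.com/v1/prompts/{task_slug}/completions/{span_id}/feedback",
    headers={
        "Authorization": f"Bearer {os.environ['ZEROEVAL_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "thumbs_up": True,
        "reason": "Judge correctly identified the issue",
        "judge_id": "automation-uuid-here",
    },
)
response.raise_for_status()
```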

## Finding Your IDs

| ID | Where to Find It |
|---|---|
| **Task Slug** | In the judge settings, or the URL when editing the judge's prompt |
| **Span ID** | In the evaluation modal, or via `get_judge_evaluations()` response |
| **Judge ID** | In the URL when viewing a judge (`/judges/{judge_id}`) |

## Bulk Feedback Submission

To submit feedback on multiple evaluations, iterate through them:

```python
from zeroeval import ZeroEval

client = ZeroEval()

# Get evaluations to review
evaluations = client.get_judge_evaluations(
project_id="your-project-id",
judge_id="your-judge-id",
limit=100,
)

# Submit feedback for each
for eval in evaluations["evaluations"]:
# Your logic to determine if the evaluation was correct
is_correct = your_review_logic(eval)

client.send_feedback(
prompt_slug="your-judge-task-slug",
completion_id=eval["span_id"],
thumbs_up=is_correct,
reason="Automated review",
judge_id="your-judge-id",
)
```
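
`your_review_logic` above stands in for whatever check you use to decide whether the judge's verdict was right. A minimal sketch, assuming you keep human-reviewed verdicts in a dictionary keyed by span ID (the `human_labels` mapping here is hypothetical):

```python
# Hypothetical mapping of span ID -> human verdict (True = the judge was correct)
human_labels = {
    "span-id-1": True,
    "span-id-2": False,
}


def your_review_logic(evaluation: dict) -> bool:
    """Return True if the judge's evaluation matched the human review.

    Evaluations without a human label default to True here; adjust that
    policy to fit your workflow.
    """
    return human_labels.get(evaluation["span_id"], True)
```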

## Related

- [Pulling Evaluations](/judges/pull-evaluations) - Retrieve judge evaluations programmatically
- [Judge Setup](/judges/setup) - Configure and deploy judges