feat(python): add span-based evaluation for judge agent #188

drewdrewthis · 2025-12-08T14:11:01Z

Port JavaScript trace-based judge evaluation to Python:

- Add _tracing module with JudgeSpanCollector (SpanProcessor)
- Add _judge module with digest formatter and utilities
- Update JudgeAgent to include OpenTelemetry traces in evaluation
- Add span-based evaluation example test
- Add 46 unit tests for new modules

The judge now evaluates both conversation transcript and OpenTelemetry traces, enabling criteria like "HTTP call was made" or "tool was called".

python/examples/test_span_based_evaluation_native_otel.py

python/scenario/_judge/judge_span_digest_formatter.py

python/scenario/_judge/judge_utils.py

python/scenario/judge_agent.py


- Rename test_span_based_evaluation.py to test_span_based_evaluation_native_otel.py
- Update native OTEL test with clearer comments and docstrings
- Add new test using @langwatch.span() decorator and set_attributes() API
- Both tests validate that custom spans are visible to the judge

The span digest is already wrapped in <opentelemetry_traces> XML tags, making the === OPENTELEMETRY TRACES === header redundant.
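To illustrate why the extra header is redundant, here is a minimal sketch of a digest formatter whose output is delimited only by the `<opentelemetry_traces>` tags. `SpanSummary`, `format_digest`, the line layout, and the empty-case message are hypothetical stand-ins, not the actual contents of judge_span_digest_formatter.py.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SpanSummary:
    """Hypothetical, simplified stand-in for a collected span."""

    name: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def format_digest(spans: list[SpanSummary]) -> str:
    """Render spans as plain text wrapped in <opentelemetry_traces> tags.

    The tags already mark the section for the judge prompt, so no
    separate banner line is emitted.
    """
    if not spans:
        # Assumed wording for the empty case.
        body = "No spans recorded."
    else:
        lines = []
        for index, span in enumerate(spans, start=1):
            lines.append(f"[{index}] {span.name} ({span.duration_ms:.0f}ms)")
            # Sort attributes so the digest is stable across runs.
            for key, value in sorted(span.attributes.items()):
                lines.append(f"    {key}: {value}")
        body = "\n".join(lines)
    return f"<opentelemetry_traces>\n{body}\n</opentelemetry_traces>"
```

A judge prompt built this way can be checked for criteria such as "HTTP call was made" by looking for the relevant span name or attribute inside the tagged block.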

- Add example showing context window overflow behavior
- Add test verifying ScenarioRunFinishedEvent emits with ERROR status on exceptions

- Python: wrap errors with RuntimeError containing agent name, chain via `from e`
- TypeScript: wrap errors with Error containing agent name, chain via `cause`
- Add test verifying agent name appears in error messages
- Update context window example to verify JudgeAgent identification
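The Python side of this wrapping can be sketched as follows. `call_agent_with_context`, `FailingAgent`, and the message format are assumptions made for illustration; only the use of `RuntimeError`, the agent name in the message, and chaining via `from e` come from the commit message.

```python
def call_agent_with_context(agent, *args, **kwargs):
    """Call agent.call(...) and re-raise failures tagged with the agent's class name."""
    try:
        return agent.call(*args, **kwargs)
    except Exception as e:
        agent_name = type(agent).__name__
        # `from e` keeps the original exception reachable as __cause__,
        # so the full traceback chain is preserved.
        raise RuntimeError(f"[{agent_name}] {e}") from e


class FailingAgent:
    """Hypothetical agent that always fails, used only to show the wrapping."""

    def call(self, message):
        raise ValueError("context window exceeded")
```

Because the original exception stays attached as `__cause__`, a test can assert on both the agent name in the message and the type of the underlying error.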

- Add logging.py config module with SCENARIO_LOG_LEVEL support
- Import logging config on module init for side-effect setup
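A minimal sketch of environment-driven log configuration, assuming the variable holds a standard level name. `configure_logging`, the `scenario` logger name, and the fallback behavior for unknown values are hypothetical; only the SCENARIO_LOG_LEVEL variable name comes from the commit message.

```python
import logging
import os

# Standard level names accepted by this sketch.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def configure_logging(env=os.environ, logger_name="scenario"):
    """Set the named logger's level from SCENARIO_LOG_LEVEL, if it is valid.

    Unset or unrecognized values leave the logger's level untouched
    (an assumption; the real module may behave differently).
    """
    logger = logging.getLogger(logger_name)
    raw = env.get("SCENARIO_LOG_LEVEL", "").strip().upper()
    level = _LEVELS.get(raw)
    if level is not None:
        logger.setLevel(level)
    return logger
```

Calling such a function once at the bottom of the config module would give the import-for-side-effect behavior the second bullet describes: importing the module from the package's `__init__` is enough to apply the level.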

- Add specs/context-window-exceeded.feature with @integration scenarios
- Move example to tests/test_context_window_exceeded_integration.py
- Split into two tests matching feature scenarios
- Delete examples/test_context_window_exceeded.py (error handling != happy path)

Defines invariants for Python span-based judge evaluation:

- E2E: LangWatch decorators and native OTel API
- Unit: span collector, digest formatter, media truncation, deduplication

- Replace real UserSimulatorAgent/JudgeAgent with mocks using correct roles
- Update span formatter test to match implementation (no header, just message)

Tests asserted on "JudgeAgent" but used the MockJudgeAgent class.

Add proper type guards for Unset and None on event results.

- scripts/ci-wait.sh polls workflow status until complete
- Updated cursor command to reference the script

- Allow multiple turns for agent to ask clarifying questions
- Add flaky(reruns=2) for LLM behavior variance

python/examples/test_weather_agent.py

Set thread_id once on parent span; child spans found via parent traversal in JudgeSpanCollector. Removes redundant per-span thread_id assignments and unused instance variable.

Executor now sets langwatch.thread.id attribute on agent call spans, matching JS implementation. Child spans are found via parent traversal without needing manual thread_id assignment in user code.

- Add span.set_attributes() call in _call_agent method
- Remove redundant manual thread_id from span examples
- Update example docstrings to reflect automatic inheritance
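The parent-traversal idea can be sketched with plain data classes. `FakeSpan`, `resolve_thread_id`, and `spans_for_thread` are hypothetical names, and the real JudgeSpanCollector works on OpenTelemetry spans rather than this simplified stand-in; only the `langwatch.thread.id` attribute name is taken from the commit message.

```python
from dataclasses import dataclass, field
from typing import Optional

THREAD_ID_KEY = "langwatch.thread.id"  # attribute name from the commit message


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a finished span."""

    span_id: str
    parent_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)


def resolve_thread_id(span, spans_by_id):
    """Walk up the parent chain until a span carrying the thread id is found."""
    seen = set()  # guards against accidental cycles in malformed data
    current = span
    while current is not None and current.span_id not in seen:
        seen.add(current.span_id)
        thread_id = current.attributes.get(THREAD_ID_KEY)
        if thread_id is not None:
            return thread_id
        # Move to the parent; a missing parent ends the walk.
        current = spans_by_id.get(current.parent_id) if current.parent_id else None
    return None


def spans_for_thread(spans, thread_id):
    """Return every span whose own or inherited thread id matches."""
    by_id = {s.span_id: s for s in spans}
    return [s for s in spans if resolve_thread_id(s, by_id) == thread_id]
```

With this approach only the agent call span needs the attribute; spans created inside user code inherit it implicitly, which is why the manual assignments in the examples could be removed.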

- Add @pytest.mark.flaky(reruns=2) to test_user_is_hungry
- Relax JS realtime test criteria to accept broader explanations

drewdrewthis self-assigned this Dec 8, 2025

drewdrewthis added this to LangWatch Kanban Dec 8, 2025

github-project-automation bot moved this to Backlog in LangWatch Kanban Dec 8, 2025

drewdrewthis linked an issue Dec 8, 2025 that may be closed by this pull request

Use opentelemetry traces as part of judgment also for Python #186

Closed

drewdrewthis marked this pull request as ready for review December 8, 2025 17:27

drewdrewthis requested review from 0xdeafcafe, Aryansharma28, llmsommelier, richhuth and rogeriochaves December 8, 2025 17:27

drewdrewthis moved this from Backlog to In review in LangWatch Kanban Dec 8, 2025