feat(python): add span-based evaluation for judge agent #188
Merged
drewdrewthis merged 28 commits into main from issue186/use-opentelemetry-traces-as-part-of-judgment-also-for-python on Dec 12, 2025
llmsommelier reviewed on Dec 9, 2025 (7 reviews)
9d5c224 to b969830
Port JavaScript trace-based judge evaluation to Python:
- Add _tracing module with JudgeSpanCollector (SpanProcessor)
- Add _judge module with digest formatter and utilities
- Update JudgeAgent to include OpenTelemetry traces in evaluation
- Add span-based evaluation example test
- Add 46 unit tests for new modules

The judge now evaluates both conversation transcript and OpenTelemetry traces, enabling criteria like "HTTP call was made" or "tool was called".
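As a rough illustration of the collector idea, here is a minimal stdlib-only sketch. `SketchSpan` and `SketchSpanCollector` are hypothetical stand-ins, not the `JudgeSpanCollector` from this PR; only the method names (`on_start`, `on_end`, `shutdown`, `force_flush`) are borrowed from the OpenTelemetry `SpanProcessor` interface, and the bodies are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SketchSpan:
    """Hypothetical stand-in for a finished OpenTelemetry span."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


class SketchSpanCollector:
    """Keeps ended spans in memory so a judge can read them later.

    Assumption: the real collector may store, filter, or index spans
    differently. This only shows the shape of the idea.
    """

    def __init__(self) -> None:
        self.spans: List[SketchSpan] = []

    def on_start(self, span: SketchSpan, parent_context: Optional[object] = None) -> None:
        pass  # nothing to record until the span has finished

    def on_end(self, span: SketchSpan) -> None:
        self.spans.append(span)

    def shutdown(self) -> None:
        self.spans.clear()

    def force_flush(self, timeout_millis: int = 30000) -> bool:
        return True  # spans live in memory, so there is nothing to flush

    def names(self) -> List[str]:
        return [span.name for span in self.spans]


collector = SketchSpanCollector()
collector.on_end(SketchSpan("HTTP GET /weather", {"http.method": "GET"}))
collector.on_end(SketchSpan("tool call", {"tool.name": "get_weather"}))
```

With spans collected this way, a criterion such as "HTTP call was made" can be judged against recorded spans rather than against the transcript alone.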
- Rename test_span_based_evaluation.py to test_span_based_evaluation_native_otel.py
- Update native OTEL test with clearer comments and docstrings
- Add new test using @langwatch.span() decorator and set_attributes() API
- Both tests validate that custom spans are visible to the judge
The span digest is already wrapped in <opentelemetry_traces> XML tags, making the === OPENTELEMETRY TRACES === header redundant.
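A minimal sketch of what "wrapped in XML tags, no extra header" could look like. `format_digest` and the per-span line format are hypothetical; only the `<opentelemetry_traces>` tag name comes from the commit message.

```python
from typing import Iterable, Tuple


def format_digest(spans: Iterable[Tuple[str, int]]) -> str:
    """Render (name, duration_ms) pairs inside <opentelemetry_traces> tags.

    Assumption: the real digest carries more detail per span. The point
    here is that the tags already delimit the section, so no separate
    "=== ... ===" header line is emitted.
    """
    lines = [f"{name} ({duration_ms}ms)" for name, duration_ms in spans]
    body = "\n".join(lines) if lines else "No spans recorded."
    return f"<opentelemetry_traces>\n{body}\n</opentelemetry_traces>"


digest = format_digest([("agent.call", 120), ("http.get", 45)])
```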
- Add example showing context window overflow behavior
- Add test verifying ScenarioRunFinishedEvent emits with ERROR status on exceptions
- Python: wrap errors with RuntimeError containing agent name, chain via `from e`
- TypeScript: wrap errors with Error containing agent name, chain via `cause`
- Add test verifying agent name appears in error messages
- Update context window example to verify JudgeAgent identification
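The Python side of this wrapping can be sketched as follows. `FailingAgent` and `call_agent` are hypothetical names, and the message format is an assumption; the technique shown (raise `RuntimeError` naming the agent, chained with `from e`) is the one the commit describes.

```python
class FailingAgent:
    """Hypothetical agent that always fails, to exercise the wrapper."""

    def call(self, message: str) -> str:
        raise ValueError("context window exceeded")


def call_agent(agent, message: str) -> str:
    """Call an agent and re-raise any failure with the agent's name attached."""
    try:
        return agent.call(message)
    except Exception as e:
        agent_name = type(agent).__name__
        # `from e` keeps the original exception reachable as __cause__,
        # so the underlying traceback is not lost.
        raise RuntimeError(f"[{agent_name}] {e}") from e
```

Note that the name comes from the agent's class here, which is consistent with the later fix where tests asserted "JudgeAgent" while using a `MockJudgeAgent` class.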
- Add logging.py config module with SCENARIO_LOG_LEVEL support
- Import logging config on module init for side-effect setup
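One way such a config module might read the variable is sketched below. The function name `configure_logging`, the `"scenario"` logger name, and the WARNING fallback are all assumptions; only the `SCENARIO_LOG_LEVEL` variable name comes from the commit.

```python
import logging
import os
from typing import Mapping, Optional


def configure_logging(env: Optional[Mapping[str, str]] = None) -> int:
    """Set the package logger level from SCENARIO_LOG_LEVEL and return it.

    Assumption: unknown or missing values fall back to WARNING.
    """
    env = os.environ if env is None else env
    level_name = env.get("SCENARIO_LOG_LEVEL", "").strip().upper()
    # getLevelName maps a known level name to its integer value and
    # returns a "Level X" string for anything it does not recognise.
    level = logging.getLevelName(level_name) if level_name else logging.WARNING
    if not isinstance(level, int):
        level = logging.WARNING
    logging.getLogger("scenario").setLevel(level)
    return level
```

Calling a function like this at import time is what "side-effect setup" would amount to: importing the package is enough to apply the level.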
6751319 to 9bc8472
- Add specs/context-window-exceeded.feature with @integration scenarios
- Move example to tests/test_context_window_exceeded_integration.py
- Split into two tests matching feature scenarios
- Delete examples/test_context_window_exceeded.py (error handling != happy path)
Defines invariants for Python span-based judge evaluation:
- E2E: LangWatch decorators and native OTel API
- Unit: span collector, digest formatter, media truncation, deduplication
- Replace real UserSimulatorAgent/JudgeAgent with mocks using correct roles
- Update span formatter test to match implementation (no header, just message)
Tests were asserting on "JudgeAgent" while using the MockJudgeAgent class.
Add proper type guards for Unset and None on event results.
72c34f2 to fe420fa
- scripts/ci-wait.sh polls workflow status until complete
- Updated cursor command to reference the script
- Allow multiple turns for agent to ask clarifying questions
- Add flaky(reruns=2) for LLM behavior variance
Set thread_id once on parent span; child spans found via parent traversal in JudgeSpanCollector. Removes redundant per-span thread_id assignments and unused instance variable.
Executor now sets langwatch.thread.id attribute on agent call spans, matching JS implementation. Child spans are found via parent traversal without needing manual thread_id assignment in user code.
- Add span.set_attributes() call in _call_agent method
- Remove redundant manual thread_id from span examples
- Update example docstrings to reflect automatic inheritance
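The parent-traversal lookup can be pictured with a small stdlib-only sketch. `SketchSpan`, `resolve_thread_id`, and `spans_for_thread` are hypothetical and do not reflect the actual `JudgeSpanCollector` code; only the `langwatch.thread.id` attribute name comes from the commit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

THREAD_ATTR = "langwatch.thread.id"


@dataclass
class SketchSpan:
    """Hypothetical stand-in for a span: an id, a parent id, attributes."""
    span_id: str
    parent_id: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)


def resolve_thread_id(span: SketchSpan, by_id: Dict[str, SketchSpan]) -> Optional[str]:
    """Walk up the parent chain until a span carrying the thread id is found."""
    current: Optional[SketchSpan] = span
    seen = set()  # guards against a malformed parent cycle
    while current is not None and current.span_id not in seen:
        seen.add(current.span_id)
        thread_id = current.attributes.get(THREAD_ATTR)
        if thread_id is not None:
            return thread_id
        current = by_id.get(current.parent_id) if current.parent_id else None
    return None


def spans_for_thread(spans: List[SketchSpan], thread_id: str) -> List[SketchSpan]:
    """Return every span whose own or inherited thread id matches."""
    by_id = {s.span_id: s for s in spans}
    return [s for s in spans if resolve_thread_id(s, by_id) == thread_id]


spans = [
    SketchSpan("agent", attributes={THREAD_ATTR: "t1"}),  # set once by the executor
    SketchSpan("http", parent_id="agent"),                # inherits via parent
    SketchSpan("db", parent_id="http"),                   # inherits via grandparent
    SketchSpan("other"),                                  # unrelated root span
]
matched = [s.span_id for s in spans_for_thread(spans, "t1")]
```

Because only the agent call span carries the attribute, user code that opens child spans does not need to set a thread id itself.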
- Add @pytest.mark.flaky(reruns=2) to test_user_is_hungry
- Relax JS realtime test criteria to accept broader explanations
rogeriochaves approved these changes on Dec 12, 2025
This was referenced Dec 12, 2025