@drewdrewthis
Collaborator

Port JavaScript trace-based judge evaluation to Python:

  • Add _tracing module with JudgeSpanCollector (SpanProcessor)
  • Add _judge module with digest formatter and utilities
  • Update JudgeAgent to include OpenTelemetry traces in evaluation
  • Add span-based evaluation example test
  • Add 46 unit tests for new modules

The judge now evaluates both conversation transcript and OpenTelemetry traces, enabling criteria like "HTTP call was made" or "tool was called".

@drewdrewthis drewdrewthis self-assigned this Dec 8, 2025
@drewdrewthis drewdrewthis linked an issue Dec 8, 2025 that may be closed by this pull request
@drewdrewthis drewdrewthis marked this pull request as ready for review December 8, 2025 17:27
@drewdrewthis drewdrewthis moved this from Backlog to In review in LangWatch Kanban Dec 8, 2025
@drewdrewthis drewdrewthis force-pushed the issue186/use-opentelemetry-traces-as-part-of-judgment-also-for-python branch from 9d5c224 to b969830 Compare December 11, 2025 14:51
- Rename test_span_based_evaluation.py to test_span_based_evaluation_native_otel.py
- Update native OTEL test with clearer comments and docstrings
- Add new test using @langwatch.span() decorator and set_attributes() API
- Both tests validate that custom spans are visible to the judge
The span digest is already wrapped in <opentelemetry_traces> XML tags,
making the === OPENTELEMETRY TRACES === header redundant.
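A minimal sketch of what a digest wrapped this way could look like. The function name `format_span_digest`, the dict-based span shape, and the empty-case message are all assumptions for illustration, not the PR's actual `_judge` module API.

```python
from xml.sax.saxutils import escape


def format_span_digest(spans):
    """Render spans as indented lines wrapped in <opentelemetry_traces> tags.

    Hypothetical helper: `spans` is a list of dicts with `name`,
    `duration_ms`, and an optional `depth` used for indentation.
    """
    if not spans:
        # Assumed placeholder text; no extra header is emitted either way.
        body = "No spans recorded."
    else:
        lines = []
        for span in spans:
            indent = "  " * span.get("depth", 0)
            lines.append(f"{indent}{escape(span['name'])} ({span['duration_ms']}ms)")
        body = "\n".join(lines)
    # The XML tags already delimit the section, so no "===" banner is needed.
    return f"<opentelemetry_traces>\n{body}\n</opentelemetry_traces>"
```

Because the tags mark where the trace section starts and ends, a separate banner line adds nothing for the judge prompt.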
- Add example showing context window overflow behavior
- Add test verifying ScenarioRunFinishedEvent emits with ERROR status on exceptions
- Python: wrap errors with RuntimeError containing agent name, chain via `from e`
- TypeScript: wrap errors with Error containing agent name, chain via `cause`
- Add test verifying agent name appears in error messages
- Update context window example to verify JudgeAgent identification
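The Python side of the error wrapping described above might look roughly like this. The `call_agent` helper, the `FailingAgent` class, and the use of the class name as the "agent name" are assumptions for illustration; only the `RuntimeError` wrapper and `from e` chaining come from the commit message.

```python
class FailingAgent:
    """Hypothetical agent that always fails, standing in for a real one."""

    def call(self, message):
        raise ValueError("context window exceeded")


def call_agent(agent, message):
    """Call an agent and re-raise any failure with the agent's name attached."""
    try:
        return agent.call(message)
    except Exception as e:
        # Assumption: the agent is identified by its class name.
        # `from e` keeps the original exception reachable via __cause__.
        raise RuntimeError(f"[{type(agent).__name__}] {e}") from e
```

Chaining with `from e` preserves the original traceback, so the added agent name helps locate the failure without hiding its root cause.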
- Add logging.py config module with SCENARIO_LOG_LEVEL support
- Import logging config on module init for side-effect setup
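An environment-driven logging setup of this kind could be sketched as follows. Only the `SCENARIO_LOG_LEVEL` variable name comes from the commit message; the `configure_logging` function, the `"scenario"` logger name, and the handling of unknown values are assumptions.

```python
import logging
import os


def configure_logging(env=None):
    """Set the package logger's level from SCENARIO_LOG_LEVEL, if valid.

    `env` defaults to os.environ; passing a dict makes this easy to test.
    """
    env = os.environ if env is None else env
    level_name = env.get("SCENARIO_LOG_LEVEL", "").upper()
    logger = logging.getLogger("scenario")  # assumed logger name
    level = getattr(logging, level_name, None)
    # Only accept real numeric levels such as logging.DEBUG or logging.INFO.
    if isinstance(level, int):
        logger.setLevel(level)
    return logger
```

In a real module, calling such a function at import time would give the side-effect setup that the second bullet describes.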
@drewdrewthis drewdrewthis force-pushed the issue186/use-opentelemetry-traces-as-part-of-judgment-also-for-python branch from 6751319 to 9bc8472 Compare December 11, 2025 20:26
- Add specs/context-window-exceeded.feature with @integration scenarios
- Move example to tests/test_context_window_exceeded_integration.py
- Split into two tests matching feature scenarios
- Delete examples/test_context_window_exceeded.py (error handling != happy path)
Defines invariants for Python span-based judge evaluation:
- E2E: LangWatch decorators and native OTel API
- Unit: span collector, digest formatter, media truncation, deduplication
- Replace real UserSimulatorAgent/JudgeAgent with mocks using correct roles
- Update span formatter test to match implementation (no header, just message)
Tests asserted on the name "JudgeAgent" but used the MockJudgeAgent class.
Add proper type guards for Unset and None on event results.
@drewdrewthis drewdrewthis force-pushed the issue186/use-opentelemetry-traces-as-part-of-judgment-also-for-python branch from 72c34f2 to fe420fa Compare December 11, 2025 21:37
- scripts/ci-wait.sh polls workflow status until complete
- Updated cursor command to reference the script
- Allow multiple turns for agent to ask clarifying questions
- Add flaky(reruns=2) for LLM behavior variance
rogeriochaves and others added 6 commits December 12, 2025 11:56
Set thread_id once on the parent span; JudgeSpanCollector finds child spans
via parent traversal. This removes the redundant per-span thread_id
assignments and an unused instance variable.
Executor now sets langwatch.thread.id attribute on agent call spans,
matching JS implementation. Child spans are found via parent traversal
without needing manual thread_id assignment in user code.

- Add span.set_attributes() call in _call_agent method
- Remove redundant manual thread_id from span examples
- Update example docstrings to reflect automatic inheritance
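The parent-traversal lookup can be illustrated with a small stand-alone sketch. It uses plain dataclasses rather than the OpenTelemetry SDK, and the `Span` and `SpanCollector` types and their methods are hypothetical stand-ins for the real `JudgeSpanCollector`; only the `langwatch.thread.id` attribute name comes from the commit message.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

THREAD_ATTR = "langwatch.thread.id"


@dataclass
class Span:
    """Simplified span: just IDs, a name, and attributes."""
    span_id: str
    name: str
    parent_id: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)


class SpanCollector:
    """Stores finished spans and resolves thread IDs via ancestors."""

    def __init__(self) -> None:
        self._spans: Dict[str, Span] = {}

    def on_end(self, span: Span) -> None:
        self._spans[span.span_id] = span

    def _thread_id_of(self, span: Span) -> Optional[str]:
        seen = set()  # guards against accidental parent cycles
        current: Optional[Span] = span
        while current is not None and current.span_id not in seen:
            seen.add(current.span_id)
            thread_id = current.attributes.get(THREAD_ATTR)
            if thread_id is not None:
                return thread_id
            # Walk up to the parent, if it was collected.
            current = self._spans.get(current.parent_id) if current.parent_id else None
        return None

    def spans_for_thread(self, thread_id: str) -> List[Span]:
        return [s for s in self._spans.values() if self._thread_id_of(s) == thread_id]
```

With this shape, only the agent call span needs the attribute; spans created inside the agent inherit the thread through their parent chain, so user code does not have to set it.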
- Add @pytest.mark.flaky(reruns=2) to test_user_is_hungry
- Relax JS realtime test criteria to accept broader explanations
@drewdrewthis drewdrewthis merged commit 59352ec into main Dec 12, 2025
4 checks passed
@drewdrewthis drewdrewthis deleted the issue186/use-opentelemetry-traces-as-part-of-judgment-also-for-python branch December 12, 2025 15:04
@github-project-automation github-project-automation bot moved this from In review to Done in LangWatch Kanban Dec 12, 2025
Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Use opentelemetry traces as part of judgment also for Python

4 participants