feat(harness): instrument Anthropic streaming usage and tool calls#166

Closed
saschabuehrle wants to merge 43 commits into main from
feat/v2.1-anthropic-python-auto-instrumentation

Conversation

@saschabuehrle
Collaborator

Summary

  • instrument Anthropic messages.create(..., stream=True) responses in harness mode for both sync and async clients
  • track streamed token usage and tool-use blocks, then finalize run metrics at stream completion
  • keep pre-call decision metadata (action/reason/model/applied) in stream traces
  • add stream-focused tests for Anthropic sync and async wrappers

Validation

  • python3 -m pytest -q tests/test_harness_instrument.py
  • python3 -m pytest -q
  • live Anthropic e2e (real API):
    • sync non-stream tracked
    • sync stream with tool-use tracked (tool_calls=1)
    • async stream tracked

Notes

  • this closes the previously documented Anthropic stream passthrough limitation in V2.1 instrumentation
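The passthrough fix above can be sketched as a thin iterator wrapper: events flow through to the caller untouched, while usage and tool-use blocks are tallied on the side and metrics are finalized only once the stream is fully consumed. This is a minimal illustration, not the actual harness code; the class name, the `on_finalize` callback, and the dict-shaped events (simplified stand-ins for Anthropic's `message_start` / `content_block_start` / `message_delta` events) are all assumptions.

```python
class InstrumentedAnthropicStream:
    """Hypothetical sketch: passthrough stream that tracks usage/tool calls."""

    def __init__(self, inner, on_finalize):
        self._inner = inner
        self._on_finalize = on_finalize  # assumed callback, fires at stream end
        self.input_tokens = 0
        self.output_tokens = 0
        self.tool_calls = 0

    def __iter__(self):
        for event in self._inner:
            kind = event.get("type")
            if kind == "message_start":
                # input token count arrives with the opening message event
                self.input_tokens = event["message"]["usage"]["input_tokens"]
            elif kind == "content_block_start":
                if event["content_block"]["type"] == "tool_use":
                    self.tool_calls += 1
            elif kind == "message_delta":
                # output token count is reported cumulatively in deltas
                self.output_tokens = event["usage"]["output_tokens"]
            yield event  # passthrough: the caller sees the original events
        # finalize run metrics only after all chunks were consumed
        self._on_finalize(self)
```

An async variant would mirror this with `__aiter__`/`async for`; the enforce-mode decision metadata mentioned in the summary would be attached in `on_finalize`.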

Replace the instrument.py scaffold with a full implementation that patches
openai.resources.chat.completions.Completions.create (sync) and
AsyncCompletions.create (async) for harness observe/enforce modes.

Key capabilities:
- Class-level patching of sync and async create methods
- Streaming wrappers (_InstrumentedStream, _InstrumentedAsyncStream)
  that capture usage metrics after all chunks are consumed
- Cost estimation from a built-in pricing table
- Energy estimation using deterministic model coefficients
- Tool call counting in both response and streaming chunks
- Budget remaining tracking within scoped runs
- Idempotent patching with clean unpatch/reset path
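The idempotent class-level patching in the list above can be sketched roughly as follows. This is an illustrative shape only, under the assumption that a registry of originals backs the unpatch/reset path; `patch_method`, `make_wrapper`, and `unpatch_all` are hypothetical names, not the module's actual API.

```python
import functools

# registry of original methods, keyed by (class, attr), enabling clean unpatch
_PATCHED = {}

def patch_method(cls, name, make_wrapper):
    """Idempotently replace cls.name with a wrapper around the original."""
    key = (cls, name)
    if key in _PATCHED:
        return  # already patched: a second call is a no-op
    original = getattr(cls, name)
    _PATCHED[key] = original
    setattr(cls, name, functools.wraps(original)(make_wrapper(original)))

def unpatch_all():
    """Restore every patched method to its original implementation."""
    for (cls, name), original in _PATCHED.items():
        setattr(cls, name, original)
    _PATCHED.clear()
```

In the real module, `make_wrapper` would produce the sync/async `create` wrappers that estimate cost and energy and hand streaming responses to the `_InstrumentedStream` classes.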

Context tracking per call:
- cost, step_count, latency_used_ms, energy_used, tool_calls
- budget_remaining auto-updated when budget_max is set
- model_used and decision trace via ctx.record()

Added step_count, latency_used_ms, energy_used fields to
HarnessRunContext in api.py. Hooked patch_openai into init()
and unpatch_openai into reset().
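A rough picture of the context shape implied by the fields above, with `budget_remaining` derived from `budget_max` minus accumulated cost. The class name suffix and the property-based derivation are assumptions for illustration; the real `HarnessRunContext` in api.py may store or update the value differently.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HarnessRunContextSketch:
    """Hypothetical mirror of the per-call tracking fields listed above."""
    budget_max: Optional[float] = None
    cost: float = 0.0
    step_count: int = 0
    latency_used_ms: float = 0.0
    energy_used: float = 0.0
    tool_calls: int = 0

    @property
    def budget_remaining(self) -> Optional[float]:
        # auto-derived only when a budget ceiling was set
        if self.budget_max is None:
            return None
        return max(self.budget_max - self.cost, 0.0)
```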

39 new tests covering: patch lifecycle, sync/async wrappers,
sync/async stream wrappers, cost/energy estimation, nested run
isolation, and edge cases (no usage, no choices, missing chunks).

All 63 harness tests pass (39 instrument + 24 api).
…m usage injection

- init(mode="off") now calls unpatch_openai() if previously patched
- Trace records actual mode (observe/enforce) instead of always "observe"
- Enforce mode raises BudgetExceededError pre-call when budget exhausted
- Auto-inject stream_options.include_usage=True for streaming requests
- Add pytest.importorskip("openai") for graceful skip when not installed
- 10 new tests covering all four fixes (73 total pass)
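Two of the fixes above, the pre-call budget gate and the `stream_options.include_usage` injection, fit naturally in a single pre-call hook. The sketch below assumes a dict-shaped run context and a hypothetical `pre_call` helper; only `BudgetExceededError` and the `stream_options.include_usage` field name come from the source.

```python
class BudgetExceededError(RuntimeError):
    """Raised pre-call in enforce mode when the budget is exhausted."""

def pre_call(kwargs, ctx, mode):
    """Hypothetical pre-call hook: budget gate + usage injection for streams."""
    if mode == "enforce" and ctx.get("budget_remaining", 1.0) <= 0:
        # fail before spending tokens; observe mode never blocks
        raise BudgetExceededError("budget exhausted before call")
    if kwargs.get("stream"):
        # without include_usage, OpenAI streams omit the final usage chunk
        opts = dict(kwargs.get("stream_options") or {})
        opts.setdefault("include_usage", True)  # don't override caller's choice
        kwargs["stream_options"] = opts
    return kwargs
```

Using `setdefault` keeps observe mode a zero-change path for callers who already set `stream_options` explicitly.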
Replaces instrument.py stub with full OpenAI auto-instrumentation:
- Sync/async monkey-patching of Completions.create
- Stream wrappers for usage capture after consumption
- Cost/energy estimation with pricing tables
- Pre-call budget gate in enforce mode (BudgetExceededError)
- Auto-inject stream_options.include_usage for streaming
- init(mode="off") properly unpatches previously patched client
- 49 instrument tests + 24 API tests = 73 total passing
…n' into feature/agent-intelligence-v2-integration
Implements cascadeflow.integrations.crewai module that hooks into
CrewAI's native llm_hooks system (v1.5+) to feed cost, latency,
energy, and step metrics into harness run contexts.

- before_llm_call: budget gate in enforce mode, latency tracking
- after_llm_call: token estimation, cost/energy/step accounting
- enable()/disable() lifecycle with fail_open and budget_gate config
- 37 tests covering hooks, estimation, enable/disable, and edge cases
- Fixed __init__.py import ordering (CREWAI_AVAILABLE before __all__)
- Add crewai extra to pyproject.toml (pip install cascadeflow[crewai])
- Handle dict messages in _extract_message_content (CrewAI passes
  {"role": "...", "content": "..."} not objects with .content attr)
- Move budget gate check before start time recording so blocked calls
  don't leak entries in _call_start_times
- Fix unused imports (field, TYPE_CHECKING, Callable) and import order
- Fix docstring referencing nonexistent cost_model_override
- Replace yield with return in test fixture (PT022)
- Add 7 new tests: dict/object message extraction, blocked call leak
- Document enforce-mode limitations for switch_model and deny_tool
- Replace per-handler _executed_tool_calls with run_ctx.tool_calls
- Fix _extract_candidate_state fallback leaking arbitrary kwargs
- Remove return-in-finally (B012) and fix import ordering
- Separate langgraph from langchain optional extra in pyproject.toml
- Add 4 edge-case tests: no-run-context safety, state extraction
  guard, and run_ctx tool_calls gating
- Use time.monotonic() for duration_ms calculation instead of wall-clock
  delta (avoids NTP/suspend clock jumps)
- Extract sanitize constants (_MAX_ACTION_LEN, _MAX_REASON_LEN, _MAX_MODEL_LEN)
- Log warning when record() receives empty action (was silently defaulting)
- Cache CallbackEvent import in _emit_harness_decision for hot-path perf
- Add tests: no-callback-manager noop, empty-action warning, duration field
Add 5 new benchmark modules and 15 unit tests that enable third-party
reproducibility and automated V2 readiness checks:

- repro.py: environment fingerprint (git SHA, packages, platform)
- baseline.py: save/load baselines, delta comparison, Go/No-Go gates
- harness_overhead.py: decision-path p95 measurement (<5ms gate)
- observe_validation.py: observe-mode zero-change proof (6 cases)
- artifact.py: JSON artifact bundler + REPRODUCE.md generation

Extends run_all.py with --baseline, --harness-mode, --with-repro flags.
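An environment fingerprint of the kind repro.py provides might look like the following. The function name and output keys are assumptions for illustration; the real module also records installed packages per the commit message.

```python
import platform
import subprocess
import sys

def environment_fingerprint():
    """Hypothetical sketch of a minimal reproducibility fingerprint."""
    try:
        git_sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True, stderr=subprocess.DEVNULL
        ).strip()
    except Exception:
        git_sha = None  # not inside a git checkout
    return {
        "git_sha": git_sha,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
```

Such a dict would be embedded in the JSON artifact bundle so third parties can compare baselines under a matching environment.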
@saschabuehrle
Collaborator Author

Superseded by #164 — all commits included in the integration train.
