feat(harness): instrument Anthropic streaming usage and tool calls#166

Closed
saschabuehrle wants to merge 43 commits into main from
feat/v2.1-anthropic-python-auto-instrumentation

Conversation

@saschabuehrle
Collaborator

Summary

  • instrument Anthropic messages.create(..., stream=True) responses in harness mode for both sync and async clients
  • track streamed token usage and tool-use blocks, then finalize run metrics at stream completion
  • keep pre-call decision metadata (action/reason/model/applied) in stream traces
  • add stream-focused tests for Anthropic sync and async wrappers

Validation

  • python3 -m pytest -q tests/test_harness_instrument.py
  • python3 -m pytest -q
  • live Anthropic e2e (real API):
    • sync non-stream tracked
    • sync stream with tool-use tracked (tool_calls=1)
    • async stream tracked

Notes

  • this closes the previously documented Anthropic stream passthrough limitation in V2.1 instrumentation
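The passthrough fix above can be sketched as a thin iterator wrapper: events flow through to the caller untouched, while usage and tool-use blocks are tallied on the side and metrics are finalized only once the stream is fully consumed. This is a minimal illustration, not the actual harness code; the class name, the `on_finalize` callback, and the dict-shaped events (simplified stand-ins for Anthropic's `message_start` / `content_block_start` / `message_delta` events) are all assumptions.

```python
class InstrumentedAnthropicStream:
    """Hypothetical sketch: passthrough stream that tracks usage/tool calls."""

    def __init__(self, inner, on_finalize):
        self._inner = inner
        self._on_finalize = on_finalize  # assumed callback, fires at stream end
        self.input_tokens = 0
        self.output_tokens = 0
        self.tool_calls = 0

    def __iter__(self):
        for event in self._inner:
            kind = event.get("type")
            if kind == "message_start":
                # input token count arrives with the opening message event
                self.input_tokens = event["message"]["usage"]["input_tokens"]
            elif kind == "content_block_start":
                if event["content_block"]["type"] == "tool_use":
                    self.tool_calls += 1
            elif kind == "message_delta":
                # output token count is reported cumulatively in deltas
                self.output_tokens = event["usage"]["output_tokens"]
            yield event  # passthrough: the caller sees the original events
        # finalize run metrics only after all chunks were consumed
        self._on_finalize(self)
```

An async variant would mirror this with `__aiter__`/`async for`; the enforce-mode decision metadata mentioned in the summary would be attached in `on_finalize`.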

Replace the instrument.py scaffold with a full implementation that patches
openai.resources.chat.completions.Completions.create (sync) and
AsyncCompletions.create (async) for harness observe/enforce modes.

Key capabilities:
- Class-level patching of sync and async create methods
- Streaming wrappers (_InstrumentedStream, _InstrumentedAsyncStream)
  that capture usage metrics after all chunks are consumed
- Cost estimation from a built-in pricing table
- Energy estimation using deterministic model coefficients
- Tool call counting in both response and streaming chunks
- Budget remaining tracking within scoped runs
- Idempotent patching with clean unpatch/reset path
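The idempotent class-level patching in the list above can be sketched roughly as follows. This is an illustrative shape only, under the assumption that a registry of originals backs the unpatch/reset path; `patch_method`, `make_wrapper`, and `unpatch_all` are hypothetical names, not the module's actual API.

```python
import functools

# registry of original methods, keyed by (class, attr), enabling clean unpatch
_PATCHED = {}

def patch_method(cls, name, make_wrapper):
    """Idempotently replace cls.name with a wrapper around the original."""
    key = (cls, name)
    if key in _PATCHED:
        return  # already patched: a second call is a no-op
    original = getattr(cls, name)
    _PATCHED[key] = original
    setattr(cls, name, functools.wraps(original)(make_wrapper(original)))

def unpatch_all():
    """Restore every patched method to its original implementation."""
    for (cls, name), original in _PATCHED.items():
        setattr(cls, name, original)
    _PATCHED.clear()
```

In the real module, `make_wrapper` would produce the sync/async `create` wrappers that estimate cost and energy and hand streaming responses to the `_InstrumentedStream` classes.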

Context tracking per call:
- cost, step_count, latency_used_ms, energy_used, tool_calls
- budget_remaining auto-updated when budget_max is set
- model_used and decision trace via ctx.record()

Added step_count, latency_used_ms, energy_used fields to
HarnessRunContext in api.py. Hooked patch_openai into init()
and unpatch_openai into reset().
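A rough picture of the context shape implied by the fields above, with `budget_remaining` derived from `budget_max` minus accumulated cost. The class name suffix and the property-based derivation are assumptions for illustration; the real `HarnessRunContext` in api.py may store or update the value differently.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HarnessRunContextSketch:
    """Hypothetical mirror of the per-call tracking fields listed above."""
    budget_max: Optional[float] = None
    cost: float = 0.0
    step_count: int = 0
    latency_used_ms: float = 0.0
    energy_used: float = 0.0
    tool_calls: int = 0

    @property
    def budget_remaining(self) -> Optional[float]:
        # auto-derived only when a budget ceiling was set
        if self.budget_max is None:
            return None
        return max(self.budget_max - self.cost, 0.0)
```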

39 new tests covering: patch lifecycle, sync/async wrappers,
sync/async stream wrappers, cost/energy estimation, nested run
isolation, and edge cases (no usage, no choices, missing chunks).

All 63 harness tests pass (39 instrument + 24 api).
…m usage injection

- init(mode="off") now calls unpatch_openai() if previously patched
- Trace records actual mode (observe/enforce) instead of always "observe"
- Enforce mode raises BudgetExceededError pre-call when budget exhausted
- Auto-inject stream_options.include_usage=True for streaming requests
- Add pytest.importorskip("openai") for graceful skip when not installed
- 10 new tests covering all four fixes (73 total pass)
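Two of the fixes above, the pre-call budget gate and the `stream_options.include_usage` injection, fit naturally in a single pre-call hook. The sketch below assumes a dict-shaped run context and a hypothetical `pre_call` helper; only `BudgetExceededError` and the `stream_options.include_usage` field name come from the source.

```python
class BudgetExceededError(RuntimeError):
    """Raised pre-call in enforce mode when the budget is exhausted."""

def pre_call(kwargs, ctx, mode):
    """Hypothetical pre-call hook: budget gate + usage injection for streams."""
    if mode == "enforce" and ctx.get("budget_remaining", 1.0) <= 0:
        # fail before spending tokens; observe mode never blocks
        raise BudgetExceededError("budget exhausted before call")
    if kwargs.get("stream"):
        # without include_usage, OpenAI streams omit the final usage chunk
        opts = dict(kwargs.get("stream_options") or {})
        opts.setdefault("include_usage", True)  # don't override caller's choice
        kwargs["stream_options"] = opts
    return kwargs
```

Using `setdefault` keeps observe mode a zero-change path for callers who already set `stream_options` explicitly.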
Replaces instrument.py stub with full OpenAI auto-instrumentation:
- Sync/async monkey-patching of Completions.create
- Stream wrappers for usage capture after consumption
- Cost/energy estimation with pricing tables
- Pre-call budget gate in enforce mode (BudgetExceededError)
- Auto-inject stream_options.include_usage for streaming
- init(mode="off") properly unpatches previously patched client
- 49 instrument tests + 24 API tests = 73 total passing
…n' into feature/agent-intelligence-v2-integration
Implements cascadeflow.integrations.crewai module that hooks into
CrewAI's native llm_hooks system (v1.5+) to feed cost, latency,
energy, and step metrics into harness run contexts.

- before_llm_call: budget gate in enforce mode, latency tracking
- after_llm_call: token estimation, cost/energy/step accounting
- enable()/disable() lifecycle with fail_open and budget_gate config
- 37 tests covering hooks, estimation, enable/disable, and edge cases
- Fixed __init__.py import ordering (CREWAI_AVAILABLE before __all__)
- Add crewai extra to pyproject.toml (pip install cascadeflow[crewai])
- Handle dict messages in _extract_message_content (CrewAI passes
  {"role": "...", "content": "..."} not objects with .content attr)
- Move budget gate check before start time recording so blocked calls
  don't leak entries in _call_start_times
- Fix unused imports (field, TYPE_CHECKING, Callable) and import order
- Fix docstring referencing nonexistent cost_model_override
- Replace yield with return in test fixture (PT022)
- Add 7 new tests: dict/object message extraction, blocked call leak
- Document enforce-mode limitations for switch_model and deny_tool
- Replace per-handler _executed_tool_calls with run_ctx.tool_calls
- Fix _extract_candidate_state fallback leaking arbitrary kwargs
- Remove return-in-finally (B012) and fix import ordering
- Separate langgraph from langchain optional extra in pyproject.toml
- Add 4 edge-case tests: no-run-context safety, state extraction
  guard, and run_ctx tool_calls gating
- Use time.monotonic() for duration_ms calculation instead of wall-clock
  delta (avoids NTP/suspend clock jumps)
- Extract sanitize constants (_MAX_ACTION_LEN, _MAX_REASON_LEN, _MAX_MODEL_LEN)
- Log warning when record() receives empty action (was silently defaulting)
- Cache CallbackEvent import in _emit_harness_decision for hot-path perf
- Add tests: no-callback-manager noop, empty-action warning, duration field
Add 5 new benchmark modules and 15 unit tests that enable third-party
reproducibility and automated V2 readiness checks:

- repro.py: environment fingerprint (git SHA, packages, platform)
- baseline.py: save/load baselines, delta comparison, Go/No-Go gates
- harness_overhead.py: decision-path p95 measurement (<5ms gate)
- observe_validation.py: observe-mode zero-change proof (6 cases)
- artifact.py: JSON artifact bundler + REPRODUCE.md generation

Extends run_all.py with --baseline, --harness-mode, --with-repro flags.
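An environment fingerprint of the kind repro.py provides might look like the following. The function name and output keys are assumptions for illustration; the real module also records installed packages per the commit message.

```python
import platform
import subprocess
import sys

def environment_fingerprint():
    """Hypothetical sketch of a minimal reproducibility fingerprint."""
    try:
        git_sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True, stderr=subprocess.DEVNULL
        ).strip()
    except Exception:
        git_sha = None  # not inside a git checkout
    return {
        "git_sha": git_sha,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
```

Such a dict would be embedded in the JSON artifact bundle so third parties can compare baselines under a matching environment.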
@saschabuehrle
Collaborator Author

Superseded by #164 — all commits included in the integration train.
