
feat(evaluator): parity sync, Claude SDK adapter, kiro per-stage gates, shared simulator #235

Open

harmjeff wants to merge 37 commits into awslabs:main from harmjeff:fix/evaluator-update

Conversation

@harmjeff
Contributor

@harmjeff harmjeff commented Apr 30, 2026

Summary

Brings scripts/aidlc-evaluator to parity with the internal GitLab baseline and adds significant new CLI evaluation capabilities: a Claude SDK adapter with mid-workflow simulator handoffs, kiro-cli per-stage reviewer gates, a shared HumanSimulator injected by the orchestrator, and a plugin adapter registration system. All changes are scoped to scripts/aidlc-evaluator/.

Changes

Parity sync (GitLab → GitHub)

  • run.py: add git-compare / git-compare-report modes; Docker/Podman/Finch preflight check
  • scripts/run_evaluation.py: restore --rules-repo flag; deterministic run folder pre-allocation (eliminates sentinel file races, enables parallel execution)
  • Add scripts/run_git_compare.py, regenerate_git_compare_report.py, generate_html_report.py

Claude SDK adapter (--cli claude-code-sdk)

  • New ClaudeCodeSDKAdapter drives the AIDLC executor via anthropic.AnthropicBedrock turn-by-turn
  • Intercepts handoff_to_simulator tool calls to inject human-analog reviews mid-workflow
  • Reuses EXECUTOR_SYSTEM_PROMPT from packages/execution — no prompt duplication
  • Post-run tests wired in; separate executor/simulator token buckets in metrics
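A minimal sketch of that turn loop, assuming a message/tool-use shape like the Anthropic SDK's; `call_model` and `simulator_respond` are hypothetical stand-ins for the Bedrock client call and `HumanSimulator.respond()`:

```python
def run_executor_loop(call_model, simulator_respond, max_turns=50):
    """Drive the executor turn by turn, intercepting simulator handoffs."""
    messages = [{"role": "user", "content": "Begin the AIDLC workflow."}]
    for _ in range(max_turns):
        reply = call_model(messages)  # one assistant turn
        messages.append({"role": "assistant", "content": reply["content"]})
        tool_calls = [b for b in reply["content"] if b.get("type") == "tool_use"]
        if not tool_calls:
            return messages  # no more tool calls: executor is done
        results = []
        for call in tool_calls:
            if call["name"] == "handoff_to_simulator":
                # Inject the human-analog review instead of running a tool.
                output = simulator_respond(call["input"]["message"])
            else:
                output = f"(executed {call['name']})"
            results.append({"type": "tool_result",
                            "tool_use_id": call["id"], "content": output})
        messages.append({"role": "user", "content": results})
    return messages
```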

Kiro-cli per-stage simulator gates

  • 4 explicit stage gates: Requirements → Design → Code-gen plan → Construction
  • Uses kiro's native --no-interactive + --resume to inject simulator feedback between stages
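The gate sequence can be sketched as a simple loop; `run_kiro_stage` and `review_stage` are illustrative stand-ins for the kiro subprocess invocation and the simulator review, not the adapter's real API:

```python
STAGES = ["requirements", "design", "codegen-plan", "construction"]


def run_gated_workflow(run_kiro_stage, review_stage):
    """Run each stage non-interactively, feeding simulator feedback forward.

    run_kiro_stage(stage, resume_with=...) wraps `kiro --no-interactive`
    (with `--resume` when feedback is present); review_stage(stage) wraps
    the HumanSimulator review of aidlc-docs/ after that stage.
    """
    feedback = None
    for i, stage in enumerate(STAGES):
        run_kiro_stage(stage, resume_with=feedback)
        if i < len(STAGES) - 1:  # no gate after construction, which is final
            feedback = review_stage(stage)
    return feedback
```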

Shared HumanSimulator

  • Single HumanSimulator in packages/cli-harness/src/cli_harness/simulator.py used by all three execution modes (Strands swarm, claude-code-sdk, kiro-cli)
  • Orchestrator constructs it once with full document context (vision + tech_env + OpenAPI) and injects via AdapterConfig.simulator — adapters no longer construct it themselves
  • build_simulator_system_prompt() is the single source of truth in packages/execution

Plugin adapter registration

  • registry.register_adapter(name, fqn) and load_adapters_from_config(cfg_data) allow adding CLI adapters via config/default.yaml with no framework code changes
  • config/default.yaml gains cli.adapters: {} extension point
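A minimal sketch of how such a registry could work, assuming adapters are named by `"module:ClassName"` strings in the config; the real registry.py may differ in details:

```python
import importlib

_ADAPTERS: dict[str, str] = {}  # adapter name -> "package.module:ClassName"


def register_adapter(name: str, fqn: str) -> None:
    """Register an adapter at runtime without editing framework code."""
    _ADAPTERS[name] = fqn


def load_adapters_from_config(cfg_data: dict) -> None:
    """Read cli.adapters from the parsed config YAML and register each entry."""
    for name, fqn in (cfg_data.get("cli", {}).get("adapters") or {}).items():
        register_adapter(name, fqn)


def resolve_adapter(name: str):
    """Import and return the adapter class for a registered name."""
    module_name, _, class_name = _ADAPTERS[name].partition(":")
    return getattr(importlib.import_module(module_name), class_name)
```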

Infrastructure fixes

  • shared/sandbox.py: auto-detect docker, podman, or finch as container runtime
  • docker/sandbox/Dockerfile: pin to Python 3.13 (was 3.14, causing uvicorn startup failures in generated code)
  • packages/execution/runner.py: Mode 1 direct-folder path for orchestrator-specified run folders
  • run.py: --no-sandbox skips container preflight; podman recognised alongside docker

Documentation

  • ARCHITECTURE.md: new section 6.5 CLI Evaluation + full "Adding a New CLI Adapter" cookbook with worked example and contracts table
  • CONTRIBUTING.md: updated package list and work streams table to include cli-harness, ide-harness, trend-reports

User experience

Before: the evaluator supported full runs only through the Strands two-agent swarm; no programmatic CLI adapter included a human reviewer, and the claude-code adapter ran as a one-shot subprocess with no simulation.

After: Three execution modes all use the same human-analog simulator with full document context:

  • run.py full — Strands swarm with HumanSimulator tool
  • run.py cli --cli claude-code-sdk — SDK-driven executor with inline simulator handoffs
  • run.py cli --cli kiro-cli — kiro subprocess with 4 per-stage simulator review gates

New adapters can be registered with one config line and zero framework changes.

Checklist

  • I have reviewed the contributing guidelines
  • I have performed a self-review of this change
  • Changes have been tested
  • Changes are documented

Test Plan

```sh
cd scripts/aidlc-evaluator
uv sync

# Verify all adapters registered and ready
uv run python run.py cli --list

# Run all three modes (can be run in parallel)
uv run python run.py full --no-sandbox
uv run python run.py cli --cli claude-code-sdk \
  --vision test_cases/sci-calc/vision.md \
  --golden test_cases/sci-calc/golden-aidlc-docs \
  --openapi test_cases/sci-calc/openapi.yaml \
  --rules-path <repo-root>
uv run python run.py cli --cli kiro-cli \
  --vision test_cases/sci-calc/vision.md \
  --golden test_cases/sci-calc/golden-aidlc-docs \
  --openapi test_cases/sci-calc/openapi.yaml \
  --rules-path <repo-root>
```

Expected results for each mode:

  • Post-run tests: PASS
  • Contract tests: 88/88 PASS
  • Qualitative score: ~0.75–0.80

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

harmjeff and others added 24 commits April 29, 2026 14:07
…mulator

Parity with gitlab/aidlc-regression:
- run.py: add git-compare / git-compare-report modes, Docker sandbox preflight
  check (DOCKER_DEPENDENT_MODES + check_docker_sandbox())
- run_evaluation.py: restore --rules-repo flag, pass through to aidlc_runner,
  save to evaluation-config.yaml; add direct-folder shortcut in stage_execute()
  so git-compare orchestrator can specify exact timestamped output paths
- scripts/: add run_git_compare.py, regenerate_git_compare_report.py,
  generate_html_report.py (hard dependency of git-compare)

New: claude-code-sdk adapter (packages/cli-harness):
- ClaudeCodeSDKAdapter drives the executor via anthropic.AnthropicBedrock
  instead of `claude -p` subprocess, enabling interactive mid-workflow
  handoffs to an embedded Human Simulator
- Reuses EXECUTOR_SYSTEM_PROMPT and SIMULATOR_SYSTEM_PROMPT_TEMPLATE verbatim
  from packages/execution — no prompt duplication
- Simulator is injected on each handoff_to_simulator tool call; only file
  tools (read/write/list) available to simulator, not run_command
- Token usage tracked separately for executor and simulator buckets;
  compatible with existing run-metrics.yaml schema
- AdapterConfig gains simulator_model and aws_region fields
- orchestrator.py passes aws_region into AdapterConfig
- cli-harness pyproject.toml adds anthropic[bedrock]>=0.40, boto3>=1.42.47

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… with --no-sandbox

check_docker_sandbox() now tries podman if docker is not on PATH.
The preflight is skipped entirely when --no-sandbox is passed so the
evaluator can run on hosts without a container runtime.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…r allocation

Container runtime (sandbox.py):
- Replace all hardcoded 'docker' CLI calls with _get_container_cli() which
  probes docker, podman, and finch in order and caches the result
- is_docker_available() now works with any of the three runtimes
- sandbox_run/run_detached/stop/is_running/logs all use the detected CLI

Run folder allocation (run_evaluation.py):
- stage_execute() now pre-allocates the exact timestamped run folder
  (same {timestamp}-{slug} format as runner.py) before invoking aidlc_runner,
  then passes it as --output-dir so the runner uses it directly (Mode 1)
- Eliminates sentinel file reading, before/after directory diffing, and the
  direct-folder heuristic — the path is deterministic from the start
- Enables safe parallel execution: each caller pre-allocates its own folder
- Removed now-dead helpers: _SENTINEL_NAME, _read_run_sentinel,
  _list_run_folders, _find_new_run
- Added _rules_slug() helper to mirror runner.py slug generation
- stage_execute() signature gains cfg_data: dict parameter

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Without this, anthropic[bedrock] was not installed in the shared venv
and the claude-code-sdk adapter could not be loaded.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…path

runner.py:
- Add Mode 1: if --output-dir is already a timestamped folder name, use it
  directly instead of creating a nested subfolder — required for the
  deterministic run folder pre-allocation in run_evaluation.py to work
- Sentinel write now only happens in Mode 2 (new subfolder)
- mkdir calls use exist_ok=True for idempotency

claude_code_sdk.py:
- Use config.rules_path directly instead of re-copying rules (orchestrator
  already set them up at output_dir/aidlc-rules before calling adapter.run())
- Fix credentials: use get_frozen_credentials() not .resolve()
- Fix default model: global.anthropic.claude-opus-4-6-v1 (not a fake 4-7 ID)
- Handle missing 'content' key in write_file tool input gracefully

orchestrator.py:
- Fix run_evaluation.py path: scripts/run_evaluation.py not run_evaluation.py
  at repo root

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Calls run_post_evaluation() from packages/execution after the executor
loop completes, matching the behaviour of the Strands runner. Detects
the project type in workspace/, installs deps, runs tests, and writes
test-results.yaml. Sandbox is enabled when a container runtime
(docker/podman/finch) is available.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
All three execution modes now give the human-analog agent full visibility
into the OpenAPI contract during design reviews and code review handoffs.

Strands swarm (full mode):
- simulator.py: SIMULATOR_SYSTEM_PROMPT_TEMPLATE gains {openapi_section}
  placeholder; create_simulator() accepts openapi_content parameter and
  renders a binding API contract section into the system prompt
- runner.py: run() accepts openapi_path, reads it, passes to create_simulator()
- cli.py: --openapi flag added; forwarded to run()
- run_evaluation.py stage_execute(): passes --openapi to aidlc_runner when present

Claude SDK adapter (cli mode, claude-code-sdk):
- Reads config.openapi_content and renders the same API contract section
  into the simulator system prompt before the executor loop starts

CLI subprocess adapters (claude-code, kiro-cli):
- prompt_template.py: render_prompt() accepts openapi_content; injects it
  as a binding contract section so the self-approving executor has the
  full spec in view during design and code generation
- Both adapters pass config.openapi_content to render_prompt()

Plumbing:
- AdapterConfig gains openapi_content: str | None field
- orchestrator.run_cli_evaluation() reads openapi_path and populates it

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
simulator.py (execution package):
- Extract build_simulator_system_prompt() as a standalone function so
  other adapters can construct the same prompt without Strands dependencies
- create_simulator() now delegates to it (no behaviour change)

simulator.py (cli-harness, new):
- HumanSimulator class: Anthropic SDK-based reviewer backed by
  build_simulator_system_prompt() — single implementation used by both
  kiro-cli and claude-code-sdk
- HumanSimulator.from_adapter_config() constructs from AdapterConfig fields
- HumanSimulator.respond() runs the turn loop with file tool support
- _exec_file_tool() provides read/write/list scoped to run_folder

kiro_cli.py:
- Replaces hardcoded "Approve & Continue" resumption with HumanSimulator.respond()
- Each kiro turn's output is fed to the simulator; its response resumes the session
- render_prompt() called with with_simulator=True so executor pauses at gates
  instead of self-approving

claude_code_sdk.py:
- Replaces inline _run_simulator_turn() loop with HumanSimulator.respond()
- _SIMULATOR_TOOLS constant removed (owned by HumanSimulator now)
- Dead tech_env_section / openapi_section string-building removed

prompt_template.py:
- {approval_rule} placeholder replaces hardcoded self-approve instruction
- _SELF_APPROVE_RULE / _SIMULATOR_HANDOFF_RULE constants
- render_prompt(with_simulator=True) injects the handoff rule

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
run_cli_evaluation.py:
- Add --simulator-model flag; resolve from config models.simulator.model_id
  when not provided (same pattern as --scorer-model)
- Pass simulator_model through to run_cli_evaluation()

orchestrator.py:
- run_cli_evaluation() accepts simulator_model parameter
- Populates AdapterConfig.simulator_model so HumanSimulator uses the
  same model as the Strands swarm simulator instead of falling back
  to the executor model

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
All three execution modes now use the same HumanSimulator implementation:

execution/runner.py:
- _make_simulator_tool() factory creates a Strands @tool wrapping
  HumanSimulator.respond() — the executor calls handoff_to_simulator
  as a direct tool instead of routing via a second Swarm agent
- Swarm remains single-agent [executor]; MetricsCollector unchanged
- Strands simulator Agent (create_simulator) removed from runner.py

execution/agents/executor.py:
- create_executor() accepts optional simulator_tool parameter and
  appends it to the executor's tool list when provided

execution/pyproject.toml:
- Add anthropic[bedrock]>=0.40 and boto3 deps (needed by HumanSimulator)

Result: vision + tech_env + openapi are injected into the same
build_simulator_system_prompt() for full, cli (claude-code-sdk),
and cli (kiro-cli) modes. One implementation, three entry points.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ll mode

HumanSimulator now accumulates token usage per respond() call in
_input_tokens/_output_tokens etc., exposed via accumulated_usage property
using the same camelCase key format as Strands accumulated_usage.

runner.py:
- _make_simulator_tool() returns (tool, simulator_instance) so the
  instance can be inspected after the swarm completes
- After swarm finishes, calls collector.record_simulator_usage() with
  simulator_instance.accumulated_usage to capture SDK tokens separately

metrics.py:
- MetricsCollector gains _simulator_usage field and record_simulator_usage()
- build_metrics() injects _simulator_usage into per_agent["simulator"] so
  the output shape matches the old two-agent Swarm (executor + simulator
  buckets), with the simulator bucket now sourced from Anthropic SDK usage
  rather than Strands accumulated_usage

Result: run-metrics.yaml tokens.per_agent has distinct "executor" and
"simulator" entries with no mixing, for all three execution modes.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Was removed during cleanup but still referenced in the post-run test
sandbox detection block. Now imported explicitly from the shared package.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Matches claude-code-sdk adapter — calls run_post_evaluation() after
normalize_output() so test-results.yaml is produced and stage 2 is
reported in the evaluation summary.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Generated code targets Python 3.13 (via pyproject.toml requires-python
>=3.13). The 3.14 base caused uv venv to pick up the 3.14 interpreter
which broke uvicorn startup in post-run tests and contract tests.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…reviews

Previous approach used --no-interactive + --resume loop, but kiro ignores
the pause instruction in the prompt and runs the entire workflow in one
turn when --no-interactive is set.

New approach: run kiro without --no-interactive, drive it via stdin.
- Single persistent kiro process with stdin=PIPE, stdout=PIPE
- Send initial prompt to stdin; read output with idle_timeout_s=8.0
- When kiro goes idle (waiting for input), call HumanSimulator.respond()
  with a prompt to read aidlc-docs and provide feedback
- Write simulator response back to kiro's stdin
- Detect construction completion via aidlc-docs/construction/*.md
- Send /quit to close the session cleanly

This gives the simulator genuine review opportunities at each AIDLC gate
rather than having kiro race through the entire workflow unreviewed.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
select.select on macOS with text-mode subprocess pipes deadlocks when
the pipe buffer fills — kiro produces output but readline() blocks
forever because select never returns ready.

Fix: bufsize=0 (unbuffered bytes), background reader thread pushes
raw chunks onto a queue.Queue, _read_until_idle() drains with a
queue.get(timeout=idle_s) which reliably detects silence on all
platforms. All stdin writes updated to encode() for bytes mode.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…t on idle

_read_until_idle() previously only returned when kiro went quiet for
idle_timeout_s seconds. If kiro kept streaming output after completing
the AIDLC workflow, the adapter would never detect completion.

Now checks _is_complete() (construction/*.md exists) after processing
each chunk, so the adapter exits the read loop as soon as the workflow
is done regardless of whether kiro has stopped talking.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…le wait timeout

/quit is not a valid kiro command — the process wouldn't exit, causing
process.wait(timeout=5) to hang and eventually trigger the 7200s timeout.

Now uses process.kill() with a guarded wait on completion and in the
outer cleanup block. total_rc is set to 0 when we initiate the kill
(workflow completed successfully).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Previously kiro's output was only written to the session log (or shown
in full with --verbose). Now meaningful lines (stage transitions, file
creations, thinking steps, responses) are printed to stderr via _log
with [kiro] prefix, filtered to skip spinner frames, credit footers,
and deduplicated consecutive lines.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ighten skip list

Previous approach split on newlines in chunks, but kiro streams
word-by-word so most chunks had no newlines and every word appeared
as a separate 'line'. Now accumulates chars into _line_buf and flushes
on '\n', giving complete lines to the filter.

Extended skip list to suppress spinners, box-drawing, ANSI remnants,
help text, and credit/model footers. Only substantive content (stage
transitions, file operations, responses) reaches stderr.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…review

Root cause: kiro's steering file drives it to full completion regardless
of the user prompt, so interactive stdin and pause instructions are
ignored. The only reliable gate mechanism is kiro's --resume flag.

New approach:
- Phase 1: run kiro with --no-interactive; prompt instructs it to
  execute INCEPTION ONLY and stop after execution-plan.md is written
- Simulator: reads inception artifacts from aidlc-docs/inception/,
  reviews requirements/design/plan, provides feedback and direction
- Phase 2: resume kiro with --resume [simulator feedback]; prompt
  directs it to proceed with Construction using the feedback

This gives the simulator a genuine review of the inception artifacts
before construction begins, using kiro's native conversation
continuation rather than fighting its steering rules.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replaces the single inception/construction two-phase split with four
individual stage gates, each backed by a simulator review:

Gate 1 — Requirements Analysis
  Kiro writes requirements.md + requirement-verification-questions.md
  Simulator: answers verification questions, approves requirements

Gate 2 — Workflow Planning + Application Design
  Kiro writes execution-plan.md + full application design docs
  Simulator: approves workflow plan and architecture

Gate 3 — Code Generation Plan
  Kiro writes the code generation plan (no code yet)
  Simulator: approves plan before any code is written

Gate 4 — Code Generation + Build and Test
  Kiro generates all code, runs tests, writes build summary
  (No simulator gate after — construction is final)

Each stage uses --no-interactive + --resume so kiro exits cleanly at
the sentinel file boundary, the simulator reviews aidlc-docs/, and
the feedback is injected into the next --resume prompt.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…lator

Plugin adapter registration (registry.py):
- register_adapter(name, fqn) adds an adapter at runtime without editing
  framework code
- load_adapters_from_config(cfg_data) reads cli.adapters from config YAML
  and registers each entry; called from run_cli_evaluation.py after config load
- Built-in adapters remain in the default map unchanged

config/default.yaml:
- cli.adapters: {} extension point documented and ready for custom entries

Shared HumanSimulator (orchestrator → AdapterConfig → adapters):
- AdapterConfig gains simulator: HumanSimulator | None field (TYPE_CHECKING
  guard prevents circular import at runtime)
- orchestrator.run_cli_evaluation() constructs HumanSimulator once with
  full document context (vision, tech_env, openapi) and injects into config
- kiro_cli.py: reads config.simulator instead of building locally — removes
  9 lines of duplicate vision/tech_env reads and HumanSimulator construction
- claude_code_sdk.py: reads config.simulator instead of building locally —
  removes 9 lines of duplicate construction; executor Bedrock client
  (separate from simulator) is still built locally as before

Both adapters now raise RuntimeError if config.simulator is None, making
the dependency explicit rather than silently falling back to no-review.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
ARCHITECTURE.md:
- New section 6.5 CLI Evaluation: documents the cli harness pipeline,
  HumanSimulator injection pattern, simulator gate approach, and plugin
  registration
- New cookbook section "Adding a New CLI Adapter" with a complete worked
  example (Step 1 implement, Step 2 register in config, Step 3 verify)
  and a contracts table covering CLIAdapter, simulator, normalizer,
  post-run tests, and document context fields
- Cross-references plugin registration anchor

CONTRIBUTING.md:
- Updated package list to include cli-harness, ide-harness, trend-reports
- Updated work streams table with CLI Adapters, IDE Adapters, Trend Reporting rows

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

@github-advanced-security github-advanced-security AI left a comment


Semgrep OSS found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

harmjeff and others added 4 commits April 30, 2026 14:57
…dings

Semgrep flags the closing paren of multi-line subprocess calls when the
suppression comment is on the preceding line. Moved all suppressions to
inline comments on the subprocess.run/Popen line itself so semgrep
correctly associates the suppression with the finding.

Affected:
- run.py: check_docker_sandbox() — two container CLI info/images calls
- run_git_compare.py:372 — run_evaluation.py subprocess call
- kiro_cli.py:181 — Popen for kiro-cli chat subprocess
- claude_code_sdk.py:312 — run_command tool subprocess.run

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…xtures

test_credential_scrubber.py intentionally contains fake credentials
(example JWT, placeholder GitHub tokens, dummy API keys) to test the
scrubbing logic. Add .gitleaks.toml to suppress these false positives.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…lowlist

ARCHITECTURE.md: align all table cell widths to match separator row
exactly so MD060 table-column-style passes with the repo's 'aligned'
style config. Replaced Unicode em dashes with ASCII hyphens in table
cells to avoid byte-vs-char width discrepancy.

.gitleaks.toml: suppress false positives in test_credential_scrubber.py
which intentionally uses fake JWT/GitHub tokens to test scrubbing logic.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
CI runs semgrep with --config=r/all which uses the full registry rule ID:
  python.lang.security.audit.dangerous-subprocess-use-audit.dangerous-subprocess-use-audit

Our previous suppressions used the short form 'dangerous-subprocess-use-audit'
which only matches local/custom configs. Updated all five suppression
comments to use the full dotted rule ID so CI correctly ignores them.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR expands scripts/aidlc-evaluator/ with (1) a new git-ref comparison runner and reports, and (2) a CLI-evaluation architecture that standardizes “human reviewer” simulation across execution modes via a shared HumanSimulator, plus plugin-style CLI adapter registration.

Changes:

  • Add git-compare / git-compare-report modes with markdown+HTML reporting and (optional) parallel runs.
  • Introduce a shared Anthropic SDK–based HumanSimulator used by Strands execution, claude-code-sdk, and kiro-cli stage gates.
  • Add config-driven CLI adapter registration (cli.adapters) and wire it into the CLI evaluation entrypoint.

Reviewed changes

Copilot reviewed 28 out of 29 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| scripts/aidlc-evaluator/uv.lock | Adds new deps (Anthropic SDK + boto3) to workspace lockfile. |
| scripts/aidlc-evaluator/scripts/run_git_compare.py | New git-ref comparison runner + report generation + optional parallel execution. |
| scripts/aidlc-evaluator/scripts/run_evaluation.py | Adds --rules-repo, changes run-folder allocation behavior for stage execution. |
| scripts/aidlc-evaluator/scripts/run_cli_evaluation.py | Loads adapters from config and adds --simulator-model plumbed into orchestrator. |
| scripts/aidlc-evaluator/scripts/regenerate_git_compare_report.py | New utility to regenerate git-compare reports without re-running evaluations. |
| scripts/aidlc-evaluator/scripts/generate_html_report.py | New interactive HTML report generator (Chart.js) for git-compare results. |
| scripts/aidlc-evaluator/run.py | Adds new modes and container sandbox preflight (docker/podman). |
| scripts/aidlc-evaluator/pyproject.toml | Adds aidlc-cli-harness workspace dependency. |
| scripts/aidlc-evaluator/packages/shared/src/shared/sandbox.py | Adds container runtime auto-detection (docker/podman/finch) for sandbox runs. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/runner.py | Switches simulator to a tool-based HumanSimulator integration; updates run-folder creation semantics. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/metrics.py | Adds separate simulator token usage bucket to metrics. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/cli.py | Adds --openapi passthrough into runner for simulator contract-awareness. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/agents/simulator.py | Extracts build_simulator_system_prompt() and adds OpenAPI contract injection. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/agents/executor.py | Allows injecting simulator tool into executor tools. |
| scripts/aidlc-evaluator/packages/execution/pyproject.toml | Adds Anthropic SDK + boto3 dependencies. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/simulator.py | New shared Anthropic Bedrock HumanSimulator with file tools and usage tracking. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/registry.py | Adds adapter registration + config-based adapter loading. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/prompt_template.py | Adds OpenAPI injection and “pause for reviewer” vs “self-approve” prompt modes. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/orchestrator.py | Constructs and injects shared HumanSimulator; fixes run_evaluation path. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/adapters/kiro_cli.py | Adds per-stage simulator gates and post-run tests integration. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/adapters/claude_code_sdk.py | New Anthropic SDK adapter with inline simulator handoffs + tool loop. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/adapters/claude_code.py | Injects OpenAPI into prompt for the subprocess-based adapter. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/adapter.py | Extends AdapterConfig with simulator/openapi/aws fields. |
| scripts/aidlc-evaluator/packages/cli-harness/pyproject.toml | Adds Anthropic SDK + boto3 dependencies. |
| scripts/aidlc-evaluator/docker/sandbox/Dockerfile | Pins sandbox image to Python 3.13. |
| scripts/aidlc-evaluator/config/default.yaml | Adds cli.adapters extension point. |
| scripts/aidlc-evaluator/CONTRIBUTING.md | Updates package/work-stream documentation. |
| scripts/aidlc-evaluator/ARCHITECTURE.md | Documents CLI evaluation architecture and plugin adapter cookbook. |
| scripts/aidlc-evaluator/.gitleaks.toml | Adds gitleaks allowlist for known fake-credential fixtures. |



```python
from aidlc_runner.agents.executor import create_executor
from aidlc_runner.agents.simulator import create_simulator
from aidlc_runner.agents.simulator import build_simulator_system_prompt
```
Comment on lines +484 to +493
```python
session_kwargs: dict = {}
if config.aws_profile:
    session_kwargs["profile_name"] = config.aws_profile
boto_session = boto3.Session(**session_kwargs)
frozen = boto_session.get_credentials().get_frozen_credentials()
client = anthropic.AnthropicBedrock(
    aws_access_key=frozen.access_key,
    aws_secret_key=frozen.secret_key,
    aws_session_token=frozen.token,
    aws_region=aws_region,
```
Member


This seems like a valid check here. What are your thoughts?

Comment thread scripts/aidlc-evaluator/scripts/run_evaluation.py
```python
    "--scorer-model", scorer_model,
    "--rules-ref", version.ref,
    "--report-format", "both",
    "--output-dir", str(run_folder),  # Pass full folder path, not parent dir
```
Comment thread scripts/aidlc-evaluator/scripts/generate_html_report.py
Comment thread scripts/aidlc-evaluator/scripts/run_git_compare.py

```python
@property
def accumulated_usage(self) -> dict[str, int]:
    """Token totals across all respond() calls, keyed by snake_case names
```
Comment on lines +75 to +82
```python
def _container_cli() -> str | None:
    """Return the first available container CLI: docker or podman."""
    import shutil
    for cli in ("docker", "podman"):
        if shutil.which(cli):
            return cli
    return None
```

Also writes a sentinel file (``{output_dir}/.last_run_folder``) containing
the absolute path of the new run folder so that parent orchestrators can
discover the folder without racy before/after directory listing.
Also writes a sentinel file (``{output_dir.parent}/.last_run_folder``) in
Semgrep requires the suppression comment to be on the exact line of the
finding. Preceding-line comments are not reliably associated with the
call when both the finding and suppression are new (introduced in this PR).

Moved all five nosemgrep suppressions to inline on the subprocess.run()
/ Popen() line itself, using the full rule ID:
  python.lang.security.audit.dangerous-subprocess-use-audit.dangerous-subprocess-use-audit

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@harmjeff harmjeff added the dependencies label (Pull requests that update a dependency file) May 1, 2026
@harmjeff harmjeff enabled auto-merge May 1, 2026 14:27
harmjeff and others added 2 commits May 1, 2026 11:25
Documents the one-pass scan sequence, correct nosemgrep inline syntax,
full rule IDs required for CI (--config=r/all), and how to distinguish
the live 'semgrep' CI job from the stale 'Semgrep OSS' code scanning
annotations.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
# Multi-language sandbox image for running AI-generated code in isolation.
#
# Includes Python 3.14 + uv, Node.js 22 + npm, and common build tools.
# Includes Python 3.13 + uv, Node.js 22 + npm, and common build tools.
Member

was there something that caused 3.14 -> 3.13?

Comment on lines +15 to +19
import json
import logging
import os
import shlex
import subprocess
Member

There are RCE and other vulnerability risks from importing these at the top level. All inputs and outputs must be treated as untrusted.

return resolved.read_text(encoding="utf-8")

elif name == "run_command":
command = tool_input["command"]
Member

How might the tool's input be validated?
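One possible answer, sketched under assumptions (the allowlist contents and helper name are illustrative, not from the PR): tokenize with `shlex`, allowlist the binary, and reject path-escape arguments before anything reaches `subprocess`.

```python
import shlex

# Hypothetical allowlist; the real set would come from evaluator config.
ALLOWED_BINARIES = {"python", "pytest", "uv", "npm", "node"}

def validate_command(command: str) -> list[str]:
    """Parse and validate an untrusted command string before execution."""
    argv = shlex.split(command)
    if not argv:
        raise ValueError("empty command")
    if argv[0] not in ALLOWED_BINARIES:
        raise ValueError(f"binary not allowed: {argv[0]}")
    # Reject absolute paths and parent-directory traversal in arguments.
    if any(tok.startswith("/") or ".." in tok for tok in argv[1:]):
        raise ValueError("path escape rejected")
    return argv
```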

Member

@scottschreckengaust scottschreckengaust left a comment

LGTM

Change inline full-rule-ID nosemgrep comments to preceding-line
short-name format (# nosemgrep: dangerous-subprocess-use-audit),
matching the pattern used throughout the rest of the codebase that
the Semgrep OSS GitHub App correctly honours.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Contributor

@Kalindi-Dev Kalindi-Dev left a comment

@harmjeff
Few suggestions:

1. Frozen credentials can expire during long runs

frozen = boto_session.get_credentials().get_frozen_credentials()
client = anthropic.AnthropicBedrock(
    aws_access_key=frozen.access_key,
    aws_secret_key=frozen.secret_key,
    aws_session_token=frozen.token,
    ...
)

With a 2-hour timeout and STS session tokens often lasting 1 hour, the credentials could expire mid-run. The AnthropicBedrock client doesn't refresh them. This affects both the executor and simulator.
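One mitigation sketch (names are illustrative; it assumes the factory callable re-freezes credentials from the boto3 session on every rebuild, so botocore's refreshable provider can hand over rotated STS tokens):

```python
import time

class RefreshingClientFactory:
    """Rebuild a client before its frozen credentials expire (sketch).

    `make_client` should call get_frozen_credentials() itself so each
    rebuild snapshots freshly rotated credentials; `ttl_seconds` must sit
    safely under the session lifetime (e.g. 45 min for 1-hour tokens).
    """

    def __init__(self, make_client, ttl_seconds: float = 45 * 60):
        self._make = make_client
        self._ttl = ttl_seconds
        self._client = None
        self._created_at = float("-inf")

    def get(self):
        # Monotonic clock: immune to wall-clock adjustments mid-run.
        now = time.monotonic()
        if self._client is None or now - self._created_at > self._ttl:
            self._client = self._make()
            self._created_at = now
        return self._client
```

Call `factory.get()` at each executor/simulator turn instead of holding one client for the full two-hour run.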


2. Token counting may overcount

@property
def total(self) -> int:
    return self.input_tokens + self.output_tokens + self.cache_read_tokens + self.cache_write_tokens

Cache read/write tokens are typically a breakdown of input tokens (not additive). Summing all four could inflate totals.
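If that breakdown assumption holds for the Bedrock usage payloads (worth verifying against actual responses), the fix is to keep reporting the cache counts without re-adding them — a minimal sketch:

```python
from dataclasses import dataclass

@dataclass
class TokenUsage:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0

    @property
    def total(self) -> int:
        # Assumes cache_read/cache_write are a breakdown of input_tokens,
        # so they are tracked for reporting but not re-added here.
        return self.input_tokens + self.output_tokens
```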


3. No unit tests for 800+ lines of new code

The PR adds claude_code_sdk.py (587 lines), simulator.py (229 lines), and significantly rewrites kiro_cli.py — but no test files. The tool execution logic (_exec_tool, _resolve_safe), the token tracking, and the stage-gate logic
are all testable in isolation without LLM calls.
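For example, a path-confinement helper along the lines of `_resolve_safe` (signature assumed here, not taken from the PR) can be pinned down with dependency-free tests:

```python
from pathlib import Path

def resolve_safe(root: Path, candidate: str) -> Path:
    """Hypothetical stand-in for _resolve_safe: confine a path to root."""
    resolved = (root / candidate).resolve()
    if not resolved.is_relative_to(root.resolve()):
        raise ValueError(f"path escapes sandbox: {candidate}")
    return resolved

def test_allows_paths_inside_root():
    assert resolve_safe(Path("/srv/sandbox"), "src/app.py").name == "app.py"

def test_rejects_parent_escape():
    try:
        resolve_safe(Path("/srv/sandbox"), "../../etc/passwd")
    except ValueError:
        return
    assert False, "expected ValueError"
```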

harmjeff and others added 3 commits May 5, 2026 18:04
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Comment thread scripts/aidlc-evaluator/scripts/run_git_compare.py Fixed
…re.py

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Comment thread scripts/aidlc-evaluator/scripts/run_git_compare.py Fixed
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Contributor

@leandrodamascena leandrodamascena left a comment

Hey @harmjeff, excellent work on this PR. I created a new adapter for OpenCode locally and it worked really well. The CLIAdapter interface + plugin registry made it trivial to add a new CLI tool without touching any framework code, this is really nice.

I ran both claude-code and opencode against the sci-calc test case and results are pretty similar. The only issue I found was with token usage reporting for OpenCode, but I know why and how to fix.

I have the adapter file ready if you want to include it in this PR or as a follow-up.

After you fix the bug with --no-sandbox flag, we are good to merge and then work with @scottschreckengaust in another PR to add this in CodeBuild.

report_opencode.html

report_claude.html

Comment on lines +250 to +270
sandbox_disabled = "--no-sandbox" in remaining
if args.mode in DOCKER_DEPENDENT_MODES and not sandbox_disabled:
if not check_docker_sandbox():
print("=" * 70, file=sys.stderr)
print("ERROR: Docker sandbox image not found", file=sys.stderr)
print("=" * 70, file=sys.stderr)
print(file=sys.stderr)
print("The evaluation framework requires the Docker sandbox image", file=sys.stderr)
print("'aidlc-sandbox:latest' to run generated code safely.", file=sys.stderr)
print(file=sys.stderr)
print("To build the image, run:", file=sys.stderr)
print(" ./docker/sandbox/build.sh", file=sys.stderr)
print(file=sys.stderr)
print("Or manually:", file=sys.stderr)
print(" docker build -t aidlc-sandbox:latest docker/sandbox/", file=sys.stderr)
print(file=sys.stderr)
print("To run without Docker (not recommended for untrusted code),", file=sys.stderr)
print("set 'execution.sandbox.enabled: false' in config/default.yaml", file=sys.stderr)
print("=" * 70, file=sys.stderr)
sys.exit(1)

Contributor

Hey Jeff, I was testing the CLI adapter locally and I found a bug with --no-sandbox. The flag is checked in run.py to skip the Docker preflight, but it's forwarded unchanged to the sub-script via cmd.extend(remaining). Since run_cli_evaluation.py doesn't recognize it, you get unrecognized arguments: --no-sandbox. Without Docker you're stuck either way.

Quick fix in scripts/aidlc-evaluator/run.py:

     sandbox_disabled = "--no-sandbox" in remaining
     if args.mode in DOCKER_DEPENDENT_MODES and not sandbox_disabled:
         ...

+    if sandbox_disabled:
+        remaining = [arg for arg in remaining if arg != "--no-sandbox"]
+
     cmd = [sys.executable, str(script)]


Labels

dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation

6 participants