
feat(evaluator): parity sync, Claude SDK adapter, kiro per-stage gates, shared simulator #235

Open

harmjeff wants to merge 37 commits into awslabs:main from harmjeff:fix/evaluator-update

Conversation

@harmjeff
Contributor

@harmjeff harmjeff commented Apr 30, 2026

Summary

Brings scripts/aidlc-evaluator to parity with the internal GitLab baseline and adds significant new CLI evaluation capabilities: a Claude SDK adapter with mid-workflow simulator handoffs, kiro-cli per-stage reviewer gates, a shared HumanSimulator injected by the orchestrator, and a plugin adapter registration system. All changes are scoped to scripts/aidlc-evaluator/.

Changes

Parity sync (GitLab → GitHub)

  • run.py: add git-compare / git-compare-report modes; Docker/Podman/Finch preflight check
  • scripts/run_evaluation.py: restore --rules-repo flag; deterministic run folder pre-allocation (eliminates sentinel file races, enables parallel execution)
  • Add scripts/run_git_compare.py, regenerate_git_compare_report.py, generate_html_report.py

Claude SDK adapter (--cli claude-code-sdk)

  • New ClaudeCodeSDKAdapter drives the AIDLC executor via anthropic.AnthropicBedrock turn-by-turn
  • Intercepts handoff_to_simulator tool calls to inject human-analog reviews mid-workflow
  • Reuses EXECUTOR_SYSTEM_PROMPT from packages/execution — no prompt duplication
  • Post-run tests wired in; separate executor/simulator token buckets in metrics
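A minimal sketch of that turn loop, assuming a message/tool-use shape like the Anthropic SDK's; `call_model` and `simulator_respond` are hypothetical stand-ins for the Bedrock client call and `HumanSimulator.respond()`:

```python
def run_executor_loop(call_model, simulator_respond, max_turns=50):
    """Drive the executor turn by turn, intercepting simulator handoffs."""
    messages = [{"role": "user", "content": "Begin the AIDLC workflow."}]
    for _ in range(max_turns):
        reply = call_model(messages)  # one assistant turn
        messages.append({"role": "assistant", "content": reply["content"]})
        tool_calls = [b for b in reply["content"] if b.get("type") == "tool_use"]
        if not tool_calls:
            return messages  # no more tool calls: executor is done
        results = []
        for call in tool_calls:
            if call["name"] == "handoff_to_simulator":
                # Inject the human-analog review instead of running a tool.
                output = simulator_respond(call["input"]["message"])
            else:
                output = f"(executed {call['name']})"
            results.append({"type": "tool_result",
                            "tool_use_id": call["id"], "content": output})
        messages.append({"role": "user", "content": results})
    return messages
```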

Kiro-cli per-stage simulator gates

  • 4 explicit stage gates: Requirements → Design → Code-gen plan → Construction
  • Uses kiro's native --no-interactive + --resume to inject simulator feedback between stages
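The gate sequence can be sketched as a simple loop; `run_kiro_stage` and `review_stage` are illustrative stand-ins for the kiro subprocess invocation and the simulator review, not the adapter's real API:

```python
STAGES = ["requirements", "design", "codegen-plan", "construction"]


def run_gated_workflow(run_kiro_stage, review_stage):
    """Run each stage non-interactively, feeding simulator feedback forward.

    run_kiro_stage(stage, resume_with=...) wraps `kiro --no-interactive`
    (with `--resume` when feedback is present); review_stage(stage) wraps
    the HumanSimulator review of aidlc-docs/ after that stage.
    """
    feedback = None
    for i, stage in enumerate(STAGES):
        run_kiro_stage(stage, resume_with=feedback)
        if i < len(STAGES) - 1:  # no gate after construction, which is final
            feedback = review_stage(stage)
    return feedback
```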

Shared HumanSimulator

  • Single HumanSimulator in packages/cli-harness/src/cli_harness/simulator.py used by all three execution modes (Strands swarm, claude-code-sdk, kiro-cli)
  • Orchestrator constructs it once with full document context (vision + tech_env + OpenAPI) and injects via AdapterConfig.simulator — adapters no longer construct it themselves
  • build_simulator_system_prompt() is the single source of truth in packages/execution

Plugin adapter registration

  • registry.register_adapter(name, fqn) and load_adapters_from_config(cfg_data) allow adding CLI adapters via config/default.yaml with no framework code changes
  • config/default.yaml gains cli.adapters: {} extension point
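A minimal sketch of how such a registry could work, assuming adapters are named by `"module:ClassName"` strings in the config; the real registry.py may differ in details:

```python
import importlib

_ADAPTERS: dict[str, str] = {}  # adapter name -> "package.module:ClassName"


def register_adapter(name: str, fqn: str) -> None:
    """Register an adapter at runtime without editing framework code."""
    _ADAPTERS[name] = fqn


def load_adapters_from_config(cfg_data: dict) -> None:
    """Read cli.adapters from the parsed config YAML and register each entry."""
    for name, fqn in (cfg_data.get("cli", {}).get("adapters") or {}).items():
        register_adapter(name, fqn)


def resolve_adapter(name: str):
    """Import and return the adapter class for a registered name."""
    module_name, _, class_name = _ADAPTERS[name].partition(":")
    return getattr(importlib.import_module(module_name), class_name)
```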

Infrastructure fixes

  • shared/sandbox.py: auto-detect docker, podman, or finch as container runtime
  • docker/sandbox/Dockerfile: pin to Python 3.13 (was 3.14, causing uvicorn startup failures in generated code)
  • packages/execution/runner.py: Mode 1 direct-folder path for orchestrator-specified run folders
  • run.py: --no-sandbox skips container preflight; podman recognised alongside docker

Documentation

  • ARCHITECTURE.md: new section 6.5 CLI Evaluation + full "Adding a New CLI Adapter" cookbook with worked example and contracts table
  • CONTRIBUTING.md: updated package list and work streams table to include cli-harness, ide-harness, trend-reports

User experience

Before: the evaluator supported full runs only through the Strands two-agent swarm; no programmatic CLI adapter included a human reviewer, and the claude-code adapter ran as a one-shot subprocess with no simulation.

After: Three execution modes all use the same human-analog simulator with full document context:

  • run.py full — Strands swarm with HumanSimulator tool
  • run.py cli --cli claude-code-sdk — SDK-driven executor with inline simulator handoffs
  • run.py cli --cli kiro-cli — kiro subprocess with 4 per-stage simulator review gates

New adapters can be registered with one config line and zero framework changes.

Checklist

  • I have reviewed the contributing guidelines
  • I have performed a self-review of this change
  • Changes have been tested
  • Changes are documented

Test Plan

```sh
cd scripts/aidlc-evaluator
uv sync

# Verify all adapters registered and ready
uv run python run.py cli --list

# Run all three modes (can be run in parallel)
uv run python run.py full --no-sandbox
uv run python run.py cli --cli claude-code-sdk \
  --vision test_cases/sci-calc/vision.md \
  --golden test_cases/sci-calc/golden-aidlc-docs \
  --openapi test_cases/sci-calc/openapi.yaml \
  --rules-path <repo-root>
uv run python run.py cli --cli kiro-cli \
  --vision test_cases/sci-calc/vision.md \
  --golden test_cases/sci-calc/golden-aidlc-docs \
  --openapi test_cases/sci-calc/openapi.yaml \
  --rules-path <repo-root>
```

Expected results for each mode:

  • Post-run tests: PASS
  • Contract tests: 88/88 PASS
  • Qualitative score: ~0.75–0.80

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

harmjeff and others added 24 commits April 29, 2026 14:07
…mulator

Parity with gitlab/aidlc-regression:
- run.py: add git-compare / git-compare-report modes, Docker sandbox preflight
  check (DOCKER_DEPENDENT_MODES + check_docker_sandbox())
- run_evaluation.py: restore --rules-repo flag, pass through to aidlc_runner,
  save to evaluation-config.yaml; add direct-folder shortcut in stage_execute()
  so git-compare orchestrator can specify exact timestamped output paths
- scripts/: add run_git_compare.py, regenerate_git_compare_report.py,
  generate_html_report.py (hard dependency of git-compare)

New: claude-code-sdk adapter (packages/cli-harness):
- ClaudeCodeSDKAdapter drives the executor via anthropic.AnthropicBedrock
  instead of `claude -p` subprocess, enabling interactive mid-workflow
  handoffs to an embedded Human Simulator
- Reuses EXECUTOR_SYSTEM_PROMPT and SIMULATOR_SYSTEM_PROMPT_TEMPLATE verbatim
  from packages/execution — no prompt duplication
- Simulator is injected on each handoff_to_simulator tool call; only file
  tools (read/write/list) available to simulator, not run_command
- Token usage tracked separately for executor and simulator buckets;
  compatible with existing run-metrics.yaml schema
- AdapterConfig gains simulator_model and aws_region fields
- orchestrator.py passes aws_region into AdapterConfig
- cli-harness pyproject.toml adds anthropic[bedrock]>=0.40, boto3>=1.42.47

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… with --no-sandbox

check_docker_sandbox() now tries podman if docker is not on PATH.
The preflight is skipped entirely when --no-sandbox is passed so the
evaluator can run on hosts without a container runtime.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…r allocation

Container runtime (sandbox.py):
- Replace all hardcoded 'docker' CLI calls with _get_container_cli() which
  probes docker, podman, and finch in order and caches the result
- is_docker_available() now works with any of the three runtimes
- sandbox_run/run_detached/stop/is_running/logs all use the detected CLI

Run folder allocation (run_evaluation.py):
- stage_execute() now pre-allocates the exact timestamped run folder
  (same {timestamp}-{slug} format as runner.py) before invoking aidlc_runner,
  then passes it as --output-dir so the runner uses it directly (Mode 1)
- Eliminates sentinel file reading, before/after directory diffing, and the
  direct-folder heuristic — the path is deterministic from the start
- Enables safe parallel execution: each caller pre-allocates its own folder
- Removed now-dead helpers: _SENTINEL_NAME, _read_run_sentinel,
  _list_run_folders, _find_new_run
- Added _rules_slug() helper to mirror runner.py slug generation
- stage_execute() signature gains cfg_data: dict parameter

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Without this, anthropic[bedrock] was not installed in the shared venv
and the claude-code-sdk adapter could not be loaded.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…path

runner.py:
- Add Mode 1: if --output-dir is already a timestamped folder name, use it
  directly instead of creating a nested subfolder — required for the
  deterministic run folder pre-allocation in run_evaluation.py to work
- Sentinel write now only happens in Mode 2 (new subfolder)
- mkdir calls use exist_ok=True for idempotency

claude_code_sdk.py:
- Use config.rules_path directly instead of re-copying rules (orchestrator
  already set them up at output_dir/aidlc-rules before calling adapter.run())
- Fix credentials: use get_frozen_credentials() not .resolve()
- Fix default model: global.anthropic.claude-opus-4-6-v1 (not a fake 4-7 ID)
- Handle missing 'content' key in write_file tool input gracefully

orchestrator.py:
- Fix run_evaluation.py path: scripts/run_evaluation.py not run_evaluation.py
  at repo root

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Calls run_post_evaluation() from packages/execution after the executor
loop completes, matching the behaviour of the Strands runner. Detects
the project type in workspace/, installs deps, runs tests, and writes
test-results.yaml. Sandbox is enabled when a container runtime
(docker/podman/finch) is available.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
All three execution modes now give the human-analog agent full visibility
into the OpenAPI contract during design reviews and code review handoffs.

Strands swarm (full mode):
- simulator.py: SIMULATOR_SYSTEM_PROMPT_TEMPLATE gains {openapi_section}
  placeholder; create_simulator() accepts openapi_content parameter and
  renders a binding API contract section into the system prompt
- runner.py: run() accepts openapi_path, reads it, passes to create_simulator()
- cli.py: --openapi flag added; forwarded to run()
- run_evaluation.py stage_execute(): passes --openapi to aidlc_runner when present

Claude SDK adapter (cli mode, claude-code-sdk):
- Reads config.openapi_content and renders the same API contract section
  into the simulator system prompt before the executor loop starts

CLI subprocess adapters (claude-code, kiro-cli):
- prompt_template.py: render_prompt() accepts openapi_content; injects it
  as a binding contract section so the self-approving executor has the
  full spec in view during design and code generation
- Both adapters pass config.openapi_content to render_prompt()

Plumbing:
- AdapterConfig gains openapi_content: str | None field
- orchestrator.run_cli_evaluation() reads openapi_path and populates it

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
simulator.py (execution package):
- Extract build_simulator_system_prompt() as a standalone function so
  other adapters can construct the same prompt without Strands dependencies
- create_simulator() now delegates to it (no behaviour change)

simulator.py (cli-harness, new):
- HumanSimulator class: Anthropic SDK-based reviewer backed by
  build_simulator_system_prompt() — single implementation used by both
  kiro-cli and claude-code-sdk
- HumanSimulator.from_adapter_config() constructs from AdapterConfig fields
- HumanSimulator.respond() runs the turn loop with file tool support
- _exec_file_tool() provides read/write/list scoped to run_folder

kiro_cli.py:
- Replaces hardcoded "Approve & Continue" resumption with HumanSimulator.respond()
- Each kiro turn's output is fed to the simulator; its response resumes the session
- render_prompt() called with with_simulator=True so executor pauses at gates
  instead of self-approving

claude_code_sdk.py:
- Replaces inline _run_simulator_turn() loop with HumanSimulator.respond()
- _SIMULATOR_TOOLS constant removed (owned by HumanSimulator now)
- Dead tech_env_section / openapi_section string-building removed

prompt_template.py:
- {approval_rule} placeholder replaces hardcoded self-approve instruction
- _SELF_APPROVE_RULE / _SIMULATOR_HANDOFF_RULE constants
- render_prompt(with_simulator=True) injects the handoff rule

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
run_cli_evaluation.py:
- Add --simulator-model flag; resolve from config models.simulator.model_id
  when not provided (same pattern as --scorer-model)
- Pass simulator_model through to run_cli_evaluation()

orchestrator.py:
- run_cli_evaluation() accepts simulator_model parameter
- Populates AdapterConfig.simulator_model so HumanSimulator uses the
  same model as the Strands swarm simulator instead of falling back
  to the executor model

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
All three execution modes now use the same HumanSimulator implementation:

execution/runner.py:
- _make_simulator_tool() factory creates a Strands @tool wrapping
  HumanSimulator.respond() — the executor calls handoff_to_simulator
  as a direct tool instead of routing via a second Swarm agent
- Swarm remains single-agent [executor]; MetricsCollector unchanged
- Strands simulator Agent (create_simulator) removed from runner.py

execution/agents/executor.py:
- create_executor() accepts optional simulator_tool parameter and
  appends it to the executor's tool list when provided

execution/pyproject.toml:
- Add anthropic[bedrock]>=0.40 and boto3 deps (needed by HumanSimulator)

Result: vision + tech_env + openapi are injected into the same
build_simulator_system_prompt() for full, cli (claude-code-sdk),
and cli (kiro-cli) modes. One implementation, three entry points.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ll mode

HumanSimulator now accumulates token usage per respond() call in
_input_tokens/_output_tokens etc., exposed via accumulated_usage property
using the same camelCase key format as Strands accumulated_usage.

runner.py:
- _make_simulator_tool() returns (tool, simulator_instance) so the
  instance can be inspected after the swarm completes
- After swarm finishes, calls collector.record_simulator_usage() with
  simulator_instance.accumulated_usage to capture SDK tokens separately

metrics.py:
- MetricsCollector gains _simulator_usage field and record_simulator_usage()
- build_metrics() injects _simulator_usage into per_agent["simulator"] so
  the output shape matches the old two-agent Swarm (executor + simulator
  buckets), with the simulator bucket now sourced from Anthropic SDK usage
  rather than Strands accumulated_usage

Result: run-metrics.yaml tokens.per_agent has distinct "executor" and
"simulator" entries with no mixing, for all three execution modes.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Was removed during cleanup but still referenced in the post-run test
sandbox detection block. Now imported explicitly from the shared package.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Matches claude-code-sdk adapter — calls run_post_evaluation() after
normalize_output() so test-results.yaml is produced and stage 2 is
reported in the evaluation summary.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Generated code targets Python 3.13 (via pyproject.toml requires-python
>=3.13). The 3.14 base caused uv venv to pick up the 3.14 interpreter
which broke uvicorn startup in post-run tests and contract tests.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…reviews

Previous approach used --no-interactive + --resume loop, but kiro ignores
the pause instruction in the prompt and runs the entire workflow in one
turn when --no-interactive is set.

New approach: run kiro without --no-interactive, drive it via stdin.
- Single persistent kiro process with stdin=PIPE, stdout=PIPE
- Send initial prompt to stdin; read output with idle_timeout_s=8.0
- When kiro goes idle (waiting for input), call HumanSimulator.respond()
  with a prompt to read aidlc-docs and provide feedback
- Write simulator response back to kiro's stdin
- Detect construction completion via aidlc-docs/construction/*.md
- Send /quit to close the session cleanly

This gives the simulator genuine review opportunities at each AIDLC gate
rather than having kiro race through the entire workflow unreviewed.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
select.select on macOS with text-mode subprocess pipes deadlocks when
the pipe buffer fills — kiro produces output but readline() blocks
forever because select never returns ready.

Fix: bufsize=0 (unbuffered bytes), background reader thread pushes
raw chunks onto a queue.Queue, _read_until_idle() drains with a
queue.get(timeout=idle_s) which reliably detects silence on all
platforms. All stdin writes updated to encode() for bytes mode.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…t on idle

_read_until_idle() previously only returned when kiro went quiet for
idle_timeout_s seconds. If kiro kept streaming output after completing
the AIDLC workflow, the adapter would never detect completion.

Now checks _is_complete() (construction/*.md exists) after processing
each chunk, so the adapter exits the read loop as soon as the workflow
is done regardless of whether kiro has stopped talking.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…le wait timeout

/quit is not a valid kiro command — the process wouldn't exit, causing
process.wait(timeout=5) to hang and eventually trigger the 7200s timeout.

Now uses process.kill() with a guarded wait on completion and in the
outer cleanup block. total_rc is set to 0 when we initiate the kill
(workflow completed successfully).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Previously kiro's output was only written to the session log (or shown
in full with --verbose). Now meaningful lines (stage transitions, file
creations, thinking steps, responses) are printed to stderr via _log
with [kiro] prefix, filtered to skip spinner frames, credit footers,
and deduplicated consecutive lines.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ighten skip list

Previous approach split on newlines in chunks, but kiro streams
word-by-word so most chunks had no newlines and every word appeared
as a separate 'line'. Now accumulates chars into _line_buf and flushes
on '\n', giving complete lines to the filter.

Extended skip list to suppress spinners, box-drawing, ANSI remnants,
help text, and credit/model footers. Only substantive content (stage
transitions, file operations, responses) reaches stderr.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…review

Root cause: kiro's steering file drives it to full completion regardless
of the user prompt, so interactive stdin and pause instructions are
ignored. The only reliable gate mechanism is kiro's --resume flag.

New approach:
- Phase 1: run kiro with --no-interactive; prompt instructs it to
  execute INCEPTION ONLY and stop after execution-plan.md is written
- Simulator: reads inception artifacts from aidlc-docs/inception/,
  reviews requirements/design/plan, provides feedback and direction
- Phase 2: resume kiro with --resume [simulator feedback]; prompt
  directs it to proceed with Construction using the feedback

This gives the simulator a genuine review of the inception artifacts
before construction begins, using kiro's native conversation
continuation rather than fighting its steering rules.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replaces the single inception/construction two-phase split with four
individual stage gates, each backed by a simulator review:

Gate 1 — Requirements Analysis
  Kiro writes requirements.md + requirement-verification-questions.md
  Simulator: answers verification questions, approves requirements

Gate 2 — Workflow Planning + Application Design
  Kiro writes execution-plan.md + full application design docs
  Simulator: approves workflow plan and architecture

Gate 3 — Code Generation Plan
  Kiro writes the code generation plan (no code yet)
  Simulator: approves plan before any code is written

Gate 4 — Code Generation + Build and Test
  Kiro generates all code, runs tests, writes build summary
  (No simulator gate after — construction is final)

Each stage uses --no-interactive + --resume so kiro exits cleanly at
the sentinel file boundary, the simulator reviews aidlc-docs/, and
the feedback is injected into the next --resume prompt.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…lator

Plugin adapter registration (registry.py):
- register_adapter(name, fqn) adds an adapter at runtime without editing
  framework code
- load_adapters_from_config(cfg_data) reads cli.adapters from config YAML
  and registers each entry; called from run_cli_evaluation.py after config load
- Built-in adapters remain in the default map unchanged

config/default.yaml:
- cli.adapters: {} extension point documented and ready for custom entries

Shared HumanSimulator (orchestrator → AdapterConfig → adapters):
- AdapterConfig gains simulator: HumanSimulator | None field (TYPE_CHECKING
  guard prevents circular import at runtime)
- orchestrator.run_cli_evaluation() constructs HumanSimulator once with
  full document context (vision, tech_env, openapi) and injects into config
- kiro_cli.py: reads config.simulator instead of building locally — removes
  9 lines of duplicate vision/tech_env reads and HumanSimulator construction
- claude_code_sdk.py: reads config.simulator instead of building locally —
  removes 9 lines of duplicate construction; executor Bedrock client
  (separate from simulator) is still built locally as before

Both adapters now raise RuntimeError if config.simulator is None, making
the dependency explicit rather than silently falling back to no-review.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
ARCHITECTURE.md:
- New section 6.5 CLI Evaluation: documents the cli harness pipeline,
  HumanSimulator injection pattern, simulator gate approach, and plugin
  registration
- New cookbook section "Adding a New CLI Adapter" with a complete worked
  example (Step 1 implement, Step 2 register in config, Step 3 verify)
  and a contracts table covering CLIAdapter, simulator, normalizer,
  post-run tests, and document context fields
- Cross-references plugin registration anchor

CONTRIBUTING.md:
- Updated package list to include cli-harness, ide-harness, trend-reports
- Updated work streams table with CLI Adapters, IDE Adapters, Trend Reporting rows

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

@github-advanced-security github-advanced-security AI left a comment


Semgrep OSS found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

harmjeff and others added 4 commits April 30, 2026 14:57
…dings

Semgrep flags the closing paren of multi-line subprocess calls when the
suppression comment is on the preceding line. Moved all suppressions to
inline comments on the subprocess.run/Popen line itself so semgrep
correctly associates the suppression with the finding.

Affected:
- run.py: check_docker_sandbox() — two container CLI info/images calls
- run_git_compare.py:372 — run_evaluation.py subprocess call
- kiro_cli.py:181 — Popen for kiro-cli chat subprocess
- claude_code_sdk.py:312 — run_command tool subprocess.run

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…xtures

test_credential_scrubber.py intentionally contains fake credentials
(example JWT, placeholder GitHub tokens, dummy API keys) to test the
scrubbing logic. Add .gitleaks.toml to suppress these false positives.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…lowlist

ARCHITECTURE.md: align all table cell widths to match separator row
exactly so MD060 table-column-style passes with the repo's 'aligned'
style config. Replaced Unicode em dashes with ASCII hyphens in table
cells to avoid byte-vs-char width discrepancy.

.gitleaks.toml: suppress false positives in test_credential_scrubber.py
which intentionally uses fake JWT/GitHub tokens to test scrubbing logic.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
CI runs semgrep with --config=r/all which uses the full registry rule ID:
  python.lang.security.audit.dangerous-subprocess-use-audit.dangerous-subprocess-use-audit

Our previous suppressions used the short form 'dangerous-subprocess-use-audit'
which only matches local/custom configs. Updated all five suppression
comments to use the full dotted rule ID so CI correctly ignores them.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR expands scripts/aidlc-evaluator/ with (1) a new git-ref comparison runner and reports, and (2) a CLI-evaluation architecture that standardizes “human reviewer” simulation across execution modes via a shared HumanSimulator, plus plugin-style CLI adapter registration.

Changes:

  • Add git-compare / git-compare-report modes with markdown+HTML reporting and (optional) parallel runs.
  • Introduce a shared Anthropic SDK–based HumanSimulator used by Strands execution, claude-code-sdk, and kiro-cli stage gates.
  • Add config-driven CLI adapter registration (cli.adapters) and wire it into the CLI evaluation entrypoint.

Reviewed changes

Copilot reviewed 28 out of 29 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| scripts/aidlc-evaluator/uv.lock | Adds new deps (Anthropic SDK + boto3) to workspace lockfile. |
| scripts/aidlc-evaluator/scripts/run_git_compare.py | New git-ref comparison runner + report generation + optional parallel execution. |
| scripts/aidlc-evaluator/scripts/run_evaluation.py | Adds --rules-repo, changes run-folder allocation behavior for stage execution. |
| scripts/aidlc-evaluator/scripts/run_cli_evaluation.py | Loads adapters from config and adds --simulator-model plumbed into orchestrator. |
| scripts/aidlc-evaluator/scripts/regenerate_git_compare_report.py | New utility to regenerate git-compare reports without re-running evaluations. |
| scripts/aidlc-evaluator/scripts/generate_html_report.py | New interactive HTML report generator (Chart.js) for git-compare results. |
| scripts/aidlc-evaluator/run.py | Adds new modes and container sandbox preflight (docker/podman). |
| scripts/aidlc-evaluator/pyproject.toml | Adds aidlc-cli-harness workspace dependency. |
| scripts/aidlc-evaluator/packages/shared/src/shared/sandbox.py | Adds container runtime auto-detection (docker/podman/finch) for sandbox runs. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/runner.py | Switches simulator to a tool-based HumanSimulator integration; updates run-folder creation semantics. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/metrics.py | Adds separate simulator token usage bucket to metrics. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/cli.py | Adds --openapi passthrough into runner for simulator contract-awareness. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/agents/simulator.py | Extracts build_simulator_system_prompt() and adds OpenAPI contract injection. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/agents/executor.py | Allows injecting simulator tool into executor tools. |
| scripts/aidlc-evaluator/packages/execution/pyproject.toml | Adds Anthropic SDK + boto3 dependencies. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/simulator.py | New shared Anthropic Bedrock HumanSimulator with file tools and usage tracking. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/registry.py | Adds adapter registration + config-based adapter loading. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/prompt_template.py | Adds OpenAPI injection and “pause for reviewer” vs “self-approve” prompt modes. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/orchestrator.py | Constructs and injects shared HumanSimulator; fixes run_evaluation path. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/adapters/kiro_cli.py | Adds per-stage simulator gates and post-run tests integration. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/adapters/claude_code_sdk.py | New Anthropic SDK adapter with inline simulator handoffs + tool loop. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/adapters/claude_code.py | Injects OpenAPI into prompt for the subprocess-based adapter. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/adapter.py | Extends AdapterConfig with simulator/openapi/aws fields. |
| scripts/aidlc-evaluator/packages/cli-harness/pyproject.toml | Adds Anthropic SDK + boto3 dependencies. |
| scripts/aidlc-evaluator/docker/sandbox/Dockerfile | Pins sandbox image to Python 3.13. |
| scripts/aidlc-evaluator/config/default.yaml | Adds cli.adapters extension point. |
| scripts/aidlc-evaluator/CONTRIBUTING.md | Updates package/work-stream documentation. |
| scripts/aidlc-evaluator/ARCHITECTURE.md | Documents CLI evaluation architecture and plugin adapter cookbook. |
| scripts/aidlc-evaluator/.gitleaks.toml | Adds gitleaks allowlist for known fake-credential fixtures. |



```python
from aidlc_runner.agents.executor import create_executor
from aidlc_runner.agents.simulator import create_simulator
from aidlc_runner.agents.simulator import build_simulator_system_prompt
```
Comment on lines +484 to +493
```python
session_kwargs: dict = {}
if config.aws_profile:
    session_kwargs["profile_name"] = config.aws_profile
boto_session = boto3.Session(**session_kwargs)
frozen = boto_session.get_credentials().get_frozen_credentials()
client = anthropic.AnthropicBedrock(
    aws_access_key=frozen.access_key,
    aws_secret_key=frozen.secret_key,
    aws_session_token=frozen.token,
    aws_region=aws_region,
```
Member


This seems like a valid check here. What are your thoughts?

Comment thread scripts/aidlc-evaluator/scripts/run_evaluation.py
```python
    "--scorer-model", scorer_model,
    "--rules-ref", version.ref,
    "--report-format", "both",
    "--output-dir", str(run_folder),  # Pass full folder path, not parent dir
```
Comment thread scripts/aidlc-evaluator/scripts/generate_html_report.py
Comment thread scripts/aidlc-evaluator/scripts/run_git_compare.py

```python
@property
def accumulated_usage(self) -> dict[str, int]:
    """Token totals across all respond() calls, keyed by snake_case names
```
Comment on lines +75 to +82
```python
def _container_cli() -> str | None:
    """Return the first available container CLI: docker or podman."""
    import shutil
    for cli in ("docker", "podman"):
        if shutil.which(cli):
            return cli
    return None
```

Also writes a sentinel file (``{output_dir}/.last_run_folder``) containing
the absolute path of the new run folder so that parent orchestrators can
discover the folder without racy before/after directory listing.
Also writes a sentinel file (``{output_dir.parent}/.last_run_folder``) in
Semgrep requires the suppression comment to be on the exact line of the
finding. Preceding-line comments are not reliably associated with the
call when both the finding and suppression are new (introduced in this PR).

Moved all five nosemgrep suppressions to inline on the subprocess.run()
/ Popen() line itself, using the full rule ID:
  python.lang.security.audit.dangerous-subprocess-use-audit.dangerous-subprocess-use-audit

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@harmjeff harmjeff added the dependencies label (Pull requests that update a dependency file) May 1, 2026
@harmjeff harmjeff enabled auto-merge May 1, 2026 14:27
harmjeff and others added 2 commits May 1, 2026 11:25
Documents the one-pass scan sequence, correct nosemgrep inline syntax,
full rule IDs required for CI (--config=r/all), and how to distinguish
the live 'semgrep' CI job from the stale 'Semgrep OSS' code scanning
annotations.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
# Multi-language sandbox image for running AI-generated code in isolation.
#
# Includes Python 3.14 + uv, Node.js 22 + npm, and common build tools.
# Includes Python 3.13 + uv, Node.js 22 + npm, and common build tools.
Member

was there something that caused 3.14 -> 3.13?

Comment on lines +15 to +19
import json
import logging
import os
import shlex
import subprocess
Member

There are RCE and other vulnerability risks from importing these at the top level. All inputs and outputs must be treated as untrusted.

return resolved.read_text(encoding="utf-8")

elif name == "run_command":
command = tool_input["command"]
Member

How might the tool's input be validated?
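One possible answer, sketched under assumptions (the allowlist contents and helper name are illustrative, not from the PR): tokenize with `shlex`, allowlist the binary, and reject path-escape arguments before anything reaches `subprocess`.

```python
import shlex

# Hypothetical allowlist; the real set would come from evaluator config.
ALLOWED_BINARIES = {"python", "pytest", "uv", "npm", "node"}

def validate_command(command: str) -> list[str]:
    """Parse and validate an untrusted command string before execution."""
    argv = shlex.split(command)
    if not argv:
        raise ValueError("empty command")
    if argv[0] not in ALLOWED_BINARIES:
        raise ValueError(f"binary not allowed: {argv[0]}")
    # Reject absolute paths and parent-directory traversal in arguments.
    if any(tok.startswith("/") or ".." in tok for tok in argv[1:]):
        raise ValueError("path escape rejected")
    return argv
```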

Member

@scottschreckengaust scottschreckengaust left a comment

LGTM

Change inline full-rule-ID nosemgrep comments to preceding-line
short-name format (# nosemgrep: dangerous-subprocess-use-audit),
matching the pattern used throughout the rest of the codebase that
the Semgrep OSS GitHub App correctly honours.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Contributor

@Kalindi-Dev Kalindi-Dev left a comment

@harmjeff
Few suggestions:

1. Frozen credentials can expire during long runs

frozen = boto_session.get_credentials().get_frozen_credentials()
client = anthropic.AnthropicBedrock(
    aws_access_key=frozen.access_key,
    aws_secret_key=frozen.secret_key,
    aws_session_token=frozen.token,
    ...
)

With a 2-hour timeout and STS session tokens often lasting 1 hour, the credentials could expire mid-run. The AnthropicBedrock client doesn't refresh them. This affects both the executor and simulator.
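One mitigation sketch (names are illustrative; it assumes the factory callable re-freezes credentials from the boto3 session on every rebuild, so botocore's refreshable provider can hand over rotated STS tokens):

```python
import time

class RefreshingClientFactory:
    """Rebuild a client before its frozen credentials expire (sketch).

    `make_client` should call get_frozen_credentials() itself so each
    rebuild snapshots freshly rotated credentials; `ttl_seconds` must sit
    safely under the session lifetime (e.g. 45 min for 1-hour tokens).
    """

    def __init__(self, make_client, ttl_seconds: float = 45 * 60):
        self._make = make_client
        self._ttl = ttl_seconds
        self._client = None
        self._created_at = float("-inf")

    def get(self):
        # Monotonic clock: immune to wall-clock adjustments mid-run.
        now = time.monotonic()
        if self._client is None or now - self._created_at > self._ttl:
            self._client = self._make()
            self._created_at = now
        return self._client
```

Call `factory.get()` at each executor/simulator turn instead of holding one client for the full two-hour run.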


2. Token counting may overcount

@property
def total(self) -> int:
    return self.input_tokens + self.output_tokens + self.cache_read_tokens + self.cache_write_tokens

Cache read/write tokens are typically a breakdown of input tokens (not additive). Summing all four could inflate totals.
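If that breakdown assumption holds for the Bedrock usage payloads (worth verifying against actual responses), the fix is to keep reporting the cache counts without re-adding them — a minimal sketch:

```python
from dataclasses import dataclass

@dataclass
class TokenUsage:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0

    @property
    def total(self) -> int:
        # Assumes cache_read/cache_write are a breakdown of input_tokens,
        # so they are tracked for reporting but not re-added here.
        return self.input_tokens + self.output_tokens
```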


3. No unit tests for 800+ lines of new code

The PR adds claude_code_sdk.py (587 lines), simulator.py (229 lines), and significantly rewrites kiro_cli.py — but no test files. The tool execution logic (_exec_tool, _resolve_safe), the token tracking, and the stage-gate logic
are all testable in isolation without LLM calls.
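For example, a path-confinement helper along the lines of `_resolve_safe` (signature assumed here, not taken from the PR) can be pinned down with dependency-free tests:

```python
from pathlib import Path

def resolve_safe(root: Path, candidate: str) -> Path:
    """Hypothetical stand-in for _resolve_safe: confine a path to root."""
    resolved = (root / candidate).resolve()
    if not resolved.is_relative_to(root.resolve()):
        raise ValueError(f"path escapes sandbox: {candidate}")
    return resolved

def test_allows_paths_inside_root():
    assert resolve_safe(Path("/srv/sandbox"), "src/app.py").name == "app.py"

def test_rejects_parent_escape():
    try:
        resolve_safe(Path("/srv/sandbox"), "../../etc/passwd")
    except ValueError:
        return
    assert False, "expected ValueError"
```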

harmjeff and others added 3 commits May 5, 2026 18:04
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Comment thread scripts/aidlc-evaluator/scripts/run_git_compare.py Fixed
…re.py

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Comment thread scripts/aidlc-evaluator/scripts/run_git_compare.py Fixed
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Contributor

@leandrodamascena leandrodamascena left a comment

Hey @harmjeff, excellent work on this PR. I created a new adapter for OpenCode locally and it worked really well. The CLIAdapter interface + plugin registry made it trivial to add a new CLI tool without touching any framework code, this is really nice.

I ran both claude-code and opencode against the sci-calc test case and results are pretty similar. The only issue I found was with token usage reporting for OpenCode, but I know why and how to fix.

I have the adapter file ready if you want to include it in this PR or as a follow-up.

After you fix the bug with --no-sandbox flag, we are good to merge and then work with @scottschreckengaust in another PR to add this in CodeBuild.

report_opencode.html

report_claude.html

Comment on lines +250 to +270
sandbox_disabled = "--no-sandbox" in remaining
if args.mode in DOCKER_DEPENDENT_MODES and not sandbox_disabled:
if not check_docker_sandbox():
print("=" * 70, file=sys.stderr)
print("ERROR: Docker sandbox image not found", file=sys.stderr)
print("=" * 70, file=sys.stderr)
print(file=sys.stderr)
print("The evaluation framework requires the Docker sandbox image", file=sys.stderr)
print("'aidlc-sandbox:latest' to run generated code safely.", file=sys.stderr)
print(file=sys.stderr)
print("To build the image, run:", file=sys.stderr)
print(" ./docker/sandbox/build.sh", file=sys.stderr)
print(file=sys.stderr)
print("Or manually:", file=sys.stderr)
print(" docker build -t aidlc-sandbox:latest docker/sandbox/", file=sys.stderr)
print(file=sys.stderr)
print("To run without Docker (not recommended for untrusted code),", file=sys.stderr)
print("set 'execution.sandbox.enabled: false' in config/default.yaml", file=sys.stderr)
print("=" * 70, file=sys.stderr)
sys.exit(1)

Contributor

Hey Jeff, I was testing the CLI adapter locally and I found a bug with --no-sandbox. The flag is checked in run.py to skip the Docker preflight, but it's forwarded unchanged to the sub-script via cmd.extend(remaining). Since run_cli_evaluation.py doesn't recognize it, you get unrecognized arguments: --no-sandbox. Without Docker you're stuck either way.

Quick fix in scripts/aidlc-evaluator/run.py:

     sandbox_disabled = "--no-sandbox" in remaining
     if args.mode in DOCKER_DEPENDENT_MODES and not sandbox_disabled:
         ...

+    if sandbox_disabled:
+        remaining = [arg for arg in remaining if arg != "--no-sandbox"]
+
     cmd = [sys.executable, str(script)]


Labels

dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation

6 participants