feat(evaluator): parity sync, Claude SDK adapter, kiro per-stage gates, shared simulator #235
harmjeff wants to merge 37 commits into
Conversation
…mulator
Parity with gitlab/aidlc-regression:
- run.py: add git-compare / git-compare-report modes, Docker sandbox preflight check (DOCKER_DEPENDENT_MODES + check_docker_sandbox())
- run_evaluation.py: restore --rules-repo flag, pass through to aidlc_runner, save to evaluation-config.yaml; add direct-folder shortcut in stage_execute() so the git-compare orchestrator can specify exact timestamped output paths
- scripts/: add run_git_compare.py, regenerate_git_compare_report.py, generate_html_report.py (hard dependency of git-compare)
New: claude-code-sdk adapter (packages/cli-harness):
- ClaudeCodeSDKAdapter drives the executor via anthropic.AnthropicBedrock instead of a `claude -p` subprocess, enabling interactive mid-workflow handoffs to an embedded Human Simulator
- Reuses EXECUTOR_SYSTEM_PROMPT and SIMULATOR_SYSTEM_PROMPT_TEMPLATE verbatim from packages/execution — no prompt duplication
- Simulator is injected on each handoff_to_simulator tool call; only file tools (read/write/list) are available to the simulator, not run_command
- Token usage is tracked separately for executor and simulator buckets; compatible with the existing run-metrics.yaml schema
- AdapterConfig gains simulator_model and aws_region fields
- orchestrator.py passes aws_region into AdapterConfig
- cli-harness pyproject.toml adds anthropic[bedrock]>=0.40, boto3>=1.42.47
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… with --no-sandbox
check_docker_sandbox() now tries podman if docker is not on PATH. The preflight is skipped entirely when --no-sandbox is passed so the evaluator can run on hosts without a container runtime.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…r allocation
Container runtime (sandbox.py):
- Replace all hardcoded 'docker' CLI calls with _get_container_cli() which
probes docker, podman, and finch in order and caches the result
- is_docker_available() now works with any of the three runtimes
- sandbox_run/run_detached/stop/is_running/logs all use the detected CLI
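For reference, a minimal sketch of the probe-and-cache behaviour described above; the module-level cache variable is an assumption, and sandbox.py remains authoritative:

```python
# Sketch only: probe docker, podman, finch in order and cache the first hit.
import shutil

_CACHED_CLI: str | None = None

def _get_container_cli() -> str | None:
    """Return the first of docker/podman/finch found on PATH, caching the result."""
    global _CACHED_CLI
    if _CACHED_CLI is None:
        for cli in ("docker", "podman", "finch"):
            if shutil.which(cli):
                _CACHED_CLI = cli
                break
    return _CACHED_CLI
```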
Run folder allocation (run_evaluation.py):
- stage_execute() now pre-allocates the exact timestamped run folder
(same {timestamp}-{slug} format as runner.py) before invoking aidlc_runner,
then passes it as --output-dir so the runner uses it directly (Mode 1)
- Eliminates sentinel file reading, before/after directory diffing, and the
direct-folder heuristic — the path is deterministic from the start
- Enables safe parallel execution: each caller pre-allocates its own folder
- Removed now-dead helpers: _SENTINEL_NAME, _read_run_sentinel,
_list_run_folders, _find_new_run
- Added _rules_slug() helper to mirror runner.py slug generation
- stage_execute() signature gains cfg_data: dict parameter
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
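A hedged sketch of the deterministic pre-allocation described in this commit; the exact slug rules and timestamp format live in runner.py, and these helpers are plausible stand-ins rather than the shipped code:

```python
# Sketch: allocate the exact {timestamp}-{slug} folder before the run starts,
# so the path is deterministic and safe for parallel callers.
from datetime import datetime
from pathlib import Path

def _rules_slug(rules_ref: str) -> str:
    """Approximation of runner.py's slug: lowercase, non-alphanumerics to '-'."""
    return "".join(c if c.isalnum() else "-" for c in rules_ref.lower()).strip("-")

def preallocate_run_folder(output_root: Path, rules_ref: str) -> Path:
    """Create the run folder up front so the runner can use it directly (Mode 1)."""
    folder = output_root / f"{datetime.now():%Y%m%d-%H%M%S}-{_rules_slug(rules_ref)}"
    folder.mkdir(parents=True, exist_ok=True)  # exist_ok for idempotency
    return folder
```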
Without this, anthropic[bedrock] was not installed in the shared venv and the claude-code-sdk adapter could not be loaded.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…path
runner.py:
- Add Mode 1: if --output-dir is already a timestamped folder name, use it directly instead of creating a nested subfolder — required for the deterministic run folder pre-allocation in run_evaluation.py to work
- Sentinel write now only happens in Mode 2 (new subfolder)
- mkdir calls use exist_ok=True for idempotency
claude_code_sdk.py:
- Use config.rules_path directly instead of re-copying rules (orchestrator already set them up at output_dir/aidlc-rules before calling adapter.run())
- Fix credentials: use get_frozen_credentials(), not .resolve()
- Fix default model: global.anthropic.claude-opus-4-6-v1 (not a fake 4-7 ID)
- Handle missing 'content' key in write_file tool input gracefully
orchestrator.py:
- Fix run_evaluation.py path: scripts/run_evaluation.py, not run_evaluation.py at repo root
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Calls run_post_evaluation() from packages/execution after the executor loop completes, matching the behaviour of the Strands runner. Detects the project type in workspace/, installs deps, runs tests, and writes test-results.yaml. Sandbox is enabled when a container runtime (docker/podman/finch) is available.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
All three execution modes now give the human-analog agent full visibility
into the OpenAPI contract during design reviews and code review handoffs.
Strands swarm (full mode):
- simulator.py: SIMULATOR_SYSTEM_PROMPT_TEMPLATE gains {openapi_section}
placeholder; create_simulator() accepts openapi_content parameter and
renders a binding API contract section into the system prompt
- runner.py: run() accepts openapi_path, reads it, passes to create_simulator()
- cli.py: --openapi flag added; forwarded to run()
- run_evaluation.py stage_execute(): passes --openapi to aidlc_runner when present
Claude SDK adapter (cli mode, claude-code-sdk):
- Reads config.openapi_content and renders the same API contract section
into the simulator system prompt before the executor loop starts
CLI subprocess adapters (claude-code, kiro-cli):
- prompt_template.py: render_prompt() accepts openapi_content; injects it
as a binding contract section so the self-approving executor has the
full spec in view during design and code generation
- Both adapters pass config.openapi_content to render_prompt()
Plumbing:
- AdapterConfig gains openapi_content: str | None field
- orchestrator.run_cli_evaluation() reads openapi_path and populates it
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
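Illustrative only: the shape of the OpenAPI contract injection described above. The section heading and wording are assumptions; the real text lives in the prompt templates:

```python
# Sketch: render the binding API-contract block, or nothing when no spec is given.
def render_openapi_section(openapi_content: str | None) -> str:
    if not openapi_content:
        return ""
    return (
        "## API Contract (binding)\n"
        "The implementation MUST conform to this OpenAPI specification:\n\n"
        + openapi_content
        + "\n"
    )
```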
simulator.py (execution package):
- Extract build_simulator_system_prompt() as a standalone function so
other adapters can construct the same prompt without Strands dependencies
- create_simulator() now delegates to it (no behaviour change)
simulator.py (cli-harness, new):
- HumanSimulator class: Anthropic SDK-based reviewer backed by
build_simulator_system_prompt() — single implementation used by both
kiro-cli and claude-code-sdk
- HumanSimulator.from_adapter_config() constructs from AdapterConfig fields
- HumanSimulator.respond() runs the turn loop with file tool support
- _exec_file_tool() provides read/write/list scoped to run_folder
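The helper names _exec_file_tool and _resolve_safe come from the PR; the sketch below shows the run_folder scoping they rely on, with the implementation details assumed:

```python
# Sketch: resolve tool-supplied paths and refuse anything outside run_folder.
from pathlib import Path

def _resolve_safe(run_folder: Path, relative: str) -> Path:
    resolved = (run_folder / relative).resolve()
    if not resolved.is_relative_to(run_folder.resolve()):
        raise ValueError(f"path escapes run folder: {relative}")
    return resolved

def read_file(run_folder: Path, relative: str) -> str:
    return _resolve_safe(run_folder, relative).read_text(encoding="utf-8")
```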
kiro_cli.py:
- Replaces hardcoded "Approve & Continue" resumption with HumanSimulator.respond()
- Each kiro turn's output is fed to the simulator; its response resumes the session
- render_prompt() called with with_simulator=True so executor pauses at gates
instead of self-approving
claude_code_sdk.py:
- Replaces inline _run_simulator_turn() loop with HumanSimulator.respond()
- _SIMULATOR_TOOLS constant removed (owned by HumanSimulator now)
- Dead tech_env_section / openapi_section string-building removed
prompt_template.py:
- {approval_rule} placeholder replaces hardcoded self-approve instruction
- _SELF_APPROVE_RULE / _SIMULATOR_HANDOFF_RULE constants
- render_prompt(with_simulator=True) injects the handoff rule
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
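A minimal sketch of the {approval_rule} toggle described above; the rule wording here is invented, and the real constants live in prompt_template.py:

```python
# Sketch: swap the self-approve instruction for a reviewer-handoff rule.
_SELF_APPROVE_RULE = "At each AIDLC gate, approve your own work and continue."
_SIMULATOR_HANDOFF_RULE = (
    "At each AIDLC gate, stop and hand off to the human reviewer; "
    "do not continue until you receive their feedback."
)

def render_prompt(template: str, *, with_simulator: bool = False) -> str:
    rule = _SIMULATOR_HANDOFF_RULE if with_simulator else _SELF_APPROVE_RULE
    return template.format(approval_rule=rule)
```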
run_cli_evaluation.py:
- Add --simulator-model flag; resolve from config models.simulator.model_id when not provided (same pattern as --scorer-model)
- Pass simulator_model through to run_cli_evaluation()
orchestrator.py:
- run_cli_evaluation() accepts a simulator_model parameter
- Populates AdapterConfig.simulator_model so HumanSimulator uses the same model as the Strands swarm simulator instead of falling back to the executor model
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
All three execution modes now use the same HumanSimulator implementation:
execution/runner.py:
- _make_simulator_tool() factory creates a Strands @tool wrapping HumanSimulator.respond() — the executor calls handoff_to_simulator as a direct tool instead of routing via a second Swarm agent
- Swarm remains single-agent [executor]; MetricsCollector unchanged
- Strands simulator Agent (create_simulator) removed from runner.py
execution/agents/executor.py:
- create_executor() accepts an optional simulator_tool parameter and appends it to the executor's tool list when provided
execution/pyproject.toml:
- Add anthropic[bedrock]>=0.40 and boto3 deps (needed by HumanSimulator)
Result: vision + tech_env + openapi are injected into the same build_simulator_system_prompt() for full, cli (claude-code-sdk), and cli (kiro-cli) modes. One implementation, three entry points.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
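A sketch of the tool-factory pattern above, assuming the Strands @tool decorator; the docstring and parameter name are illustrative:

```python
# Sketch: wrap HumanSimulator.respond() as a direct executor tool.
from strands import tool

def _make_simulator_tool(simulator):
    @tool
    def handoff_to_simulator(message: str) -> str:
        """Hand the current artifacts to the human-analog reviewer and return feedback."""
        return simulator.respond(message)

    # The instance is returned alongside the tool so callers can inspect it later.
    return handoff_to_simulator, simulator
```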
…ll mode
HumanSimulator now accumulates token usage per respond() call in _input_tokens/_output_tokens etc., exposed via an accumulated_usage property using the same camelCase key format as Strands accumulated_usage.
runner.py:
- _make_simulator_tool() returns (tool, simulator_instance) so the instance can be inspected after the swarm completes
- After the swarm finishes, calls collector.record_simulator_usage() with simulator_instance.accumulated_usage to capture SDK tokens separately
metrics.py:
- MetricsCollector gains a _simulator_usage field and record_simulator_usage()
- build_metrics() injects _simulator_usage into per_agent["simulator"] so the output shape matches the old two-agent Swarm (executor + simulator buckets), with the simulator bucket now sourced from Anthropic SDK usage rather than Strands accumulated_usage
Result: run-metrics.yaml tokens.per_agent has distinct "executor" and "simulator" entries with no mixing, for all three execution modes.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
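A hedged sketch of the per-call accumulation; the class name is invented, and the camelCase keys follow the Strands accumulated_usage convention noted above:

```python
# Sketch: accumulate Anthropic SDK usage across respond() calls.
class SimulatorUsage:
    def __init__(self) -> None:
        self._input_tokens = 0
        self._output_tokens = 0

    def record(self, usage) -> None:
        """Add one Anthropic SDK response's usage to the running totals."""
        self._input_tokens += usage.input_tokens
        self._output_tokens += usage.output_tokens

    @property
    def accumulated_usage(self) -> dict[str, int]:
        return {
            "inputTokens": self._input_tokens,
            "outputTokens": self._output_tokens,
            "totalTokens": self._input_tokens + self._output_tokens,
        }
```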
Was removed during cleanup but still referenced in the post-run test sandbox detection block. Now imported explicitly from the shared package.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Matches the claude-code-sdk adapter — calls run_post_evaluation() after normalize_output() so test-results.yaml is produced and stage 2 is reported in the evaluation summary.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Generated code targets Python 3.13 (via pyproject.toml requires-python >=3.13). The 3.14 base caused uv venv to pick up the 3.14 interpreter, which broke uvicorn startup in post-run tests and contract tests.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…reviews
Previous approach used a --no-interactive + --resume loop, but kiro ignores the pause instruction in the prompt and runs the entire workflow in one turn when --no-interactive is set. New approach: run kiro without --no-interactive and drive it via stdin.
- Single persistent kiro process with stdin=PIPE, stdout=PIPE
- Send initial prompt to stdin; read output with idle_timeout_s=8.0
- When kiro goes idle (waiting for input), call HumanSimulator.respond() with a prompt to read aidlc-docs and provide feedback
- Write the simulator response back to kiro's stdin
- Detect construction completion via aidlc-docs/construction/*.md
- Send /quit to close the session cleanly
This gives the simulator genuine review opportunities at each AIDLC gate rather than having kiro race through the entire workflow unreviewed.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
select.select on macOS with text-mode subprocess pipes deadlocks when the pipe buffer fills — kiro produces output, but readline() blocks forever because select never returns ready. Fix: bufsize=0 (unbuffered bytes); a background reader thread pushes raw chunks onto a queue.Queue, and _read_until_idle() drains it with queue.get(timeout=idle_s), which reliably detects silence on all platforms. All stdin writes updated to encode() for bytes mode.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
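A sketch of the unbuffered-reader pattern this commit describes: a daemon thread pumps raw chunks onto a queue, and a get() timeout stands in for "kiro has gone idle". Function names are illustrative:

```python
# Sketch: thread + queue replaces select.select for cross-platform idle detection.
import queue
import subprocess
import threading

def start_reader(proc: subprocess.Popen) -> "queue.Queue[bytes]":
    chunks: "queue.Queue[bytes]" = queue.Queue()

    def _pump() -> None:
        while True:
            chunk = proc.stdout.read(4096)  # bytes mode (bufsize=0)
            if not chunk:  # EOF: process closed its stdout
                break
            chunks.put(chunk)

    threading.Thread(target=_pump, daemon=True).start()
    return chunks

def read_until_idle(chunks: "queue.Queue[bytes]", idle_s: float = 8.0) -> bytes:
    """Drain output until no chunk arrives for idle_s seconds."""
    buf = b""
    while True:
        try:
            buf += chunks.get(timeout=idle_s)
        except queue.Empty:
            return buf  # silence for idle_s seconds -> treat as idle
```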
…t on idle
_read_until_idle() previously only returned when kiro went quiet for idle_timeout_s seconds. If kiro kept streaming output after completing the AIDLC workflow, the adapter would never detect completion. It now checks _is_complete() (construction/*.md exists) after processing each chunk, so the adapter exits the read loop as soon as the workflow is done, regardless of whether kiro has stopped talking.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…le wait timeout
/quit is not a valid kiro command — the process wouldn't exit, causing process.wait(timeout=5) to hang and eventually trigger the 7200s timeout. Now uses process.kill() with a guarded wait on completion and in the outer cleanup block. total_rc is set to 0 when we initiate the kill (workflow completed successfully).
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Previously kiro's output was only written to the session log (or shown in full with --verbose). Now meaningful lines (stage transitions, file creations, thinking steps, responses) are printed to stderr via _log with a [kiro] prefix, filtered to skip spinner frames and credit footers, with consecutive duplicate lines removed.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ighten skip list
Previous approach split on newlines in chunks, but kiro streams word-by-word, so most chunks had no newlines and every word appeared as a separate 'line'. Now accumulates chars into _line_buf and flushes on '\n', giving complete lines to the filter. Extended skip list to suppress spinners, box-drawing, ANSI remnants, help text, and credit/model footers. Only substantive content (stage transitions, file operations, responses) reaches stderr.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…review
Root cause: kiro's steering file drives it to full completion regardless of the user prompt, so interactive stdin and pause instructions are ignored. The only reliable gate mechanism is kiro's --resume flag. New approach:
- Phase 1: run kiro with --no-interactive; the prompt instructs it to execute INCEPTION ONLY and stop after execution-plan.md is written
- Simulator: reads inception artifacts from aidlc-docs/inception/, reviews requirements/design/plan, provides feedback and direction
- Phase 2: resume kiro with --resume [simulator feedback]; the prompt directs it to proceed with Construction using the feedback
This gives the simulator a genuine review of the inception artifacts before construction begins, using kiro's native conversation continuation rather than fighting its steering rules.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replaces the single inception/construction two-phase split with four individual stage gates, each backed by a simulator review:
Gate 1 — Requirements Analysis: kiro writes requirements.md + requirement-verification-questions.md; the simulator answers verification questions and approves the requirements.
Gate 2 — Workflow Planning + Application Design: kiro writes execution-plan.md + full application design docs; the simulator approves the workflow plan and architecture.
Gate 3 — Code Generation Plan: kiro writes the code generation plan (no code yet); the simulator approves the plan before any code is written.
Gate 4 — Code Generation + Build and Test: kiro generates all code, runs tests, and writes the build summary. (No simulator gate after — construction is final.)
Each stage uses --no-interactive + --resume so kiro exits cleanly at the sentinel file boundary, the simulator reviews aidlc-docs/, and the feedback is injected into the next --resume prompt. A sketch of this gate loop follows.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
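The sketch below illustrates the four-gate loop; the sentinel for Gate 3 and the run_kiro/simulator interfaces are assumptions, while requirements.md and execution-plan.md come from the commit message:

```python
# Sketch: each gate runs kiro to a sentinel boundary, then injects review feedback.
GATES = [
    ("requirements.md", "Review requirements; answer the verification questions."),
    ("execution-plan.md", "Review the workflow plan and application design."),
    ("code-generation-plan.md", "Review the code generation plan before code is written."),
]

def run_stage_gates(run_kiro, simulator) -> None:
    feedback: str | None = None
    for _sentinel, review_prompt in GATES:
        run_kiro(resume=feedback)   # kiro writes up to the sentinel, then exits
        feedback = simulator.respond(review_prompt)
    run_kiro(resume=feedback)       # Gate 4: code gen + build/test, no review after
```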
…lator
Plugin adapter registration (registry.py):
- register_adapter(name, fqn) adds an adapter at runtime without editing
framework code
- load_adapters_from_config(cfg_data) reads cli.adapters from config YAML
and registers each entry; called from run_cli_evaluation.py after config load
- Built-in adapters remain in the default map unchanged
config/default.yaml:
- cli.adapters: {} extension point documented and ready for custom entries
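A minimal sketch of the registry mechanics described above; the internal map and import strategy are assumptions about registry.py:

```python
# Sketch: runtime adapter registration driven by config, no framework edits.
import importlib

_ADAPTERS: dict[str, str] = {}  # adapter name -> fully-qualified class path

def register_adapter(name: str, fqn: str) -> None:
    _ADAPTERS[name] = fqn

def load_adapters_from_config(cfg_data: dict) -> None:
    """Register every entry under cli.adapters in the loaded config."""
    for name, fqn in (cfg_data.get("cli", {}).get("adapters") or {}).items():
        register_adapter(name, fqn)

def resolve_adapter(name: str):
    """Import and return the adapter class for a registered name."""
    module_path, _, class_name = _ADAPTERS[name].rpartition(".")
    return getattr(importlib.import_module(module_path), class_name)
```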
Shared HumanSimulator (orchestrator → AdapterConfig → adapters):
- AdapterConfig gains simulator: HumanSimulator | None field (TYPE_CHECKING
guard prevents circular import at runtime)
- orchestrator.run_cli_evaluation() constructs HumanSimulator once with
full document context (vision, tech_env, openapi) and injects into config
- kiro_cli.py: reads config.simulator instead of building locally — removes
9 lines of duplicate vision/tech_env reads and HumanSimulator construction
- claude_code_sdk.py: reads config.simulator instead of building locally —
removes 9 lines of duplicate construction; executor Bedrock client
(separate from simulator) is still built locally as before
Both adapters now raise RuntimeError if config.simulator is None, making
the dependency explicit rather than silently falling back to no-review.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
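A sketch of the TYPE_CHECKING-guarded field described above; AdapterConfig's other fields are elided, and the dataclass shape is an assumption:

```python
# Sketch: the guard keeps the HumanSimulator import out of the runtime path.
from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only evaluated by type checkers, so no runtime circular import
    from cli_harness.simulator import HumanSimulator

@dataclass
class AdapterConfig:
    simulator: HumanSimulator | None = None  # injected by the orchestrator
```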
ARCHITECTURE.md:
- New section 6.5 CLI Evaluation: documents the cli harness pipeline, HumanSimulator injection pattern, simulator gate approach, and plugin registration
- New cookbook section "Adding a New CLI Adapter" with a complete worked example (Step 1 implement, Step 2 register in config, Step 3 verify) and a contracts table covering CLIAdapter, simulator, normalizer, post-run tests, and document context fields
- Cross-references the plugin registration anchor
CONTRIBUTING.md:
- Updated package list to include cli-harness, ide-harness, trend-reports
- Updated work streams table with CLI Adapters, IDE Adapters, Trend Reporting rows
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Semgrep OSS found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
…dings
Semgrep flags the closing paren of multi-line subprocess calls when the suppression comment is on the preceding line. Moved all suppressions to inline comments on the subprocess.run/Popen line itself so semgrep correctly associates the suppression with the finding. Affected:
- run.py: check_docker_sandbox() — two container CLI info/images calls
- run_git_compare.py:372 — run_evaluation.py subprocess call
- kiro_cli.py:181 — Popen for kiro-cli chat subprocess
- claude_code_sdk.py:312 — run_command tool subprocess.run
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…xtures
test_credential_scrubber.py intentionally contains fake credentials (example JWT, placeholder GitHub tokens, dummy API keys) to test the scrubbing logic. Add .gitleaks.toml to suppress these false positives.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…lowlist
ARCHITECTURE.md: align all table cell widths to match the separator row exactly so MD060 table-column-style passes with the repo's 'aligned' style config. Replaced Unicode em dashes with ASCII hyphens in table cells to avoid the byte-vs-char width discrepancy.
.gitleaks.toml: suppress false positives in test_credential_scrubber.py, which intentionally uses fake JWT/GitHub tokens to test the scrubbing logic.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
CI runs semgrep with --config=r/all, which uses the full registry rule ID: python.lang.security.audit.dangerous-subprocess-use-audit.dangerous-subprocess-use-audit. Our previous suppressions used the short form 'dangerous-subprocess-use-audit', which only matches local/custom configs. Updated all five suppression comments to use the full dotted rule ID so CI correctly ignores them.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR expands scripts/aidlc-evaluator/ with (1) a new git-ref comparison runner and reports, and (2) a CLI-evaluation architecture that standardizes “human reviewer” simulation across execution modes via a shared HumanSimulator, plus plugin-style CLI adapter registration.
Changes:
- Add git-compare / git-compare-report modes with markdown+HTML reporting and (optional) parallel runs.
- Introduce a shared Anthropic SDK–based HumanSimulator used by Strands execution, claude-code-sdk, and kiro-cli stage gates.
- Add config-driven CLI adapter registration (cli.adapters) and wire it into the CLI evaluation entrypoint.
Reviewed changes
Copilot reviewed 28 out of 29 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| scripts/aidlc-evaluator/uv.lock | Adds new deps (Anthropic SDK + boto3) to workspace lockfile. |
| scripts/aidlc-evaluator/scripts/run_git_compare.py | New git-ref comparison runner + report generation + optional parallel execution. |
| scripts/aidlc-evaluator/scripts/run_evaluation.py | Adds --rules-repo, changes run-folder allocation behavior for stage execution. |
| scripts/aidlc-evaluator/scripts/run_cli_evaluation.py | Loads adapters from config and adds --simulator-model plumbed into orchestrator. |
| scripts/aidlc-evaluator/scripts/regenerate_git_compare_report.py | New utility to regenerate git-compare reports without re-running evaluations. |
| scripts/aidlc-evaluator/scripts/generate_html_report.py | New interactive HTML report generator (Chart.js) for git-compare results. |
| scripts/aidlc-evaluator/run.py | Adds new modes and container sandbox preflight (docker/podman). |
| scripts/aidlc-evaluator/pyproject.toml | Adds aidlc-cli-harness workspace dependency. |
| scripts/aidlc-evaluator/packages/shared/src/shared/sandbox.py | Adds container runtime auto-detection (docker/podman/finch) for sandbox runs. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/runner.py | Switches simulator to a tool-based HumanSimulator integration; updates run-folder creation semantics. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/metrics.py | Adds separate simulator token usage bucket to metrics. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/cli.py | Adds --openapi passthrough into runner for simulator contract-awareness. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/agents/simulator.py | Extracts build_simulator_system_prompt() and adds OpenAPI contract injection. |
| scripts/aidlc-evaluator/packages/execution/src/aidlc_runner/agents/executor.py | Allows injecting simulator tool into executor tools. |
| scripts/aidlc-evaluator/packages/execution/pyproject.toml | Adds Anthropic SDK + boto3 dependencies. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/simulator.py | New shared Anthropic Bedrock HumanSimulator with file tools and usage tracking. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/registry.py | Adds adapter registration + config-based adapter loading. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/prompt_template.py | Adds OpenAPI injection and “pause for reviewer” vs “self-approve” prompt modes. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/orchestrator.py | Constructs and injects shared HumanSimulator; fixes run_evaluation path. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/adapters/kiro_cli.py | Adds per-stage simulator gates and post-run tests integration. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/adapters/claude_code_sdk.py | New Anthropic SDK adapter with inline simulator handoffs + tool loop. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/adapters/claude_code.py | Injects OpenAPI into prompt for the subprocess-based adapter. |
| scripts/aidlc-evaluator/packages/cli-harness/src/cli_harness/adapter.py | Extends AdapterConfig with simulator/openapi/aws fields. |
| scripts/aidlc-evaluator/packages/cli-harness/pyproject.toml | Adds Anthropic SDK + boto3 dependencies. |
| scripts/aidlc-evaluator/docker/sandbox/Dockerfile | Pins sandbox image to Python 3.13. |
| scripts/aidlc-evaluator/config/default.yaml | Adds cli.adapters extension point. |
| scripts/aidlc-evaluator/CONTRIBUTING.md | Updates package/work-stream documentation. |
| scripts/aidlc-evaluator/ARCHITECTURE.md | Documents CLI evaluation architecture and plugin adapter cookbook. |
| scripts/aidlc-evaluator/.gitleaks.toml | Adds gitleaks allowlist for known fake-credential fixtures. |
```diff
  from aidlc_runner.agents.executor import create_executor
- from aidlc_runner.agents.simulator import create_simulator
+ from aidlc_runner.agents.simulator import build_simulator_system_prompt
```
```python
session_kwargs: dict = {}
if config.aws_profile:
    session_kwargs["profile_name"] = config.aws_profile
boto_session = boto3.Session(**session_kwargs)
frozen = boto_session.get_credentials().get_frozen_credentials()
client = anthropic.AnthropicBedrock(
    aws_access_key=frozen.access_key,
    aws_secret_key=frozen.secret_key,
    aws_session_token=frozen.token,
    aws_region=aws_region,
```
This seems like a valid check here. What are your thoughts?
| "--scorer-model", scorer_model, | ||
| "--rules-ref", version.ref, | ||
| "--report-format", "both", | ||
| "--output-dir", str(run_folder), # Pass full folder path, not parent dir |
```python
@property
def accumulated_usage(self) -> dict[str, int]:
    """Token totals across all respond() calls, keyed by snake_case names
```
```python
def _container_cli() -> str | None:
    """Return the first available container CLI: docker or podman."""
    import shutil
    for cli in ("docker", "podman"):
        if shutil.which(cli):
            return cli
    return None
```
```diff
- Also writes a sentinel file (``{output_dir}/.last_run_folder``) containing
- the absolute path of the new run folder so that parent orchestrators can
- discover the folder without racy before/after directory listing.
+ Also writes a sentinel file (``{output_dir.parent}/.last_run_folder``) in
```
Semgrep requires the suppression comment to be on the exact line of the finding. Preceding-line comments are not reliably associated with the call when both the finding and the suppression are new (introduced in this PR). Moved all five nosemgrep suppressions to inline on the subprocess.run() / Popen() line itself, using the full rule ID: python.lang.security.audit.dangerous-subprocess-use-audit.dangerous-subprocess-use-audit
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Documents the one-pass scan sequence, correct nosemgrep inline syntax, full rule IDs required for CI (--config=r/all), and how to distinguish the live 'semgrep' CI job from the stale 'Semgrep OSS' code scanning annotations.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…n guidance"
This reverts commit 3e20c0a.
```diff
  # Multi-language sandbox image for running AI-generated code in isolation.
  #
- # Includes Python 3.14 + uv, Node.js 22 + npm, and common build tools.
+ # Includes Python 3.13 + uv, Node.js 22 + npm, and common build tools.
```
was there something that caused 3.14 -> 3.13?
```python
import json
import logging
import os
import shlex
import subprocess
```
There are RCE and other vulnerability risks in importing these at the top level. All inputs and outputs must be treated as untrusted.
```python
        return resolved.read_text(encoding="utf-8")
    elif name == "run_command":
        command = tool_input["command"]
```
How might the tool's input be validated?
Change inline full-rule-ID nosemgrep comments to preceding-line short-name format (# nosemgrep: dangerous-subprocess-use-audit), matching the pattern used throughout the rest of the codebase that the Semgrep OSS GitHub App correctly honours. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Kalindi-Dev
left a comment
@harmjeff
A few suggestions:
1. Frozen credentials can expire during long runs
```python
frozen = boto_session.get_credentials().get_frozen_credentials()
client = anthropic.AnthropicBedrock(
    aws_access_key=frozen.access_key,
    aws_secret_key=frozen.secret_key,
    aws_session_token=frozen.token,
    ...
)
```
With a 2-hour timeout and STS session tokens often lasting 1 hour, the credentials could expire mid-run. The AnthropicBedrock client doesn't refresh them. This affects both the executor and the simulator.
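One hedged mitigation (illustrative, not the PR's fix): re-freeze credentials from the live boto3 session before each request batch rather than once at startup, so botocore's own refresh logic can renew STS tokens:

```python
# Sketch: rebuild the client per turn so refreshed STS tokens are picked up.
import anthropic
import boto3

def bedrock_client(session: boto3.Session, region: str) -> anthropic.AnthropicBedrock:
    # get_credentials() returns a refreshable object for assumed-role sessions;
    # freezing at call time captures the latest token.
    frozen = session.get_credentials().get_frozen_credentials()
    return anthropic.AnthropicBedrock(
        aws_access_key=frozen.access_key,
        aws_secret_key=frozen.secret_key,
        aws_session_token=frozen.token,
        aws_region=region,
    )
```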
2. Token counting may overcount
```python
@property
def total(self) -> int:
    return self.input_tokens + self.output_tokens + self.cache_read_tokens + self.cache_write_tokens
```
Cache read/write tokens are typically a breakdown of input tokens (not additive). Summing all four could inflate totals.
3. No unit tests for 800+ lines of new code
The PR adds claude_code_sdk.py (587 lines), simulator.py (229 lines), and significantly rewrites kiro_cli.py — but no test files. The tool execution logic (_exec_tool, _resolve_safe), the token tracking, and the stage-gate logic
are all testable in isolation without LLM calls.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…re.py
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
leandrodamascena
left a comment
Hey @harmjeff, excellent work on this PR. I created a new adapter for OpenCode locally and it worked really well. The CLIAdapter interface + plugin registry made it trivial to add a new CLI tool without touching any framework code, this is really nice.
I ran both claude-code and opencode against the sci-calc test case and results are pretty similar. The only issue I found was with token usage reporting for OpenCode, but I know why and how to fix.
I have the adapter file ready if you want to include it in this PR or as a follow-up.
After you fix the bug with --no-sandbox flag, we are good to merge and then work with @scottschreckengaust in another PR to add this in CodeBuild.
```python
sandbox_disabled = "--no-sandbox" in remaining
if args.mode in DOCKER_DEPENDENT_MODES and not sandbox_disabled:
    if not check_docker_sandbox():
        print("=" * 70, file=sys.stderr)
        print("ERROR: Docker sandbox image not found", file=sys.stderr)
        print("=" * 70, file=sys.stderr)
        print(file=sys.stderr)
        print("The evaluation framework requires the Docker sandbox image", file=sys.stderr)
        print("'aidlc-sandbox:latest' to run generated code safely.", file=sys.stderr)
        print(file=sys.stderr)
        print("To build the image, run:", file=sys.stderr)
        print("  ./docker/sandbox/build.sh", file=sys.stderr)
        print(file=sys.stderr)
        print("Or manually:", file=sys.stderr)
        print("  docker build -t aidlc-sandbox:latest docker/sandbox/", file=sys.stderr)
        print(file=sys.stderr)
        print("To run without Docker (not recommended for untrusted code),", file=sys.stderr)
        print("set 'execution.sandbox.enabled: false' in config/default.yaml", file=sys.stderr)
        print("=" * 70, file=sys.stderr)
        sys.exit(1)
```
Hey Jeff, I was testing the CLI adapter locally and I found a bug with --no-sandbox. The flag is checked in run.py to skip the Docker preflight, but it's forwarded unchanged to the sub-script via cmd.extend(remaining). Since run_cli_evaluation.py doesn't recognize it, you get unrecognized arguments: --no-sandbox. Without Docker you're stuck either way.
Quick fix in scripts/aidlc-evaluator/run.py:
sandbox_disabled = "--no-sandbox" in remaining
if args.mode in DOCKER_DEPENDENT_MODES and not sandbox_disabled:
...
+ if sandbox_disabled:
+ remaining = [arg for arg in remaining if arg != "--no-sandbox"]
+
cmd = [sys.executable, str(script)]
Summary
Brings scripts/aidlc-evaluator to parity with the internal GitLab baseline and adds significant new CLI evaluation capabilities: a Claude SDK adapter with mid-workflow simulator handoffs, kiro-cli per-stage reviewer gates, a shared HumanSimulator injected by the orchestrator, and a plugin adapter registration system. All changes are scoped to scripts/aidlc-evaluator/.
Changes
Parity sync (GitLab → GitHub)
- run.py: add git-compare / git-compare-report modes; Docker/Podman/Finch preflight check
- scripts/run_evaluation.py: restore --rules-repo flag; deterministic run folder pre-allocation (eliminates sentinel file races, enables parallel execution)
- New scripts: run_git_compare.py, regenerate_git_compare_report.py, generate_html_report.py
Claude SDK adapter (--cli claude-code-sdk)
- ClaudeCodeSDKAdapter drives the AIDLC executor via anthropic.AnthropicBedrock turn-by-turn
- handoff_to_simulator tool calls inject human-analog reviews mid-workflow
- Reuses EXECUTOR_SYSTEM_PROMPT from packages/execution — no prompt duplication
Kiro-cli per-stage simulator gates
- Uses --no-interactive + --resume to inject simulator feedback between stages
Shared HumanSimulator
- HumanSimulator in packages/cli-harness/src/cli_harness/simulator.py used by all three execution modes (Strands swarm, claude-code-sdk, kiro-cli)
- Injected via AdapterConfig.simulator — adapters no longer construct it themselves
- build_simulator_system_prompt() is the single source of truth in packages/execution
Plugin adapter registration
- registry.register_adapter(name, fqn) and load_adapters_from_config(cfg_data) allow adding CLI adapters via config/default.yaml with no framework code changes
- config/default.yaml gains a cli.adapters: {} extension point
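As an illustration, a hypothetical config/default.yaml entry registering a custom adapter; the adapter name and class path are invented:

```yaml
# Hypothetical example: register an out-of-tree adapter with one config line.
cli:
  adapters:
    opencode: my_adapters.opencode.OpenCodeAdapter
```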
- shared/sandbox.py: auto-detect docker, podman, or finch as container runtime
- docker/sandbox/Dockerfile: pin to Python 3.13 (was 3.14, causing uvicorn startup failures in generated code)
- packages/execution/runner.py: Mode 1 direct-folder path for orchestrator-specified run folders
- run.py: --no-sandbox skips container preflight; podman recognised alongside docker
- ARCHITECTURE.md: new section 6.5 CLI Evaluation + full "Adding a New CLI Adapter" cookbook with worked example and contracts table
- CONTRIBUTING.md: updated package list and work streams table to include cli-harness, ide-harness, trend-reports
Before: The evaluator only supported the Strands two-agent swarm for full runs, with no programmatic CLI adapter having a human reviewer. The
claude-codeadapter ran as a one-shot subprocess with no simulation.After: Three execution modes all use the same human-analog simulator with full document context:
run.py full— Strands swarm with HumanSimulator toolrun.py cli --cli claude-code-sdk— SDK-driven executor with inline simulator handoffsrun.py cli --cli kiro-cli— kiro subprocess with 4 per-stage simulator review gatesNew adapters can be registered with one config line and zero framework changes.
Checklist
Test Plan
Expected results for each mode:
Acknowledgment
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.