awslabs · harmjeff · Apr 29, 2026
@@ -460,6 +460,36 @@
 
 Supported adapters: Cursor, Cline, Copilot, Kiro, Windsurf, Antigravity.
 
+### 6.5 CLI Evaluation (`run_cli_evaluation.py`)
+
+Runs the AIDLC workflow through CLI-based AI assistants (Claude Code, Kiro CLI, etc.):
+
+```text
+load_adapters_from_config(cfg_data)  ← register any custom adapters from config.yaml
+  │
+get_adapter(name)     ← lazy import from registry
+  │
+  ├── check_prerequisites()
+  ├── HumanSimulator built once by orchestrator (vision + tech_env + openapi injected)
+  ├── adapter.run(config) ──► CLI-specific automation + simulator gate reviews
+  ├── normalize_output()  ──► standard run folder layout
+  └── run_evaluation.py --evaluate-only  ──► stages 2-6
+```
+
+**Adapter pattern**: Each CLI tool is implemented as a subclass of `CLIAdapter` (`packages/cli-harness/src/cli_harness/adapter.py`) that provides three members (one property and two methods):
+
+- `name` — human-readable identifier (e.g. `"kiro-cli"`)
+- `check_prerequisites()` — verify the CLI tool is installed and credentials are valid
+- `run(config: AdapterConfig) -> AdapterResult` — execute the AIDLC workflow and return results
+
+**HumanSimulator injection**: The orchestrator constructs a single `HumanSimulator` with the full document context (vision, tech-env, OpenAPI spec) before calling the adapter. It is passed in as `config.simulator`. Adapters access it via `config.simulator.respond(message)` — they do not construct it themselves.
+
+**Simulator gates**: Adapters use `config.simulator` to inject human-reviewer feedback at key workflow stages. The kiro-cli adapter uses four stage gates (requirements → design → code-gen plan → construction); the claude-code-sdk adapter intercepts `handoff_to_simulator` tool calls inline.
+
+**Plugin registration**: Custom adapters can be added without modifying framework code — see [Adding a New CLI Adapter](#adding-a-new-cli-adapter) below.
+
+Supported built-in adapters: `claude-code`, `claude-code-sdk`, `kiro-cli`.
+
 ---
 
 ## 7. Data Flow: YAML Artifact Graph
@@ -633,6 +663,94 @@
 1. Create `config/<model-name>.yaml` with `models.executor.model_id` set to the Bedrock model ID
 2. The batch runner will automatically discover it
 
+### Adding a New CLI Adapter
+
+CLI adapters live in `packages/cli-harness` and follow a plugin pattern — no framework code changes are needed.
+
+**Step 1 — Implement the adapter**
+
+Create a module anywhere importable (e.g. `packages/cli-harness/src/cli_harness/adapters/my_tool.py`):
+
+```python
+from cli_harness.adapter import AdapterConfig, AdapterResult, CLIAdapter
+
+class MyToolAdapter(CLIAdapter):
+    @property
+    def name(self) -> str:
+        return "my-tool"
+
+    def check_prerequisites(self) -> tuple[bool, str]:
+        import shutil
+        if not shutil.which("my-tool"):
+            return False, "'my-tool' not found in PATH"
+        return True, "my-tool found"
+
+    def run(self, config: AdapterConfig) -> AdapterResult:
+        import time, shutil
+        from cli_harness.normalizer import normalize_output
+
+        start = time.monotonic()
+        workspace = config.output_dir / "workspace"
+        workspace.mkdir(parents=True, exist_ok=True)
+
+        # Copy inputs, inject rules, run the CLI tool...
+        # Use config.simulator.respond(message) at review gates.
+        simulator = config.simulator  # pre-built with vision/tech_env/openapi context
+        if simulator is None:
+            raise RuntimeError("my-tool requires a simulator (set --simulator-model)")
+
+        # ... run CLI tool stages, call simulator.respond() between stages ...
+
+        elapsed = time.monotonic() - start
+        normalize_output(
+            source_dir=workspace,
+            output_dir=config.output_dir,
+            adapter_name=self.name,
+            elapsed_seconds=elapsed,
+        )
+        dst_docs = config.output_dir / "aidlc-docs"
+        return AdapterResult(
+            success=dst_docs.is_dir(),
+            output_dir=config.output_dir,
+            aidlc_docs_dir=dst_docs if dst_docs.is_dir() else None,
+            workspace_dir=workspace,
+            elapsed_seconds=elapsed,
+        )
+```
+
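The elided "run CLI tool stages, call simulator.respond() between stages" step in the adapter above might be structured as the loop below. This is a sketch under assumptions: the stage names are taken from the kiro-cli description in section 6.5, `respond(message)` is assumed to return the reviewer's reply as a string, and `StubSimulator`, `run_with_gates`, and `fake_cli_stage` are hypothetical helpers that exist only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# Stage order as described for the kiro-cli adapter.
STAGES = ["requirements", "design", "code-gen plan", "construction"]


@dataclass
class StubSimulator:
    """Stand-in for HumanSimulator; only respond(message) is taken from the docs."""

    calls: list[str] = field(default_factory=list)

    def respond(self, message: str) -> str:
        self.calls.append(message)
        return f"Approved after reviewing {len(message)} characters."


def run_with_gates(
    run_stage: Callable[[str, str | None], str],
    simulator: StubSimulator,
) -> list[tuple[str, str]]:
    """Run each stage, send its output to the simulator, and feed the reply forward."""
    transcript: list[tuple[str, str]] = []
    feedback: str | None = None
    for stage in STAGES:
        output = run_stage(stage, feedback)
        feedback = simulator.respond(f"[{stage}] {output}")
        transcript.append((stage, feedback))
    return transcript


def fake_cli_stage(stage: str, feedback: str | None) -> str:
    """Pretend to invoke the CLI tool for one stage."""
    return f"{stage} output (prior feedback: {feedback or 'none'})"


sim = StubSimulator()
transcript = run_with_gates(fake_cli_stage, sim)
```

In a real adapter, `run_stage` would wrap the CLI invocation, and `simulator` would be `config.simulator` rather than a stub.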
+**Step 2 — Register in config** (no framework edits needed)
+
+Add one entry under `cli.adapters` in `config/default.yaml` (or your own config file):
+
+```yaml
+cli:
+  adapters:
+    my-tool: "cli_harness.adapters.my_tool.MyToolAdapter"
+```
+
+**Step 3 — Verify**
+
+```bash
+# Confirm it appears
+uv run python run.py cli --list
+
+# Check prerequisites
+uv run python run.py cli --cli my-tool --check-only
+
+# Run evaluation
+uv run python run.py cli --cli my-tool --scenario sci-calc
+```
+
+**Key contracts for adapter implementors:**
+
+| What | Where | Notes |
+|---|---|---|
+| Abstract base | `cli_harness/adapter.py` — `CLIAdapter` | Implement `name`, `check_prerequisites`, `run` |
+| Simulator | `config.simulator` (`HumanSimulator`) | Call `.respond(message)` at review gates; never construct it yourself |
+| Output layout | `cli_harness/normalizer.py` — `normalize_output()` | Call at end of `run()` to produce standard `run-meta.yaml` / `run-metrics.yaml` |
+| Post-run tests | `aidlc_runner.post_run.run_post_evaluation()` | Optional; call after `normalize_output()` to run generated project tests |
+| Document context | `config.vision_path`, `config.tech_env_path`, `config.openapi_content` | Available if you need them; simulator already has this context |
+
 ### Adding a New IDE Adapter
 
 1. Create `packages/ide-harness/src/ide_harness/adapters/<name>.py`

@@ -36,11 +36,14 @@ git checkout -b feature/your-feature-name
 
 Work in the appropriate package:
 
-- `aidlc-runner/` - Execution Framework (two-agent AIDLC workflow runner)
+- `packages/execution/` - Execution Framework (two-agent AIDLC workflow runner)
+- `packages/cli-harness/` - CLI Adapter Framework (Claude Code, Kiro CLI, custom tools)
+- `packages/ide-harness/` - IDE Adapter Framework (Cursor, Cline, Kiro, etc.)
 - `packages/qualitative/` - Semantic Evaluation (intent & design similarity scoring)
 - `packages/quantitative/` - Code Evaluation (linting, security, organization)
 - `packages/nonfunctional/` - NFR Evaluation (tokens, timing, consistency)
 - `packages/reporting/` - Report generation
+- `packages/trend-reports/` - Cross-release trend reporting
 - `packages/shared/` - Common utilities
 
 Or contribute to other work streams:
@@ -96,14 +99,17 @@ git commit -m "Add token tracking to nonfunctional package"
 
 The project is organized around six big rocks. Your changes will typically fall into one or more of these:
 
-| Work Stream             | Description                                   | Package / Area            |
-| ----------------------- | --------------------------------------------- | ------------------------- |
-| **Golden Test Case**    | Curated baseline test inputs                  | `test_cases/`             |
-| **Execution Framework** | Two-agent AIDLC workflow runner (Owner: Jeff) | `aidlc-runner/`           |
-| **Semantic Evaluation** | Intent & design similarity scoring            | `packages/qualitative/`   |
-| **Code Evaluation**     | Linting, security, organization               | `packages/quantitative/`  |
-| **NFR Evaluation**      | Tokens, timing, consistency                   | `packages/nonfunctional/` |
-| **GitHub CI/CD**        | Pipeline integration & management             | `.github/workflows/`      |
+| Work Stream             | Description                                   | Package / Area               |
+| ----------------------- | --------------------------------------------- | ---------------------------- |
+| **Golden Test Case**    | Curated baseline test inputs                  | `test_cases/`                |
+| **Execution Framework** | Two-agent AIDLC workflow runner               | `packages/execution/`        |
+| **CLI Adapters**        | CLI tool integrations (Claude Code, Kiro CLI) | `packages/cli-harness/`      |
+| **IDE Adapters**        | IDE tool integrations (Cursor, Cline, etc.)   | `packages/ide-harness/`      |
+| **Semantic Evaluation** | Intent & design similarity scoring            | `packages/qualitative/`      |
+| **Code Evaluation**     | Linting, security, organization               | `packages/quantitative/`     |
+| **NFR Evaluation**      | Tokens, timing, consistency                   | `packages/nonfunctional/`    |
+| **Trend Reporting**     | Cross-release metric tracking                 | `packages/trend-reports/`    |
+| **GitHub CI/CD**        | Pipeline integration & management             | `.github/workflows/`         |
 
 ## Code Standards
 

@@ -41,3 +41,6 @@ execution:
 
 tools:
   pmd_path: null  # Path to PMD executable; if null, looks for 'pmd' on PATH
+
+cli:
+  adapters: {}  # Register custom CLI adapters: name: "mypackage.MyAdapter"
@@ -1,6 +1,6 @@
 # Multi-language sandbox image for running AI-generated code in isolation.
 #
-# Includes Python 3.14 + uv, Node.js 22 + npm, and common build tools.
+# Includes Python 3.13 + uv, Node.js 22 + npm, and common build tools.
 # Runs as a non-root user with no credentials or host tools.
 #
 # Security notes:
@@ -9,7 +9,7 @@
 
 # checkov:skip=CKV_DOCKER_2:HEALTHCHECK not needed for ephemeral test sandbox
 # nosemgrep: dockerfile-source-not-pinned
-FROM public.ecr.aws/docker/library/python:3.14-slim@sha256:3989a23fd2c28a34c7be819e488b958a10601d421ac25bea1e7a5d757365e2d5 AS base
+FROM public.ecr.aws/docker/library/python:3.13-slim@sha256:8922791069fdfdd6056cf7f418a8655d970862d1972570d4c0e78dfc43afacd6 AS base
 
 # Install system dependencies and Node.js 22
 # nosemgrep: set-pipefail

@@ -5,6 +5,8 @@
 requires-python = ">=3.13"
 dependencies = [
     "pyyaml>=6.0",
+    "anthropic[bedrock]>=0.40",
+    "boto3>=1.42.47",
 ]
 
 [project.optional-dependencies]

@@ -5,6 +5,10 @@
 from abc import ABC, abstractmethod
 from dataclasses import dataclass, field
 from pathlib import Path
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    from cli_harness.simulator import HumanSimulator
 
 
 @dataclass
@@ -17,7 +21,11 @@ class AdapterConfig:
     tech_env_path: Path | None = None
     prompt_template: str | None = None
     model: str | None = None
+    simulator_model: str | None = None  # kept for backwards compat; prefer simulator field
     aws_profile: str | None = None
+    aws_region: str | None = None
+    openapi_content: str | None = None  # injected into prompt/simulator for contract validation
+    simulator: "HumanSimulator | None" = None  # pre-built by orchestrator; shared across adapters
     timeout_seconds: int = 7200  # 2 hours max
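The `TYPE_CHECKING` guard combined with the quoted `"HumanSimulator | None"` annotation lets the dataclass name the simulator type without importing the simulator module at runtime, presumably to keep `adapter.py` lightweight and free of import cycles. The following is a minimal sketch of that pattern; `AdapterConfigSketch` is a hypothetical, trimmed-down stand-in, not the real `AdapterConfig`.

```python
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Followed only by static type checkers; skipped when the module actually runs.
    from cli_harness.simulator import HumanSimulator


@dataclass
class AdapterConfigSketch:
    """Hypothetical subset of AdapterConfig, for illustration only."""

    output_dir: Path
    # The quoted annotation is stored as a string and never evaluated by the dataclass.
    simulator: "HumanSimulator | None" = None
    timeout_seconds: int = 7200


config = AdapterConfigSketch(output_dir=Path("runs/demo"))
simulator_module_loaded = "cli_harness.simulator" in sys.modules
```

Because the import is skipped at runtime, this snippet runs even where `cli_harness` is not installed, while type checkers still see the real class.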
 
 

@@ -128,8 +128,11 @@ def run(self, config: AdapterConfig) -> AdapterResult:
                 shutil.copy2(rules_path, rules_dir / rules_path.name)
                 _log(f"Copied AIDLC rules file: {rules_path.name}")
 
-            # Build the prompt
-            prompt = config.prompt_template or render_prompt()
+            # Build the prompt — inject OpenAPI spec so the self-approving executor
+            # has the full contract in view during design and code review.
+            prompt = config.prompt_template or render_prompt(
+                openapi_content=config.openapi_content,
+            )
 
             # Build command — claude -p for non-interactive print mode
             cmd = [