Changes from 35 commits
37 commits
0d04c82
feat(evaluator): parity sync + Claude SDK adapter with interactive si…
harmjeff Apr 29, 2026
3bbc71a
fix(evaluator): support podman as docker fallback, skip sandbox check…
harmjeff Apr 29, 2026
2875de9
fix(evaluator): container runtime detection + deterministic run folde…
harmjeff Apr 29, 2026
872c088
fix(evaluator): add aidlc-cli-harness to root workspace dependencies
harmjeff Apr 29, 2026
1213909
fix(evaluator): runner Mode 1, sdk adapter bugs, orchestrator script …
harmjeff Apr 29, 2026
b25fecf
fix(sdk-adapter): add post-run test evaluation (stage 2)
harmjeff Apr 29, 2026
e0e08e9
feat(evaluator): inject OpenAPI contract into all human-analog agents
harmjeff Apr 29, 2026
35fc0be
feat(evaluator): shared HumanSimulator + kiro-cli human analog review
harmjeff Apr 29, 2026
ebbbc1d
fix(evaluator): wire simulator model from config into CLI adapter runs
harmjeff Apr 29, 2026
b13a0b6
feat(evaluator): Strands swarm uses shared HumanSimulator for all modes
harmjeff Apr 29, 2026
ed75bbc
fix(evaluator): separate executor and simulator token tracking for fu…
harmjeff Apr 29, 2026
34a85ca
fix(sdk-adapter): import _get_container_cli from shared.sandbox
harmjeff Apr 29, 2026
6501afc
fix(kiro-adapter): add post-run test evaluation (stage 2)
harmjeff Apr 29, 2026
45f6b4e
fix(sandbox): pin base image to Python 3.13 instead of 3.14
harmjeff Apr 29, 2026
6fe814b
fix(kiro-adapter): interactive stdin mode for genuine simulator gate …
harmjeff Apr 29, 2026
703414b
fix(kiro-adapter): replace select.select with thread-based reader
harmjeff Apr 30, 2026
025108a
fix(kiro-adapter): check construction docs after every chunk, not jus…
harmjeff Apr 30, 2026
56677d6
fix(kiro-adapter): kill process on completion instead of /quit + hand…
harmjeff Apr 30, 2026
89f5ea4
feat(kiro-adapter): print kiro agent turns to stderr
harmjeff Apr 30, 2026
9bada2f
fix(kiro-adapter): fix output filter — accumulate chars into lines, t…
harmjeff Apr 30, 2026
c7a7d09
fix(kiro-adapter): two-phase --resume approach for genuine simulator …
harmjeff Apr 30, 2026
f43e6a2
feat(kiro-adapter): per-stage simulator gates (4 review points)
harmjeff Apr 30, 2026
54a6ba9
refactor(cli-harness): plugin adapter registration + shared HumanSimu…
harmjeff Apr 30, 2026
5b2eb4d
docs(evaluator): add CLI adapter developer guide to ARCHITECTURE.md
harmjeff Apr 30, 2026
8e5de73
fix(security): inline nosemgrep suppressions for subprocess audit fin…
harmjeff Apr 30, 2026
a8c12bf
fix(security): add gitleaks allowlist for credential scrubber test fi…
harmjeff Apr 30, 2026
a603804
fix(docs): fix MD060 table alignment in ARCHITECTURE.md + gitleaks al…
harmjeff Apr 30, 2026
e00b731
fix(security): use fully-qualified semgrep rule IDs for suppressions
harmjeff Apr 30, 2026
b1cab54
fix(security): move nosemgrep to same line as subprocess call
harmjeff May 1, 2026
3e20c0a
docs: add CLAUDE.md with scan commands and semgrep suppression guidance
harmjeff May 1, 2026
2c41fb4
Revert "docs: add CLAUDE.md with scan commands and semgrep suppressio…
harmjeff May 1, 2026
c0922ea
fix(security): fix nosemgrep suppression format for subprocess calls
harmjeff May 5, 2026
14d96cd
fix(security): test inline nosemgrep comment style in run_git_compare.py
harmjeff May 5, 2026
1c8722a
Merge branch 'main' into fix/evaluator-update
harmjeff May 5, 2026
631ebf3
fix(security): update nosemgrep suppression in run_git_compare.py
harmjeff May 5, 2026
2c24b67
fix(security): test full semgrep rule ID suppression in run_git_compa…
harmjeff May 5, 2026
28ea9b6
fix(security): update subprocess suppression in run_git_compare.py
harmjeff May 5, 2026
8 changes: 8 additions & 0 deletions scripts/aidlc-evaluator/.gitleaks.toml
@@ -0,0 +1,8 @@
# Gitleaks configuration for aidlc-evaluator
# Suppress false positives from test fixtures that intentionally contain fake credentials.

[allowlist]
description = "Fake credentials used in test_credential_scrubber.py test fixtures"
paths = [
"packages/shared/tests/test_credential_scrubber.py",
]
118 changes: 118 additions & 0 deletions scripts/aidlc-evaluator/ARCHITECTURE.md
@@ -460,6 +460,36 @@

Supported adapters: Cursor, Cline, Copilot, Kiro, Windsurf, Antigravity.

### 6.5 CLI Evaluation (`run_cli_evaluation.py`)

Runs the AIDLC workflow through CLI-based AI assistants (Claude Code, Kiro CLI, etc.):

```text
load_adapters_from_config(cfg_data) ← register any custom adapters from config.yaml
get_adapter(name) ← lazy import from registry
├── check_prerequisites()
├── HumanSimulator built once by orchestrator (vision + tech_env + openapi injected)
├── adapter.run(config) ──► CLI-specific automation + simulator gate reviews
├── normalize_output() ──► standard run folder layout
└── run_evaluation.py --evaluate-only ──► stages 2-6
```

**Adapter pattern**: Each CLI tool is implemented as a subclass of `CLIAdapter` (`packages/cli-harness/src/cli_harness/adapter.py`) with three members (one property and two methods):

- `name` — human-readable identifier (e.g. `"kiro-cli"`)
- `check_prerequisites()` — verify the CLI tool is installed and credentials are valid
- `run(config: AdapterConfig) -> AdapterResult` — execute the AIDLC workflow and return results
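
The contract in the list above could look roughly like the following. This is a minimal sketch, not the actual contents of `adapter.py`: `SketchCLIAdapter`, `SketchConfig`, `SketchResult`, and `EchoAdapter` are hypothetical names, and the two dataclasses carry only a few of the fields the real `AdapterConfig` and `AdapterResult` appear to have.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SketchConfig:
    """Hypothetical stand-in for AdapterConfig (only the field used here)."""
    output_dir: Path


@dataclass
class SketchResult:
    """Hypothetical stand-in for AdapterResult."""
    success: bool
    output_dir: Path
    elapsed_seconds: float = 0.0


class SketchCLIAdapter(ABC):
    """Illustrative base class exposing the three members described above."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Human-readable identifier, e.g. "kiro-cli"."""

    @abstractmethod
    def check_prerequisites(self) -> tuple[bool, str]:
        """Return (ok, message) describing whether the CLI tool is usable."""

    @abstractmethod
    def run(self, config: SketchConfig) -> SketchResult:
        """Execute the workflow and report where the output landed."""


class EchoAdapter(SketchCLIAdapter):
    """Trivial adapter that does no real work; it only satisfies the contract."""

    @property
    def name(self) -> str:
        return "echo"

    def check_prerequisites(self) -> tuple[bool, str]:
        return True, "nothing to check"

    def run(self, config: SketchConfig) -> SketchResult:
        return SketchResult(success=True, output_dir=config.output_dir)
```

Because every member is abstract, instantiating the base class directly raises `TypeError`, which surfaces an incomplete adapter at construction time rather than midway through a run.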

**HumanSimulator injection**: The orchestrator constructs a single `HumanSimulator` with the full document context (vision, tech-env, OpenAPI spec) before calling the adapter. It is passed in as `config.simulator`. Adapters access it via `config.simulator.respond(message)` — they do not construct it themselves.

**Simulator gates**: Adapters use `config.simulator` to inject human-reviewer feedback at key workflow stages. The kiro-cli adapter uses 4 stage gates (requirements → design → code-gen plan → construction); the claude-code-sdk adapter intercepts `handoff_to_simulator` tool calls inline.
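
The gate idea can be illustrated with a short sketch. It assumes only what is stated above, namely that the injected simulator exposes `respond(message)`. `StubSimulator`, `run_with_gates`, and `fake_stage` are hypothetical, and feeding each reply into the next stage is an assumption about how an adapter might use the feedback, not a description of the kiro-cli adapter's internals.

```python
from typing import Callable, Protocol


class Responder(Protocol):
    """Anything with a respond(message) method, like the injected simulator."""

    def respond(self, message: str) -> str: ...


class StubSimulator:
    """Hypothetical stand-in for HumanSimulator; records what it was asked."""

    def __init__(self) -> None:
        self.seen: list[str] = []

    def respond(self, message: str) -> str:
        self.seen.append(message)
        return f"approved: {message}"


# The four kiro-cli review points named above.
STAGES = ["requirements", "design", "code-gen plan", "construction"]


def run_with_gates(
    simulator: Responder,
    run_stage: Callable[[str, str], str],
) -> list[str]:
    """Run each stage, then pass the reviewer's reply on to the next stage."""
    feedback = ""
    transcript: list[str] = []
    for stage in STAGES:
        output = run_stage(stage, feedback)   # CLI-specific work goes here
        feedback = simulator.respond(output)  # gate: simulated human review
        transcript.append(feedback)
    return transcript


def fake_stage(stage: str, feedback: str) -> str:
    """Hypothetical stage runner; a real adapter would drive the CLI tool."""
    return f"{stage} draft"
```

Calling `run_with_gates(StubSimulator(), fake_stage)` yields one reviewer reply per stage. Typing the simulator as a `Protocol` keeps the sketch independent of the real `HumanSimulator` class.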

**Plugin registration**: Custom adapters can be added without modifying framework code — see [Adding a New CLI Adapter](#adding-a-new-cli-adapter) below.

Supported built-in adapters: `claude-code`, `claude-code-sdk`, `kiro-cli`.

---

## 7. Data Flow: YAML Artifact Graph
@@ -633,6 +663,94 @@
1. Create `config/<model-name>.yaml` with `models.executor.model_id` set to the Bedrock model ID
2. The batch runner will automatically discover it

### Adding a New CLI Adapter

CLI adapters live in `packages/cli-harness` and follow a plugin pattern — no framework code changes are needed.

**Step 1 — Implement the adapter**

Create a module anywhere importable (e.g. `packages/cli-harness/src/cli_harness/adapters/my_tool.py`):

```python
from cli_harness.adapter import AdapterConfig, AdapterResult, CLIAdapter

class MyToolAdapter(CLIAdapter):
    @property
    def name(self) -> str:
        return "my-tool"

    def check_prerequisites(self) -> tuple[bool, str]:
        import shutil
        if not shutil.which("my-tool"):
            return False, "'my-tool' not found in PATH"
        return True, "my-tool found"

    def run(self, config: AdapterConfig) -> AdapterResult:
        import time, shutil
        from cli_harness.normalizer import normalize_output

        start = time.monotonic()
        workspace = config.output_dir / "workspace"
        workspace.mkdir(parents=True, exist_ok=True)

        # Copy inputs, inject rules, run the CLI tool...
        # Use config.simulator.respond(message) at review gates.
        simulator = config.simulator  # pre-built with vision/tech_env/openapi context
        if simulator is None:
            raise RuntimeError("my-tool requires a simulator (set --simulator-model)")

        # ... run CLI tool stages, call simulator.respond() between stages ...

        elapsed = time.monotonic() - start
        normalize_output(
            source_dir=workspace,
            output_dir=config.output_dir,
            adapter_name=self.name,
            elapsed_seconds=elapsed,
        )
        dst_docs = config.output_dir / "aidlc-docs"
        return AdapterResult(
            success=dst_docs.is_dir(),
            output_dir=config.output_dir,
            aidlc_docs_dir=dst_docs if dst_docs.is_dir() else None,
            workspace_dir=workspace,
            elapsed_seconds=elapsed,
        )
```

**Step 2 — Register in config** (no framework edits needed)

Add one line to `config/default.yaml` (or your own config file):

```yaml
cli:
  adapters:
    my-tool: "cli_harness.adapters.my_tool.MyToolAdapter"
```

**Step 3 — Verify**

```bash
# Confirm it appears
uv run python run.py cli --list

# Check prerequisites
uv run python run.py cli --cli my-tool --check-only

# Run evaluation
uv run python run.py cli --cli my-tool --scenario sci-calc
```

**Key contracts for adapter implementors:**

| What | Where | Notes |
| ---------------- | ----------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| Abstract base | `cli_harness/adapter.py` - `CLIAdapter` | Implement `name`, `check_prerequisites`, `run` |
| Simulator | `config.simulator` (`HumanSimulator`) | Call `.respond(message)` at review gates; never construct it yourself |
| Output layout | `cli_harness/normalizer.py` (`normalize_output()`) | Call at end of `run()` to write `run-meta.yaml` / `run-metrics.yaml` |
| Post-run tests | `aidlc_runner.post_run.run_post_evaluation()` | Optional; call after `normalize_output()` to run generated project tests |
| Document context | `config.vision_path`, `config.tech_env_path`, `config.openapi_content` | Available if needed; simulator already has this context |

### Adding a New IDE Adapter

1. Create `packages/ide-harness/src/ide_harness/adapters/<name>.py`
24 changes: 15 additions & 9 deletions scripts/aidlc-evaluator/CONTRIBUTING.md
@@ -36,11 +36,14 @@ git checkout -b feature/your-feature-name

Work in the appropriate package:

- `aidlc-runner/` - Execution Framework (two-agent AIDLC workflow runner)
- `packages/execution/` - Execution Framework (two-agent AIDLC workflow runner)
- `packages/cli-harness/` - CLI Adapter Framework (Claude Code, Kiro CLI, custom tools)
- `packages/ide-harness/` - IDE Adapter Framework (Cursor, Cline, Kiro, etc.)
- `packages/qualitative/` - Semantic Evaluation (intent & design similarity scoring)
- `packages/quantitative/` - Code Evaluation (linting, security, organization)
- `packages/nonfunctional/` - NFR Evaluation (tokens, timing, consistency)
- `packages/reporting/` - Report generation
- `packages/trend-reports/` - Cross-release trend reporting
- `packages/shared/` - Common utilities

Or contribute to other work streams:
@@ -96,14 +99,17 @@ git commit -m "Add token tracking to nonfunctional package"

The project is organized around six big rocks. Your changes will typically fall into one or more of these:

-| Work Stream | Description | Package / Area |
-| ----------------------- | --------------------------------------------- | ------------------------- |
-| **Golden Test Case** | Curated baseline test inputs | `test_cases/` |
-| **Execution Framework** | Two-agent AIDLC workflow runner (Owner: Jeff) | `aidlc-runner/` |
-| **Semantic Evaluation** | Intent & design similarity scoring | `packages/qualitative/` |
-| **Code Evaluation** | Linting, security, organization | `packages/quantitative/` |
-| **NFR Evaluation** | Tokens, timing, consistency | `packages/nonfunctional/` |
-| **GitHub CI/CD** | Pipeline integration & management | `.github/workflows/` |
+| Work Stream | Description | Package / Area |
+| ----------------------- | --------------------------------------------- | ---------------------------- |
+| **Golden Test Case** | Curated baseline test inputs | `test_cases/` |
+| **Execution Framework** | Two-agent AIDLC workflow runner | `packages/execution/` |
+| **CLI Adapters** | CLI tool integrations (Claude Code, Kiro CLI) | `packages/cli-harness/` |
+| **IDE Adapters** | IDE tool integrations (Cursor, Cline, etc.) | `packages/ide-harness/` |
+| **Semantic Evaluation** | Intent & design similarity scoring | `packages/qualitative/` |
+| **Code Evaluation** | Linting, security, organization | `packages/quantitative/` |
+| **NFR Evaluation** | Tokens, timing, consistency | `packages/nonfunctional/` |
+| **Trend Reporting** | Cross-release metric tracking | `packages/trend-reports/` |
+| **GitHub CI/CD** | Pipeline integration & management | `.github/workflows/` |

## Code Standards

3 changes: 3 additions & 0 deletions scripts/aidlc-evaluator/config/default.yaml
@@ -41,3 +41,6 @@ execution:

tools:
  pmd_path: null # Path to PMD executable; if null, looks for 'pmd' on PATH

cli:
  adapters: {} # Register custom CLI adapters: name: "mypackage.MyAdapter"
4 changes: 2 additions & 2 deletions scripts/aidlc-evaluator/docker/sandbox/Dockerfile
@@ -1,6 +1,6 @@
# Multi-language sandbox image for running AI-generated code in isolation.
#
-# Includes Python 3.14 + uv, Node.js 22 + npm, and common build tools.
+# Includes Python 3.13 + uv, Node.js 22 + npm, and common build tools.
Review comment (Member): was there something that caused 3.14 -> 3.13?

# Runs as a non-root user with no credentials or host tools.
#
# Security notes:
@@ -9,7 +9,7 @@

# checkov:skip=CKV_DOCKER_2:HEALTHCHECK not needed for ephemeral test sandbox
# nosemgrep: dockerfile-source-not-pinned
-FROM public.ecr.aws/docker/library/python:3.14-slim@sha256:3989a23fd2c28a34c7be819e488b958a10601d421ac25bea1e7a5d757365e2d5 AS base
+FROM public.ecr.aws/docker/library/python:3.13-slim@sha256:8922791069fdfdd6056cf7f418a8655d970862d1972570d4c0e78dfc43afacd6 AS base

# Install system dependencies and Node.js 22
# nosemgrep: set-pipefail
2 changes: 2 additions & 0 deletions scripts/aidlc-evaluator/packages/cli-harness/pyproject.toml
@@ -5,6 +5,8 @@
requires-python = ">=3.13"
dependencies = [
"pyyaml>=6.0",
"anthropic[bedrock]>=0.40",

"boto3>=1.42.47",
]

[project.optional-dependencies]
@@ -5,6 +5,10 @@
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from cli_harness.simulator import HumanSimulator


@dataclass
@@ -17,7 +21,11 @@ class AdapterConfig:
    tech_env_path: Path | None = None
    prompt_template: str | None = None
    model: str | None = None
    simulator_model: str | None = None  # kept for backwards compat; prefer simulator field
    aws_profile: str | None = None
    aws_region: str | None = None
    openapi_content: str | None = None  # injected into prompt/simulator for contract validation
    simulator: "HumanSimulator | None" = None  # pre-built by orchestrator; shared across adapters
    timeout_seconds: int = 7200  # 2 hours max


@@ -128,8 +128,11 @@ def run(self, config: AdapterConfig) -> AdapterResult:
shutil.copy2(rules_path, rules_dir / rules_path.name)
_log(f"Copied AIDLC rules file: {rules_path.name}")

-# Build the prompt
-prompt = config.prompt_template or render_prompt()
+# Build the prompt — inject OpenAPI spec so the self-approving executor
+# has the full contract in view during design and code review.
+prompt = config.prompt_template or render_prompt(
+    openapi_content=config.openapi_content,
+)

# Build command — claude -p for non-interactive print mode
cmd = [