Pentest Executor

AI-driven penetration testing execution environment. A Kali Linux Docker container runs an MCP server exposing pentest tools. A host-side TypeScript agent uses Claude to generate PoC scripts or run fully autonomous multi-turn attack loops — planning, installing tools, learning usage, executing scans, and recovering from errors automatically.

Architecture

┌─────────────────────────┐        MCP (Streamable HTTP)        ┌─────────────────────────┐
│       Host Machine      │ ──────────── :3001 ────────────────> │   Kali Docker Container │
│                         │                                      │                         │
│  src/index.ts (CLI)     │                                      │  server.py (MCP Server) │
│  src/executor.ts        │  execute_shell_cmd / write_file      │  - execute_shell_cmd    │
│  src/mcp-client.ts      │  execute_script / manage_packages    │  - write_file           │
│  src/skills.ts          │ <────────── results ────────────────>│  - execute_script       │
│                         │                                      │  - manage_packages      │
│  Claude API (Anthropic) │                                      │                         │
│                         │                                      │  nmap, nikto, hydra,    │
│  skills/*.md (library)  │                                      │  sqlmap, gobuster,      │
│                         │                                      │  whatweb, netcat, python │
└─────────────────────────┘                                      └─────────────────────────┘

Project Structure

pentest-executor/
├── docker-compose.yml          # Kali service with volume mounts + apt cache volumes
├── .env.example                # Environment variable template
├── package.json                # Node.js project (ES modules)
├── tsconfig.json               # TypeScript configuration
├── kali/
│   ├── Dockerfile              # Kali Linux + pentest tools + Python MCP server
│   ├── server.py               # MCP server (4 tools, Streamable HTTP, port 3001)
│   └── requirements.txt        # mcp[http], uvicorn, requests, pwntools
├── src/
│   ├── index.ts                # CLI entry point (interactive menu, 7 options)
│   ├── executor.ts             # Core engine: single-shot generation + strategy-selectable agentic loop
│   ├── mcp-client.ts           # MCP client wrapper (Streamable HTTP transport)
│   ├── skills.ts               # Skill library: list, read, save, index skills
│   ├── types.ts                # TypeScript interfaces (TacticalPlan, AgentResult, etc.)
│   └── utils/
│       └── instrumentation.ts  # OpenTelemetry SDK + Langfuse span processor init
├── skills/                     # Persistent skill library (.md files with YAML frontmatter)
│   ├── wpscan.md               # WordPress scanner usage guide
│   └── github-search.md        # GitHub API PoC search patterns and security audit checklist
├── test/
│   └── test-wpscan-ooda.ts     # OODA loop functional test (mock agent + assertions)
├── logs/                       # Volume-mounted — persisted reports
└── scripts/                    # Volume-mounted — persisted PoC scripts

Quick Start

Prerequisites

Docker & Docker Compose
Node.js >= 18
An Anthropic API key

1. Install Node.js Dependencies

npm install

2. Configure Environment

cp .env.example .env

Edit .env and set your ANTHROPIC_API_KEY.

3. Build & Start the Kali Container

# Build the image (first time or after Dockerfile changes)
docker compose build

# Start the container in the background
docker compose up -d

This will:

Pull the kalilinux/kali-rolling base image
Install pentest tools: nmap, nikto, hydra, sqlmap, gobuster, whatweb, netcat
Install Python dependencies: mcp[http], uvicorn, requests, pwntools
Start the MCP server listening on port 3001

4. Verify the MCP Server is Running

# Check container status
docker compose ps

# Check server logs
docker compose logs -f kali

# Quick health check — should return 405 (Method Not Allowed = server is up)
curl -s -o /dev/null -w "%{http_code}" http://localhost:3001/mcp

5. Run the CLI

npm run start
# or directly:
npx tsx src/index.ts

6. Run the Test Suite

npx tsx test/test-wpscan-ooda.ts

Container Management

# Stop the container
docker compose down

# Rebuild after changing Dockerfile or server.py
docker compose up -d --build

# Shell into the running container
docker exec -it kali-pentest /bin/bash

# View real-time server logs
docker compose logs -f kali

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `ANTHROPIC_API_KEY` | (required) | Anthropic API key for Claude |
| `MCP_SERVER_URL` | `http://localhost:3001` | MCP server endpoint |
| `CLAUDE_MODEL` | `claude-sonnet-4-5-20250929` | Claude model to use |
| `LANGFUSE_SECRET_KEY` | (optional) | Langfuse secret key; enables tracing when set |
| `LANGFUSE_PUBLIC_KEY` | (optional) | Langfuse public key; enables tracing when set |
| `LANGFUSE_BASE_URL` | `https://cloud.langfuse.com` | Langfuse API endpoint |
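
How these variables resolve can be sketched as below. This is an illustration in Python for brevity (the host agent itself is TypeScript); `Config` and `load_config` are hypothetical names, and treating tracing as enabled only when both Langfuse keys are present is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Config:
    anthropic_api_key: str
    mcp_server_url: str
    claude_model: str
    langfuse_base_url: str
    tracing_enabled: bool


def load_config(env: Optional[Mapping[str, str]] = None) -> Config:
    """Resolve settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY", "")
    if not api_key:
        raise ValueError("ANTHROPIC_API_KEY is required")
    # Assumption: tracing is switched on only when both Langfuse keys are set.
    tracing = bool(env.get("LANGFUSE_SECRET_KEY")) and bool(env.get("LANGFUSE_PUBLIC_KEY"))
    return Config(
        anthropic_api_key=api_key,
        mcp_server_url=env.get("MCP_SERVER_URL", "http://localhost:3001"),
        claude_model=env.get("CLAUDE_MODEL", "claude-sonnet-4-5-20250929"),
        langfuse_base_url=env.get("LANGFUSE_BASE_URL", "https://cloud.langfuse.com"),
        tracing_enabled=tracing,
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check without touching the real environment.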

Langfuse Observability

All execution paths are instrumented with Langfuse via OpenTelemetry for full observability of every Claude API call, tool dispatch, and agentic turn.

Setup

Create a Langfuse account at cloud.langfuse.com (or self-host)
Add your keys to .env:

LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_BASE_URL=https://cloud.langfuse.com

Tracing is optional — if the keys are not set, the app runs normally with tracing disabled.

What Gets Traced

Every executor method produces a hierarchical trace:

| Method | Trace Structure |
| --- | --- |
| `generateScript()` | `generate-script`: single span with token usage |
| `autoExecute()` | `auto-execute` → `generate-script` → `write-file` → `execute-script` |
| `executeFinal()` | `execute-final` → `execute-script` |
| `generateScriptFromPlan()` | `generate-script-from-plan`: single span with plan metadata + token usage |
| `autoExecuteFromPlan()` | `auto-execute-from-plan` → `generate-script-from-plan` → `write-file` → `execute-script` |
| `runAgentLoop()` | Full hierarchical trace (see below) |

Agentic Loop Trace Hierarchy

The autonomous mode (runAgentLoop) produces the richest traces:

agent-loop (root)
├── turn-1
│   ├── claude-api-call    (input/output tokens, stop_reason)
│   ├── tool-execute_shell_cmd   (args, result preview)
│   └── tool-write_file          (args, result preview)
├── turn-2
│   ├── claude-api-call
│   └── tool-execute_script
├── ...
└── turn-N
    └── claude-api-call    (end_turn)

Each span records:

Input/output data for debugging
Token usage (input + output tokens per API call, cumulative on root)
Tool arguments and result previews for every MCP dispatch
Session ID and tags for filtering in the Langfuse dashboard
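
The way per-call token counts roll up to the root can be illustrated with a small tree model. This sketch does not use the Langfuse or OpenTelemetry APIs; the `Span` class and the token numbers are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    input_tokens: int = 0
    output_tokens: int = 0
    children: List["Span"] = field(default_factory=list)

    def child(self, name, input_tokens=0, output_tokens=0):
        """Create a nested span and return it so further children can be attached."""
        span = Span(name, input_tokens, output_tokens)
        self.children.append(span)
        return span

    def total_tokens(self):
        """Own tokens plus those of every descendant span."""
        own = self.input_tokens + self.output_tokens
        return own + sum(c.total_tokens() for c in self.children)

    def render(self, depth=0):
        """Return the tree as indented lines, one per span."""
        lines = ["  " * depth + self.name]
        for c in self.children:
            lines.extend(c.render(depth + 1))
        return lines


# Illustrative values only.
root = Span("agent-loop")
turn1 = root.child("turn-1")
turn1.child("claude-api-call", input_tokens=1200, output_tokens=300)
turn1.child("tool-execute_shell_cmd")
turn2 = root.child("turn-2")
turn2.child("claude-api-call", input_tokens=1500, output_tokens=200)

print("\n".join(root.render()))
print("cumulative tokens on root:", root.total_tokens())
```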

CLI Options

The interactive menu provides seven top-level options. Option 5 has four sub-strategies.

Options Overview

| Option | Name | Mode | Human Review | Executor Method(s) |
| --- | --- | --- | --- | --- |
| 1 | Generate Script | Single-shot | N/A (generate only) | `generateScript()` |
| 2 | Execute Script | Direct MCP | N/A (execute only) | (direct `mcp.executeScript()`) |
| 3 | Interactive Run | Single-shot | Yes | `generateScript()` → `executeFinal()` |
| 4 | Auto Run | Single-shot | No | `autoExecute()` |
| 5.1 | Tool-based Autonomy | Agentic loop | No | `runAgentWithTacticalPlan(plan, "tool")` → `runAgentLoop()` |
| 5.2 | GitHub PoC Search | Agentic loop | No | `runAgentWithTacticalPlan(plan, "github")` → `runAgentLoop()` |
| 5.3 | Manual Construction | Single-shot | No | `autoExecuteFromPlan()` |
| 5.4 | Interactive Plan | Single-shot | Yes | `generateScriptFromPlan()` → `executeFinal()` |
| 6 | Autonomous Run | Agentic loop | No | `runAgentLoop()` |
| 7 | Exit | - | - | - |

Option 1: Generate Script

Describe a pentest task in natural language. Claude generates a Python exploit script and displays it. You can optionally save it to the Kali container with a custom or auto-generated filename.

Flow: User input → generateScript() → display → optional mcp.writeFile()

Option 2: Execute Script

Run an existing script that is already present in the Kali container's /app/scripts/ directory. Provide the filename and optional command-line arguments.

Flow: Filename + args → mcp.executeScript() → display result

Option 3: Interactive Run (Human-in-the-Loop)

Generate a script, save it locally to .tmp_payload.py for review/editing in your editor, then execute the finalized version after you press Enter.

Flow: User input → generateScript() → save to .tmp_payload.py → user reviews/edits → executeFinal() → display result

Option 4: Auto Run

Fully automated: generate a script with Claude, write it to Kali, and execute it immediately with no human review.

Flow: User input → autoExecute() (generate → write → execute) → display result

Option 5: Load Tactical Plan

Load a JSON Tactical Plan file. Displays a summary with CVE, MITRE ATT&CK ID, confidence score, rationale, and tool. Then select one of four execution strategies:

5.1 Tool-based Autonomy — Agent uses searchsploit, Metasploit, and other Kali tools to exploit the target. Full plan context injected.
5.2 GitHub PoC Search — Agent searches GitHub API for public PoC exploits, clones them, performs security audit, adapts parameters, and executes.
5.3 Manual Construction — Claude generates a Python exploit from the plan context, writes to Kali, and executes immediately (no agent loop).
5.4 Interactive — Claude generates a script from the plan, saves locally for review/edit, then executes after user confirmation.

Option 6: Autonomous Run (Agentic Loop)

Full multi-turn agentic loop from a free-form task description. Claude plans, executes tools, installs missing packages, handles errors, learns tools, saves skill files, and retries autonomously.

Flow: Task + max turns → runAgentLoop() → multi-turn OODA cycle → display summary + report

Option 7: Exit

Gracefully shuts down the MCP connection, flushes Langfuse traces, and exits.

Autonomous Mode (Agentic Loop)

Option 6 enables a multi-turn agentic execution mode where Claude operates in a closed OODA loop:

User Task ──► Claude plans ──► Tool calls ──► Observe results ──► Iterate
                  │                                                   │
                  └────────────── until task complete ────────────────┘
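
The loop in the diagram can be sketched roughly as follows. This is a simplified model in Python for brevity (the real `runAgentLoop()` lives in the TypeScript executor and calls the Claude API); the `model_step` callback, the tuple-based tool request, and the result dictionary are assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A model step returns (tool_name, argument) to request a tool call,
# or None to signal that the task is complete.
ModelStep = Callable[[List[str]], Optional[Tuple[str, str]]]


def run_agent_loop(task: str,
                   model_step: ModelStep,
                   tools: Dict[str, Callable[[str], str]],
                   max_turns: int = 15) -> dict:
    history = [f"task: {task}"]
    for turn in range(1, max_turns + 1):
        request = model_step(history)
        if request is None:
            return {"completed": True, "turns": turn, "history": history}
        name, arg = request
        handler = tools.get(name)
        # Failures are fed back as observations so the next turn can recover.
        try:
            result = handler(arg) if handler else f"error: unknown tool {name}"
        except Exception as exc:
            result = f"error: {exc}"
        history.append(f"{name}({arg}) -> {result}")
    # Turn budget exhausted before the model declared completion.
    return {"completed": False, "turns": max_turns, "history": history}


# Scripted stand-in for the model: one tool call, then done.
script = iter([("execute_shell_cmd", "which nmap"), None])
outcome = run_agent_loop(
    "find nmap",
    lambda history: next(script),
    {"execute_shell_cmd": lambda cmd: "/usr/bin/nmap"},
)
print(outcome["completed"], outcome["turns"])
```

The default of 15 turns mirrors the documented `maxTurns` cap.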

OODA Cycle

Before relying on a tool that may not be installed, the agent follows this protocol automatically:

Observe — Run which <tool> to check availability
Orient — manage_packages(check) to confirm the tool is missing
Decide — manage_packages(install) to install via apt-get
Act — Read --help, save a skill file, then execute the original task
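
The four phases can be sketched as a single function. The tool callbacks are passed in so the example runs without a container; `ensure_tool`, the fake callbacks, and the `INSTALLED` status string are hypothetical, while `MISSING` and `SUCCESS` follow the WordPress example in this README.

```python
from typing import Callable, List


def ensure_tool(tool: str,
                shell: Callable[[str], str],
                manage_packages: Callable[[str, str], str],
                save_skill: Callable[[str, str], None]) -> List[str]:
    """Walk the phases for one tool and return the phases that actually ran."""
    phases = ["observe"]
    if shell(f"which {tool}").strip():      # Observe: binary already on PATH
        return phases
    phases.append("orient")
    if manage_packages("check", tool) != "MISSING":
        return phases
    phases.append("decide")
    if manage_packages("install", tool) != "SUCCESS":
        raise RuntimeError(f"could not install {tool}")
    phases.append("act")
    help_text = shell(f"{tool} --help")     # Act: learn usage, then persist it
    save_skill(tool, help_text)
    return phases


# In-memory fakes standing in for the MCP tools.
installed = set()
skills = {}


def fake_shell(cmd: str) -> str:
    if cmd.startswith("which "):
        name = cmd.split()[1]
        return f"/usr/bin/{name}" if name in installed else ""
    return "usage: wpscan [options]"


def fake_packages(action: str, name: str) -> str:
    if action == "check":
        return "INSTALLED" if name in installed else "MISSING"
    installed.add(name)
    return "SUCCESS"


first_run = ensure_tool("wpscan", fake_shell, fake_packages, skills.__setitem__)
second_run = ensure_tool("wpscan", fake_shell, fake_packages, skills.__setitem__)
print(first_run, second_run)
```

On the second call the tool is already present, so only the Observe phase runs.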

Example: WordPress Scan

User: "Scan http://192.168.1.50 for WordPress vulnerabilities"

Turn 1: execute_shell_cmd("which wpscan")         → not found
Turn 2: manage_packages(check, "wpscan")           → MISSING
Turn 3: manage_packages(install, "wpscan")         → SUCCESS
Turn 4: execute_shell_cmd("wpscan --help | head -80") → learns flags
Turn 5: save_new_skill("wpscan", ...) + execute_shell_cmd("wpscan --url ...") → parallel
Turn 6: Final report with users, plugins, versions, severity ratings

Tactical Plan Strategies

Option 5 loads a JSON Tactical Plan file and presents a strategy selection menu. The plan summary displays CVE ID, MITRE ATT&CK technique, confidence score, rationale tags, and recommended tool before the user selects a strategy.

Strategy 1: Tool-based Autonomy

The agent uses existing Kali tools (Metasploit, searchsploit) to exploit the target. The full Tactical Plan context (all 6 sections: TARGET, VULNERABLE ENDPOINT, PAYLOAD, EXPLOITATION STRATEGY, VULNERABILITY CONTEXT, SUCCESS CRITERIA) is injected into the prompt. The agent follows a 4-phase protocol:

Recon & Tool Selection — searchsploit <CVE>, msfconsole -q -x 'search <CVE>; exit'
Installation & Verification — manage_packages to check/install tools
Configuration & Execution — Non-interactive msfconsole -q -x '...; exit' or standalone exploits
Verification & Reporting — Validate results against success criteria, save to /app/logs/
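
The recon commands in phase 1 can be assembled safely from a CVE identifier, as in this sketch. `recon_commands` is a hypothetical helper, and validating the identifier before interpolating it is a precaution added for the example rather than something this README says the agent does.

```python
import re
import shlex

# CVE identifiers have the form CVE-<year>-<four or more digits>.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


def recon_commands(cve: str) -> list:
    """Return non-interactive shell commands for the recon phase."""
    if not CVE_PATTERN.fullmatch(cve):
        raise ValueError(f"not a CVE identifier: {cve!r}")
    return [
        shlex.join(["searchsploit", cve]),
        # -q suppresses the banner; -x runs the given console commands, then exits.
        shlex.join(["msfconsole", "-q", "-x", f"search {cve}; exit"]),
    ]


for command in recon_commands("CVE-2021-41773"):
    print(command)
# searchsploit CVE-2021-41773
# msfconsole -q -x 'search CVE-2021-41773; exit'
```

Each resulting string is suitable for passing to `execute_shell_cmd`.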

Strategy 2: GitHub PoC Search

The agent searches GitHub for public proof-of-concept exploits via the GitHub Search API. It follows a 4-phase protocol:

GitHub API Search — curl queries to find CVE-related repositories sorted by stars
Clone & Inspect — Shallow clone, README review, mandatory security audit of exploit code
Adapt for Target — Modify IP/port, install dependencies, adjust parameters
Execute & Verify — Run the adapted PoC, validate against success criteria

The agent can load the github-search skill file for detailed API patterns, rate limiting guidance, and security audit checklists.
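
A minimal sketch of phase 1, assuming the agent queries GitHub's repository search endpoint and ranks results by star count. The helper names are hypothetical; the response fields used (`items`, `full_name`, `stargazers_count`, `clone_url`) are those returned by the GitHub REST API. No request is sent here; the response is a hand-written sample.

```python
from urllib.parse import urlencode


def github_search_url(cve: str, per_page: int = 5) -> str:
    """Build a repository search URL sorted by stars, highest first."""
    params = {"q": cve, "sort": "stars", "order": "desc", "per_page": per_page}
    return "https://api.github.com/search/repositories?" + urlencode(params)


def pick_candidates(response: dict, limit: int = 3) -> list:
    """Reduce a search response to (name, stars, clone_url) tuples."""
    items = sorted(response.get("items", []),
                   key=lambda item: item["stargazers_count"], reverse=True)
    return [(i["full_name"], i["stargazers_count"], i["clone_url"]) for i in items[:limit]]


# Hand-written sample shaped like a search response (not real repositories).
sample = {"items": [
    {"full_name": "example/poc-a", "stargazers_count": 12,
     "clone_url": "https://github.com/example/poc-a.git"},
    {"full_name": "example/poc-b", "stargazers_count": 340,
     "clone_url": "https://github.com/example/poc-b.git"},
]}

print(github_search_url("CVE-2021-41773"))
print(pick_candidates(sample))
```

Star count is only a ranking signal; the mandatory code audit in phase 2 still applies to whatever is cloned.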

Strategy 3: Manual Construction

Single-shot AI code generation using autoExecuteFromPlan(). Claude generates a complete Python exploit script from the plan context, writes it to Kali, and executes immediately with no human review.

Strategy 4: Interactive

Human-in-the-loop workflow. Claude generates a script from the plan, saves it to .tmp_payload.py for the user to review and edit, then executes the finalized version after user confirmation.

MCP Server Tools

The Kali container exposes four MCP tools via Streamable HTTP on port 3001:

`write_file(filename, content)`

Writes a file to /app/scripts/ inside the container. Files are volume-mounted to ./scripts/ on the host.

`execute_script(filename, args?)`

Runs python3 /app/scripts/<filename> with optional arguments. Output is captured (stdout + stderr) and truncated to 4000 chars. Timeout: 120 seconds.
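
The capture, timeout, and truncation behaviour can be sketched like this. It is not the code in server.py: `run_captured` and its return shape are hypothetical, and keeping the first 4000 characters (rather than the last) is an assumption.

```python
import subprocess
import sys

MAX_OUTPUT_CHARS = 4000
TIMEOUT_SECONDS = 120


def run_captured(argv, timeout=TIMEOUT_SECONDS, limit=MAX_OUTPUT_CHARS):
    """Run a command, merge stdout and stderr, and cap the returned text."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return {"exit_code": None, "output": f"timed out after {timeout}s", "truncated": False}
    combined = proc.stdout + proc.stderr
    return {
        "exit_code": proc.returncode,
        "output": combined[:limit],          # assumption: keep the head of the output
        "truncated": len(combined) > limit,
    }


# A child process that prints more than the cap.
result = run_captured([sys.executable, "-c", "print('A' * 5000)"])
print(result["exit_code"], len(result["output"]), result["truncated"])
```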

`execute_shell_cmd(command)`

Executes an arbitrary shell command inside the Kali container. Returns exit code, stdout, and stderr. Timeout: 120 seconds. This is the foundation for the OODA loop — the agent runs ad-hoc commands like which, --help, scanning tools, etc.

`manage_packages(action, package_name)`

Package management with two actions:

check: reports whether the tool's executable is found on PATH (uses shutil.which); it does not query the apt database
install: runs apt-get update && apt-get install -y <package> with a 300s timeout

Package names are validated against [a-z0-9\-] to prevent injection. All install actions are logged to /app/logs/installs.log.
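
A sketch of the validation and the two actions, under stated assumptions: the helper names and the `INSTALLED` / `MISSING` / `REJECTED` strings are illustrative, and requiring the whole name to match the character class is an assumption about how the check is anchored. The install helper only builds the command; it does not run it.

```python
import re
import shutil

# Character class quoted from the README; full-string matching is an assumption.
PACKAGE_NAME = re.compile(r"[a-z0-9\-]+")


def is_valid_package_name(name: str) -> bool:
    return PACKAGE_NAME.fullmatch(name) is not None


def check_tool(name: str) -> str:
    """Report whether an executable with this name is on PATH."""
    if not is_valid_package_name(name):
        return "REJECTED"
    return "INSTALLED" if shutil.which(name) else "MISSING"


def install_command(name: str) -> str:
    """Build the install command only after the name passes validation."""
    if not is_valid_package_name(name):
        raise ValueError(f"invalid package name: {name!r}")
    return f"apt-get update && apt-get install -y {name}"


print(is_valid_package_name("wpscan"))           # True
print(is_valid_package_name("nmap; rm -rf /"))   # False
print(install_command("wpscan"))
```

One consequence of this character class is that legitimate Debian package names containing `+` or `.` (for example `g++`) are rejected.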

Skill Library

The agent maintains a persistent skill library in ./skills/*.md. Each skill file contains YAML frontmatter and markdown body:

---
tool_name: "wpscan"
category: "web_scanner"
tags: ["wordpress", "cms", "enumeration"]
description: "WordPress security scanner for users, plugins, and themes."
---
# WPScan
## Key Commands
- Full scan: `wpscan --url <target> --enumerate u,p,t --force`
## Anti-Patterns
- Do not use on non-WordPress sites

Skills are indexed at startup and injected into the agent's system prompt. The agent can:

list_skills() — view all available skills
read_skill_file(tool_name) — load full usage guide before using a tool
save_new_skill(tool_name, content) — persist new knowledge after learning a tool
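
Reading a skill file and deriving its index entry could look like the sketch below. It is a simplified illustration in Python (the actual library code is in src/skills.ts): the parser handles only flat `key: value` lines, leaves list values such as `tags` as raw strings, and the function names are hypothetical.

```python
SKILL_TEXT = '''---
tool_name: "wpscan"
category: "web_scanner"
tags: ["wordpress", "cms", "enumeration"]
description: "WordPress security scanner for users, plugins, and themes."
---
# WPScan
## Key Commands
- Full scan: `wpscan --url <target> --enumerate u,p,t --force`
'''


def parse_skill(text: str) -> dict:
    """Split a skill file into frontmatter fields and markdown body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing frontmatter")
    end = lines.index("---", 1)  # position of the closing delimiter
    meta = {}
    for line in lines[1:end]:
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip().strip('"')
    return {"meta": meta, "body": "\n".join(lines[end + 1:])}


def index_entry(skill: dict) -> str:
    """Only name and description go into the system prompt index."""
    meta = skill["meta"]
    return f'{meta["tool_name"]}: {meta["description"]}'


skill = parse_skill(SKILL_TEXT)
print(index_entry(skill))
```

Keeping the index to one line per skill is what lets the full body stay out of the prompt until `read_skill_file` is called.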

Built-in Skills

| Skill | Category | Description |
| --- | --- | --- |
| `wpscan` | web_scanner | WordPress security scanner for users, plugins, and themes |
| `github-search` | recon | Search GitHub for CVE proof-of-concept exploits via API and clone/adapt them |

Docker Persistence

Named volumes persist the apt package cache and package index when the container is recreated:

volumes:
  - kali_apt_cache:/var/cache/apt    # cached .deb files
  - kali_apt_lib:/var/lib/apt        # package index

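Installed packages themselves live in the container's filesystem and are lost when the container is recreated, so apt-get install has to run again, but it reuses the cached .deb files instead of downloading them. The agent handles re-installation automatically via the OODA loop.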

Design Decisions

Streamable HTTP transport — chosen over stdio/SSE for future cloud deployment of the Kali container on a remote host.
Volume mounts — ./logs, ./scripts, and apt caches persist across container restarts.
network_mode: host — allows the Kali container to scan the host's local network (Linux only).
Output truncation — prevents large tool outputs (nmap, gobuster) from blowing up LLM context.
Skill library — only the index (name + description) goes into the system prompt; full content is loaded on-demand to avoid prompt bloat.
maxTurns cap — agentic loop hard-capped at configurable turn limit (default 15) to prevent token burn.

Testing

The test suite validates the full OODA lifecycle with mocked API and MCP responses:

npx tsx test/test-wpscan-ooda.ts

✅ PASS: Agent checked if wpscan exists (which wpscan)
✅ PASS: Agent verified wpscan is missing (manage_packages check)
✅ PASS: Agent installed wpscan (manage_packages install)
✅ PASS: Agent learned wpscan usage (--help)
✅ PASS: Agent saved wpscan skill file (save_new_skill)
✅ PASS: Agent executed wpscan against http://192.168.1.50
✅ PASS: Final report mentions discovered user 'admin'
✅ PASS: Final report mentions WordPress version 5.8.1
✅ PASS: Completed within turn budget (used 6 of 10)
✅ PASS: wpscan.md skill file exists on disk
✅ PASS: wpscan appears in the skill index
✅ PASS: Skill file has correct category (web_scanner)
✅ PASS: Skill file has wordpress tag
✅ PASS: OODA phases happened in correct order: discover → check → install → learn → execute

RESULTS: 14 passed, 0 failed, 14 total
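
The final ordering assertion amounts to a subsequence check over the recorded tool calls. The sketch below shows one way to express it in Python; the actual test is TypeScript, and the `in_order` helper and the flattened call strings are assumptions.

```python
def in_order(calls, expected):
    """True if each expected marker appears in calls, in the given order."""
    remaining = iter(calls)
    # Sharing one iterator means each search resumes where the previous one stopped.
    return all(any(marker in call for call in remaining) for marker in expected)


recorded = [
    "which wpscan",
    "manage_packages check wpscan",
    "manage_packages install wpscan",
    "wpscan --help | head -80",
    "wpscan --url http://192.168.1.50",
]
phases = ["which wpscan", "check", "install", "--help", "--url"]

print(in_order(recorded, phases))
print(in_order(list(reversed(recorded)), phases))
```

The first call prints `True`; the reversed log fails because the markers no longer appear in discover, check, install, learn, execute order.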

Example

Select option: 4
Describe the pentest task: scan ports 80, 443, 8080 on 127.0.0.1 using socket

[1/3] Generating script with Claude...
[2/3] Writing script to Kali container...
  File written: /app/scripts/poc_1738000000000.py (512 bytes)
[3/3] Executing script...

--- Execution Result ---
Exit code: 0
--- STDOUT ---
Port 80: OPEN
Port 443: CLOSED
Port 8080: CLOSED
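
For reference, a script of the kind this task might produce is sketched below. It is not the actual generated output, which varies from run to run; `scan_ports` and the one-second timeout are choices made for the example.

```python
import socket


def scan_ports(host, ports, timeout=1.0):
    """Attempt a TCP connection to each port and record OPEN or CLOSED."""
    results = {}
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success and an error code otherwise.
            results[port] = "OPEN" if sock.connect_ex((host, port)) == 0 else "CLOSED"
    return results


if __name__ == "__main__":
    for port, state in scan_ports("127.0.0.1", [80, 443, 8080]).items():
        print(f"Port {port}: {state}")
```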
