AI-driven penetration testing execution environment. A Kali Linux Docker container runs an MCP server exposing pentest tools. A host-side TypeScript agent uses Claude to generate PoC scripts or run fully autonomous multi-turn attack loops — planning, installing tools, learning usage, executing scans, and recovering from errors automatically.
┌─────────────────────────┐ MCP (Streamable HTTP) ┌─────────────────────────┐
│ Host Machine │ ──────────── :3001 ────────────────> │ Kali Docker Container │
│ │ │ │
│ src/index.ts (CLI) │ │ server.py (MCP Server) │
│ src/executor.ts │ execute_shell_cmd / write_file │ - execute_shell_cmd │
│ src/mcp-client.ts │ execute_script / manage_packages │ - write_file │
│ src/skills.ts │ <────────── results ────────────────>│ - execute_script │
│ │ │ - manage_packages │
│ Claude API (Anthropic) │ │ │
│ │ │ nmap, nikto, hydra, │
│ skills/*.md (library) │ │ sqlmap, gobuster, │
│ │ │ whatweb, netcat, python │
└─────────────────────────┘ └─────────────────────────┘
pentest-executor/
├── docker-compose.yml # Kali service with volume mounts + apt cache volumes
├── .env.example # Environment variable template
├── package.json # Node.js project (ES modules)
├── tsconfig.json # TypeScript configuration
├── kali/
│ ├── Dockerfile # Kali Linux + pentest tools + Python MCP server
│ ├── server.py # MCP server (4 tools, Streamable HTTP, port 3001)
│ └── requirements.txt # mcp[http], uvicorn, requests, pwntools
├── src/
│ ├── index.ts # CLI entry point (interactive menu, 7 options)
│ ├── executor.ts # Core engine: single-shot generation + strategy-selectable agentic loop
│ ├── mcp-client.ts # MCP client wrapper (Streamable HTTP transport)
│ ├── skills.ts # Skill library: list, read, save, index skills
│ ├── types.ts # TypeScript interfaces (TacticalPlan, AgentResult, etc.)
│ └── utils/
│ └── instrumentation.ts # OpenTelemetry SDK + Langfuse span processor init
├── skills/ # Persistent skill library (.md files with YAML frontmatter)
│ ├── wpscan.md # WordPress scanner usage guide
│ └── github-search.md # GitHub API PoC search patterns and security audit checklist
├── test/
│ └── test-wpscan-ooda.ts # OODA loop functional test (mock agent + assertions)
├── logs/ # Volume-mounted — persisted reports
└── scripts/ # Volume-mounted — persisted PoC scripts
- Docker & Docker Compose
- Node.js >= 18
- An Anthropic API key
npm installcp .env.example .envEdit .env and set your ANTHROPIC_API_KEY.
# Build the image (first time or after Dockerfile changes)
docker compose build
# Start the container in the background
docker compose up -dThis will:
- Pull the
kalilinux/kali-rollingbase image - Install pentest tools:
nmap,nikto,hydra,sqlmap,gobuster,whatweb,netcat - Install Python dependencies:
mcp[http],uvicorn,requests,pwntools - Start the MCP server listening on port 3001
# Check container status
docker compose ps
# Check server logs
docker compose logs -f kali
# Quick health check — should return 405 (Method Not Allowed = server is up)
curl -s -o /dev/null -w "%{http_code}" http://localhost:3001/mcpnpm run start
# or directly:
npx tsx src/index.tsnpx tsx test/test-wpscan-ooda.ts# Stop the container
docker compose down
# Rebuild after changing Dockerfile or server.py
docker compose up -d --build
# Shell into the running container
docker exec -it kali-pentest /bin/bash
# View real-time server logs
docker compose logs -f kali| Variable | Default | Description |
|---|---|---|
ANTHROPIC_API_KEY |
(required) | Anthropic API key for Claude |
MCP_SERVER_URL |
http://localhost:3001 |
MCP server endpoint |
CLAUDE_MODEL |
claude-sonnet-4-5-20250929 |
Claude model to use |
LANGFUSE_SECRET_KEY |
(optional) | Langfuse secret key — enables tracing when set |
LANGFUSE_PUBLIC_KEY |
(optional) | Langfuse public key — enables tracing when set |
LANGFUSE_BASE_URL |
https://cloud.langfuse.com |
Langfuse API endpoint |
All execution paths are instrumented with Langfuse via OpenTelemetry for full observability of every Claude API call, tool dispatch, and agentic turn.
- Create a Langfuse account at cloud.langfuse.com (or self-host)
- Add your keys to
.env:
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_BASE_URL=https://cloud.langfuse.comTracing is optional — if the keys are not set, the app runs normally with tracing disabled.
Every executor method produces a hierarchical trace:
| Method | Trace Structure |
|---|---|
generateScript() |
generate-script — single span with token usage |
autoExecute() |
auto-execute → generate-script → write-file → execute-script |
executeFinal() |
execute-final → execute-script |
generateScriptFromPlan() |
generate-script-from-plan — single span with plan metadata + token usage |
autoExecuteFromPlan() |
auto-execute-from-plan → generate-script-from-plan → write-file → execute-script |
runAgentLoop() |
Full hierarchical trace (see below) |
The autonomous mode (runAgentLoop) produces the richest traces:
agent-loop (root)
├── turn-1
│ ├── claude-api-call (input/output tokens, stop_reason)
│ ├── tool-execute_shell_cmd (args, result preview)
│ └── tool-write_file (args, result preview)
├── turn-2
│ ├── claude-api-call
│ └── tool-execute_script
├── ...
└── turn-N
└── claude-api-call (end_turn)
Each span records:
- Input/output data for debugging
- Token usage (input + output tokens per API call, cumulative on root)
- Tool arguments and result previews for every MCP dispatch
- Session ID and tags for filtering in the Langfuse dashboard
The interactive menu provides seven top-level options. Option 5 has four sub-strategies.
| Option | Name | Mode | Human Review | Executor Method(s) |
|---|---|---|---|---|
| 1 | Generate Script | Single-shot | N/A (generate only) | generateScript() |
| 2 | Execute Script | Direct MCP | N/A (execute only) | (direct mcp.executeScript()) |
| 3 | Interactive Run | Single-shot | Yes | generateScript() → executeFinal() |
| 4 | Auto Run | Single-shot | No | autoExecute() |
| 5.1 | Tool-based Autonomy | Agentic loop | No | runAgentWithTacticalPlan(plan, "tool") → runAgentLoop() |
| 5.2 | GitHub PoC Search | Agentic loop | No | runAgentWithTacticalPlan(plan, "github") → runAgentLoop() |
| 5.3 | Manual Construction | Single-shot | No | autoExecuteFromPlan() |
| 5.4 | Interactive Plan | Single-shot | Yes | generateScriptFromPlan() → executeFinal() |
| 6 | Autonomous Run | Agentic loop | No | runAgentLoop() |
| 7 | Exit | — | — | — |
Describe a pentest task in natural language. Claude generates a Python exploit script and displays it. You can optionally save it to the Kali container with a custom or auto-generated filename.
Flow: User input → generateScript() → display → optional mcp.writeFile()
Run an existing script that is already present in the Kali container's /app/scripts/ directory. Provide the filename and optional command-line arguments.
Flow: Filename + args → mcp.executeScript() → display result
Generate a script, save it locally to .tmp_payload.py for review/editing in your editor, then execute the finalized version after you press Enter.
Flow: User input → generateScript() → save to .tmp_payload.py → user reviews/edits → executeFinal() → display result
Fully automated: generate a script with Claude, write it to Kali, and execute it immediately with no human review.
Flow: User input → autoExecute() (generate → write → execute) → display result
Load a JSON Tactical Plan file. Displays a summary with CVE, MITRE ATT&CK ID, confidence score, rationale, and tool. Then select one of four execution strategies:
- 5.1 Tool-based Autonomy — Agent uses searchsploit, Metasploit, and other Kali tools to exploit the target. Full plan context injected.
- 5.2 GitHub PoC Search — Agent searches GitHub API for public PoC exploits, clones them, performs security audit, adapts parameters, and executes.
- 5.3 Manual Construction — Claude generates a Python exploit from the plan context, writes to Kali, and executes immediately (no agent loop).
- 5.4 Interactive — Claude generates a script from the plan, saves locally for review/edit, then executes after user confirmation.
Full multi-turn agentic loop from a free-form task description. Claude plans, executes tools, installs missing packages, handles errors, learns tools, saves skill files, and retries autonomously.
Flow: Task + max turns → runAgentLoop() → multi-turn OODA cycle → display summary + report
Gracefully shuts down the MCP connection, flushes Langfuse traces, and exits.
Option 6 enables a multi-turn agentic execution mode where Claude operates in a closed OODA loop:
User Task ──► Claude plans ──► Tool calls ──► Observe results ──► Iterate
│ │
└────────────── until task complete ────────────────┘
When the agent encounters a missing tool, it follows this protocol automatically:
- Observe — Run
which <tool>to check availability - Orient —
manage_packages(check)to confirm the tool is missing - Decide —
manage_packages(install)to install via apt-get - Act — Read
--help, save a skill file, then execute the original task
User: "Scan http://192.168.1.50 for WordPress vulnerabilities"
Turn 1: execute_shell_cmd("which wpscan") → not found
Turn 2: manage_packages(check, "wpscan") → MISSING
Turn 3: manage_packages(install, "wpscan") → SUCCESS
Turn 4: execute_shell_cmd("wpscan --help | head -80") → learns flags
Turn 5: save_new_skill("wpscan", ...) + execute_shell_cmd("wpscan --url ...") → parallel
Turn 6: Final report with users, plugins, versions, severity ratings
Option 5 loads a JSON Tactical Plan file and presents a strategy selection menu. The plan summary displays CVE ID, MITRE ATT&CK technique, confidence score, rationale tags, and recommended tool before the user selects a strategy.
The agent uses existing Kali tools (Metasploit, searchsploit) to exploit the target. The full Tactical Plan context (all 6 sections: TARGET, VULNERABLE ENDPOINT, PAYLOAD, EXPLOITATION STRATEGY, VULNERABILITY CONTEXT, SUCCESS CRITERIA) is injected into the prompt. The agent follows a 4-phase protocol:
- Recon & Tool Selection —
searchsploit <CVE>,msfconsole -q -x 'search <CVE>; exit' - Installation & Verification —
manage_packagesto check/install tools - Configuration & Execution — Non-interactive
msfconsole -q -x '...; exit'or standalone exploits - Verification & Reporting — Validate results against success criteria, save to
/app/logs/
The agent searches GitHub for public proof-of-concept exploits via the GitHub Search API. It follows a 4-phase protocol:
- GitHub API Search —
curlqueries to find CVE-related repositories sorted by stars - Clone & Inspect — Shallow clone, README review, mandatory security audit of exploit code
- Adapt for Target — Modify IP/port, install dependencies, adjust parameters
- Execute & Verify — Run the adapted PoC, validate against success criteria
The agent can load the github-search skill file for detailed API patterns, rate limiting guidance, and security audit checklists.
Single-shot AI code generation using autoExecuteFromPlan(). Claude generates a complete Python exploit script from the plan context, writes it to Kali, and executes immediately with no human review.
Human-in-the-loop workflow. Claude generates a script from the plan, saves it to .tmp_payload.py for the user to review and edit, then executes the finalized version after user confirmation.
The Kali container exposes four MCP tools via Streamable HTTP on port 3001:
Writes a file to /app/scripts/ inside the container. Files are volume-mounted to ./scripts/ on the host.
Runs python3 /app/scripts/<filename> with optional arguments. Output is captured (stdout + stderr) and truncated to 4000 chars. Timeout: 120 seconds.
Executes an arbitrary shell command inside the Kali container. Returns exit code, stdout, and stderr. Timeout: 120 seconds. This is the foundation for the OODA loop — the agent runs ad-hoc commands like which, --help, scanning tools, etc.
Package management with two actions:
check— verifies if a package is installed (usesshutil.which)install— runsapt-get update && apt-get install -y <package>with a 300s timeout
Package names are validated against [a-z0-9\-] to prevent injection. All install actions are logged to /app/logs/installs.log.
The agent maintains a persistent skill library in ./skills/*.md. Each skill file contains YAML frontmatter and markdown body:
---
tool_name: "wpscan"
category: "web_scanner"
tags: ["wordpress", "cms", "enumeration"]
description: "WordPress security scanner for users, plugins, and themes."
---
# WPScan
## Key Commands
- Full scan: `wpscan --url <target> --enumerate u,p,t --force`
## Anti-Patterns
- Do not use on non-WordPress sitesSkills are indexed at startup and injected into the agent's system prompt. The agent can:
list_skills()— view all available skillsread_skill_file(tool_name)— load full usage guide before using a toolsave_new_skill(tool_name, content)— persist new knowledge after learning a tool
| Skill | Category | Description |
|---|---|---|
wpscan |
web_scanner | WordPress security scanner for users, plugins, and themes |
github-search |
recon | Search GitHub for CVE proof-of-concept exploits via API and clone/adapt them |
Named volumes persist apt package cache across container restarts:
volumes:
- kali_apt_cache:/var/cache/apt # cached .deb files
- kali_apt_lib:/var/lib/apt # package indexThis means apt-get install re-runs after container recreation but skips re-downloading packages. The agent handles re-installation automatically via the OODA loop.
- Streamable HTTP transport — chosen over stdio/SSE for future cloud deployment of the Kali container on a remote host.
- Volume mounts —
./logs,./scripts, and apt caches persist across container restarts. network_mode: host— allows the Kali container to scan the host's local network (Linux only).- Output truncation — prevents large tool outputs (nmap, gobuster) from blowing up LLM context.
- Skill library — only the index (name + description) goes into the system prompt; full content is loaded on-demand to avoid prompt bloat.
- maxTurns cap — agentic loop hard-capped at configurable turn limit (default 15) to prevent token burn.
The test suite validates the full OODA lifecycle with mocked API and MCP responses:
npx tsx test/test-wpscan-ooda.ts✅ PASS: Agent checked if wpscan exists (which wpscan)
✅ PASS: Agent verified wpscan is missing (manage_packages check)
✅ PASS: Agent installed wpscan (manage_packages install)
✅ PASS: Agent learned wpscan usage (--help)
✅ PASS: Agent saved wpscan skill file (save_new_skill)
✅ PASS: Agent executed wpscan against http://192.168.1.50
✅ PASS: Final report mentions discovered user 'admin'
✅ PASS: Final report mentions WordPress version 5.8.1
✅ PASS: Completed within turn budget (used 6 of 10)
✅ PASS: wpscan.md skill file exists on disk
✅ PASS: wpscan appears in the skill index
✅ PASS: Skill file has correct category (web_scanner)
✅ PASS: Skill file has wordpress tag
✅ PASS: OODA phases happened in correct order: discover → check → install → learn → execute
RESULTS: 14 passed, 0 failed, 14 total
Select option: 4
Describe the pentest task: scan ports 80, 443, 8080 on 127.0.0.1 using socket
[1/3] Generating script with Claude...
[2/3] Writing script to Kali container...
File written: /app/scripts/poc_1738000000000.py (512 bytes)
[3/3] Executing script...
--- Execution Result ---
Exit code: 0
--- STDOUT ---
Port 80: OPEN
Port 443: CLOSED
Port 8080: CLOSED