Pentest Executor

AI-driven penetration testing execution environment. A Kali Linux Docker container runs an MCP server exposing pentest tools. A host-side TypeScript agent uses Claude to generate PoC scripts or run fully autonomous multi-turn attack loops — planning, installing tools, learning usage, executing scans, and recovering from errors automatically.

Architecture

┌─────────────────────────┐        MCP (Streamable HTTP)        ┌─────────────────────────┐
│       Host Machine      │ ──────────── :3001 ────────────────> │   Kali Docker Container │
│                         │                                      │                         │
│  src/index.ts (CLI)     │                                      │  server.py (MCP Server) │
│  src/executor.ts        │  execute_shell_cmd / write_file      │  - execute_shell_cmd    │
│  src/mcp-client.ts      │  execute_script / manage_packages    │  - write_file           │
│  src/skills.ts          │ <────────── results ────────────────>│  - execute_script       │
│                         │                                      │  - manage_packages      │
│  Claude API (Anthropic) │                                      │                         │
│                         │                                      │  nmap, nikto, hydra,    │
│  skills/*.md (library)  │                                      │  sqlmap, gobuster,      │
│                         │                                      │  whatweb, netcat, python │
└─────────────────────────┘                                      └─────────────────────────┘

Project Structure

pentest-executor/
├── docker-compose.yml          # Kali service with volume mounts + apt cache volumes
├── .env.example                # Environment variable template
├── package.json                # Node.js project (ES modules)
├── tsconfig.json               # TypeScript configuration
├── kali/
│   ├── Dockerfile              # Kali Linux + pentest tools + Python MCP server
│   ├── server.py               # MCP server (4 tools, Streamable HTTP, port 3001)
│   └── requirements.txt        # mcp[http], uvicorn, requests, pwntools
├── src/
│   ├── index.ts                # CLI entry point (interactive menu, 7 options)
│   ├── executor.ts             # Core engine: single-shot generation + strategy-selectable agentic loop
│   ├── mcp-client.ts           # MCP client wrapper (Streamable HTTP transport)
│   ├── skills.ts               # Skill library: list, read, save, index skills
│   ├── types.ts                # TypeScript interfaces (TacticalPlan, AgentResult, etc.)
│   └── utils/
│       └── instrumentation.ts  # OpenTelemetry SDK + Langfuse span processor init
├── skills/                     # Persistent skill library (.md files with YAML frontmatter)
│   ├── wpscan.md               # WordPress scanner usage guide
│   └── github-search.md        # GitHub API PoC search patterns and security audit checklist
├── test/
│   └── test-wpscan-ooda.ts     # OODA loop functional test (mock agent + assertions)
├── logs/                       # Volume-mounted — persisted reports
└── scripts/                    # Volume-mounted — persisted PoC scripts

Quick Start

Prerequisites

  • Docker & Docker Compose
  • Node.js >= 18
  • An Anthropic API key

1. Install Node.js Dependencies

npm install

2. Configure Environment

cp .env.example .env

Edit .env and set your ANTHROPIC_API_KEY.

3. Build & Start the Kali Container

# Build the image (first time or after Dockerfile changes)
docker compose build

# Start the container in the background
docker compose up -d

This will:

  • Pull the kalilinux/kali-rolling base image
  • Install pentest tools: nmap, nikto, hydra, sqlmap, gobuster, whatweb, netcat
  • Install Python dependencies: mcp[http], uvicorn, requests, pwntools
  • Start the MCP server listening on port 3001

4. Verify the MCP Server is Running

# Check container status
docker compose ps

# Check server logs
docker compose logs -f kali

# Quick health check — should return 405 (Method Not Allowed = server is up)
curl -s -o /dev/null -w "%{http_code}" http://localhost:3001/mcp

5. Run the CLI

npm run start
# or directly:
npx tsx src/index.ts

6. Run the Test Suite

npx tsx test/test-wpscan-ooda.ts

Container Management

# Stop the container
docker compose down

# Rebuild after changing Dockerfile or server.py
docker compose up -d --build

# Shell into the running container
docker exec -it kali-pentest /bin/bash

# View real-time server logs
docker compose logs -f kali

Environment Variables

| Variable | Default | Description |
|---|---|---|
| ANTHROPIC_API_KEY | (required) | Anthropic API key for Claude |
| MCP_SERVER_URL | http://localhost:3001 | MCP server endpoint |
| CLAUDE_MODEL | claude-sonnet-4-5-20250929 | Claude model to use |
| LANGFUSE_SECRET_KEY | (optional) | Langfuse secret key; enables tracing when set |
| LANGFUSE_PUBLIC_KEY | (optional) | Langfuse public key; enables tracing when set |
| LANGFUSE_BASE_URL | https://cloud.langfuse.com | Langfuse API endpoint |
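
As a rough illustration of how these variables and defaults could be resolved, here is a minimal Python sketch. The real host agent is TypeScript, so the Settings class and load_settings function are hypothetical names, and the rule that tracing needs both Langfuse keys is an assumption rather than something the table states.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    anthropic_api_key: str
    mcp_server_url: str
    claude_model: str
    langfuse_base_url: str
    tracing_enabled: bool


def load_settings(env: Mapping[str, str]) -> Settings:
    """Resolve settings from an environment mapping using the documented defaults."""
    api_key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not api_key:
        raise ValueError("ANTHROPIC_API_KEY is required")
    # Assumption: tracing is enabled only when BOTH Langfuse keys are present.
    tracing = bool(env.get("LANGFUSE_SECRET_KEY")) and bool(env.get("LANGFUSE_PUBLIC_KEY"))
    return Settings(
        anthropic_api_key=api_key,
        mcp_server_url=env.get("MCP_SERVER_URL", "http://localhost:3001"),
        claude_model=env.get("CLAUDE_MODEL", "claude-sonnet-4-5-20250929"),
        langfuse_base_url=env.get("LANGFUSE_BASE_URL", "https://cloud.langfuse.com"),
        tracing_enabled=tracing,
    )
```

Passing os.environ as the mapping would read the values exported from .env; passing a plain dict keeps the function easy to test.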

Langfuse Observability

All execution paths are instrumented with Langfuse via OpenTelemetry for full observability of every Claude API call, tool dispatch, and agentic turn.

Setup

  1. Create a Langfuse account at cloud.langfuse.com (or self-host)
  2. Add your keys to .env:
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_BASE_URL=https://cloud.langfuse.com

Tracing is optional — if the keys are not set, the app runs normally with tracing disabled.

What Gets Traced

Every executor method produces a hierarchical trace:

| Method | Trace Structure |
|---|---|
| generateScript() | generate-script (single span with token usage) |
| autoExecute() | auto-execute → generate-script → write-file → execute-script |
| executeFinal() | execute-final → execute-script |
| generateScriptFromPlan() | generate-script-from-plan (single span with plan metadata + token usage) |
| autoExecuteFromPlan() | auto-execute-from-plan → generate-script-from-plan → write-file → execute-script |
| runAgentLoop() | Full hierarchical trace (see below) |

Agentic Loop Trace Hierarchy

The autonomous mode (runAgentLoop) produces the richest traces:

agent-loop (root)
├── turn-1
│   ├── claude-api-call    (input/output tokens, stop_reason)
│   ├── tool-execute_shell_cmd   (args, result preview)
│   └── tool-write_file          (args, result preview)
├── turn-2
│   ├── claude-api-call
│   └── tool-execute_script
├── ...
└── turn-N
    └── claude-api-call    (end_turn)

Each span records:

  • Input/output data for debugging
  • Token usage (input + output tokens per API call, cumulative on root)
  • Tool arguments and result previews for every MCP dispatch
  • Session ID and tags for filtering in the Langfuse dashboard
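
To make the "cumulative on root" token accounting concrete, the following sketch models the hierarchy as a plain tree. It deliberately does not use the Langfuse or OpenTelemetry SDKs; the Span class and its methods are hypothetical and only illustrate how per-call token counts could roll up to the root span.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Span:
    name: str
    input_tokens: int = 0
    output_tokens: int = 0
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str, input_tokens: int = 0, output_tokens: int = 0) -> "Span":
        """Create a nested span and attach it to this one."""
        span = Span(name, input_tokens, output_tokens)
        self.children.append(span)
        return span

    def total_tokens(self) -> Tuple[int, int]:
        """Return (input, output) tokens summed over this span and all descendants."""
        total_in, total_out = self.input_tokens, self.output_tokens
        for c in self.children:
            child_in, child_out = c.total_tokens()
            total_in += child_in
            total_out += child_out
        return total_in, total_out

    def render(self, indent: int = 0) -> List[str]:
        """Return the tree as indented lines, one per span."""
        lines = ["  " * indent + self.name]
        for c in self.children:
            lines.extend(c.render(indent + 1))
        return lines
```

Building a root span named agent-loop, adding turn-N children, and attaching claude-api-call spans with their token counts reproduces the shape shown above; total_tokens() on the root then gives the cumulative usage.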

CLI Options

The interactive menu provides seven top-level options. Option 5 has four sub-strategies.

Options Overview

| Option | Name | Mode | Human Review | Executor Method(s) |
|---|---|---|---|---|
| 1 | Generate Script | Single-shot | N/A (generate only) | generateScript() |
| 2 | Execute Script | Direct MCP | N/A (execute only) | (direct mcp.executeScript()) |
| 3 | Interactive Run | Single-shot | Yes | generateScript() → executeFinal() |
| 4 | Auto Run | Single-shot | No | autoExecute() |
| 5.1 | Tool-based Autonomy | Agentic loop | No | runAgentWithTacticalPlan(plan, "tool") → runAgentLoop() |
| 5.2 | GitHub PoC Search | Agentic loop | No | runAgentWithTacticalPlan(plan, "github") → runAgentLoop() |
| 5.3 | Manual Construction | Single-shot | No | autoExecuteFromPlan() |
| 5.4 | Interactive Plan | Single-shot | Yes | generateScriptFromPlan() → executeFinal() |
| 6 | Autonomous Run | Agentic loop | No | runAgentLoop() |
| 7 | Exit | | | |

Option 1: Generate Script

Describe a pentest task in natural language. Claude generates a Python exploit script and displays it. You can optionally save it to the Kali container with a custom or auto-generated filename.

Flow: User input → generateScript() → display → optional mcp.writeFile()

Option 2: Execute Script

Run an existing script that is already present in the Kali container's /app/scripts/ directory. Provide the filename and optional command-line arguments.

Flow: Filename + args → mcp.executeScript() → display result

Option 3: Interactive Run (Human-in-the-Loop)

Generate a script, save it locally to .tmp_payload.py for review/editing in your editor, then execute the finalized version after you press Enter.

Flow: User input → generateScript() → save to .tmp_payload.py → user reviews/edits → executeFinal() → display result

Option 4: Auto Run

Fully automated: generate a script with Claude, write it to Kali, and execute it immediately with no human review.

Flow: User input → autoExecute() (generate → write → execute) → display result
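
The generate → write → execute flow can be sketched as a small pipeline. This is a Python illustration only (the actual autoExecute() lives in the TypeScript executor); the three callables are hypothetical stand-ins for the Claude call and the MCP write_file / execute_script tools, and the poc_<milliseconds>.py naming is inferred from the sample output in the Example section.

```python
import time
from typing import Callable, Optional, Tuple


def auto_execute(
    task: str,
    generate: Callable[[str], str],
    write_file: Callable[[str, str], None],
    execute_script: Callable[[str], str],
    now_ms: Optional[int] = None,
) -> Tuple[str, str]:
    """Generate a script for the task, write it, run it, and return (filename, result)."""
    # Assumption: filenames follow poc_<unix time in ms>.py, as in the sample output.
    stamp = now_ms if now_ms is not None else int(time.time() * 1000)
    filename = f"poc_{stamp}.py"
    script = generate(task)            # step 1: ask the model for a script
    write_file(filename, script)       # step 2: persist it in the container
    result = execute_script(filename)  # step 3: run it with no human review
    return filename, result
```

Injecting the three steps as callables keeps the pipeline testable without a running container or an API key.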

Option 5: Load Tactical Plan

Load a JSON Tactical Plan file. The CLI displays a summary with the CVE, MITRE ATT&CK ID, confidence score, rationale, and tool, then asks you to select one of four execution strategies:

  • 5.1 Tool-based Autonomy — Agent uses searchsploit, Metasploit, and other Kali tools to exploit the target. Full plan context injected.
  • 5.2 GitHub PoC Search — Agent searches GitHub API for public PoC exploits, clones them, performs security audit, adapts parameters, and executes.
  • 5.3 Manual Construction — Claude generates a Python exploit from the plan context, writes to Kali, and executes immediately (no agent loop).
  • 5.4 Interactive — Claude generates a script from the plan, saves locally for review/edit, then executes after user confirmation.
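
The plan summary and strategy menu might look roughly like the sketch below. The Tactical Plan schema is not documented here, so every JSON key used (cve_id, mitre_attack_id, confidence, rationale, tool) is a hypothetical placeholder; only the strategy names, modes, and review flags are taken from the Options Overview table.

```python
import json

# Menu key -> (name, mode, human review), copied from the Options Overview table.
STRATEGIES = {
    "5.1": ("Tool-based Autonomy", "agentic loop", False),
    "5.2": ("GitHub PoC Search", "agentic loop", False),
    "5.3": ("Manual Construction", "single-shot", False),
    "5.4": ("Interactive Plan", "single-shot", True),
}

# Hypothetical key names: the real TacticalPlan interface may differ.
SUMMARY_FIELDS = [
    ("CVE", "cve_id"),
    ("MITRE ATT&CK", "mitre_attack_id"),
    ("Confidence", "confidence"),
    ("Rationale", "rationale"),
    ("Tool", "tool"),
]


def summarize_plan(raw_json: str) -> str:
    """Render the summary lines shown before strategy selection."""
    plan = json.loads(raw_json)
    lines = []
    for label, key in SUMMARY_FIELDS:
        value = plan.get(key, "(not provided)")
        if isinstance(value, list):
            value = ", ".join(str(v) for v in value)
        lines.append(f"{label}: {value}")
    return "\n".join(lines)
```

Missing keys are shown as "(not provided)" rather than raising, which suits a summary screen that should still render for partially filled plans.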

Option 6: Autonomous Run (Agentic Loop)

Full multi-turn agentic loop from a free-form task description. Claude plans, executes tools, installs missing packages, handles errors, learns tools, saves skill files, and retries autonomously.

Flow: Task + max turns → runAgentLoop() → multi-turn OODA cycle → display summary + report

Option 7: Exit

Gracefully shuts down the MCP connection, flushes Langfuse traces, and exits.

Autonomous Mode (Agentic Loop)

Option 6 enables a multi-turn agentic execution mode where Claude operates in a closed OODA loop:

User Task ──► Claude plans ──► Tool calls ──► Observe results ──► Iterate
                  │                                                   │
                  └────────────── until task complete ────────────────┘
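
The loop in the diagram, combined with the turn cap described under Design Decisions, reduces to a bounded iteration. The sketch below is a Python illustration with hypothetical names; step stands in for one Claude call plus its tool dispatches and is assumed to return an observation and a completion flag.

```python
from typing import Any, Callable, Dict, List, Tuple

StepFn = Callable[[str, List[Any]], Tuple[Any, bool]]


def run_agent_loop(task: str, step: StepFn, max_turns: int = 15) -> Dict[str, Any]:
    """Iterate plan/act/observe turns until the step reports completion or the cap is hit."""
    history: List[Any] = []
    for turn in range(1, max_turns + 1):
        # One turn: the model sees the task plus everything observed so far.
        observation, done = step(task, history)
        history.append(observation)
        if done:
            return {"completed": True, "turns": turn, "history": history}
    # Hard cap reached: stop rather than keep spending tokens.
    return {"completed": False, "turns": max_turns, "history": history}
```

Returning a completed flag alongside the turn count lets the caller distinguish a finished task from one that simply ran out of turns.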

OODA Cycle

When the agent encounters a missing tool, it follows this protocol automatically:

  1. Observe: run which <tool> to check availability
  2. Orient: call manage_packages(check) to confirm the tool is missing
  3. Decide: call manage_packages(install) to install via apt-get
  4. Act: read --help, save a skill file, then execute the original task
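
The four steps above can be sketched as a single helper. This is an illustration, not the agent's actual code: in the real system Claude decides each tool call itself. The shell and manage_packages callables are hypothetical stand-ins for the MCP tools, shell is assumed to return (exit_code, output), and the MISSING / SUCCESS strings are taken from the WordPress example in this section.

```python
from typing import Callable, List, Tuple

ShellFn = Callable[[str], Tuple[int, str]]
PackageFn = Callable[[str, str], str]


def ensure_tool(tool: str, shell: ShellFn, manage_packages: PackageFn) -> List[str]:
    """Make a tool usable and return the OODA phases that were actually needed."""
    phases = ["observe"]
    exit_code, _ = shell(f"which {tool}")          # Observe: is it on PATH?
    if exit_code == 0:
        return phases                              # already available, nothing to do
    phases.append("orient")
    status = manage_packages("check", tool)        # Orient: confirm it is missing
    if status == "MISSING":
        phases.append("decide")
        result = manage_packages("install", tool)  # Decide: install via apt-get
        if result != "SUCCESS":
            raise RuntimeError(f"could not install {tool}: {result}")
    phases.append("act")
    shell(f"{tool} --help")                        # Act: learn the flags first
    return phases
```

Saving the skill file and running the original task are left to the caller, since both depend on what the help output contains.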

Example: WordPress Scan

User: "Scan http://192.168.1.50 for WordPress vulnerabilities"

Turn 1: execute_shell_cmd("which wpscan")         → not found
Turn 2: manage_packages(check, "wpscan")           → MISSING
Turn 3: manage_packages(install, "wpscan")         → SUCCESS
Turn 4: execute_shell_cmd("wpscan --help | head -80") → learns flags
Turn 5: save_new_skill("wpscan", ...) + execute_shell_cmd("wpscan --url ...") → parallel
Turn 6: Final report with users, plugins, versions, severity ratings

Tactical Plan Strategies

Option 5 loads a JSON Tactical Plan file and presents a strategy selection menu. The plan summary displays CVE ID, MITRE ATT&CK technique, confidence score, rationale tags, and recommended tool before the user selects a strategy.

Strategy 1: Tool-based Autonomy

The agent uses existing Kali tools (Metasploit, searchsploit) to exploit the target. The full Tactical Plan context (all 6 sections: TARGET, VULNERABLE ENDPOINT, PAYLOAD, EXPLOITATION STRATEGY, VULNERABILITY CONTEXT, SUCCESS CRITERIA) is injected into the prompt. The agent follows a 4-phase protocol:

  1. Recon & Tool Selection: searchsploit <CVE>, msfconsole -q -x 'search <CVE>; exit'
  2. Installation & Verification: manage_packages to check/install tools
  3. Configuration & Execution: non-interactive msfconsole -q -x '...; exit' or standalone exploits
  4. Verification & Reporting: validate results against success criteria, save to /app/logs/

Strategy 2: GitHub PoC Search

The agent searches GitHub for public proof-of-concept exploits via the GitHub Search API. It follows a 4-phase protocol:

  1. GitHub API Search: curl queries to find CVE-related repositories sorted by stars
  2. Clone & Inspect: shallow clone, README review, mandatory security audit of exploit code
  3. Adapt for Target: modify IP/port, install dependencies, adjust parameters
  4. Execute & Verify: run the adapted PoC, validate against success criteria

The agent can load the github-search skill file for detailed API patterns, rate limiting guidance, and security audit checklists.

Strategy 3: Manual Construction

Single-shot AI code generation using autoExecuteFromPlan(). Claude generates a complete Python exploit script from the plan context, writes it to Kali, and executes immediately with no human review.

Strategy 4: Interactive

Human-in-the-loop workflow. Claude generates a script from the plan, saves it to .tmp_payload.py for the user to review and edit, then executes the finalized version after user confirmation.

MCP Server Tools

The Kali container exposes four MCP tools via Streamable HTTP on port 3001:

write_file(filename, content)

Writes a file to /app/scripts/ inside the container. Files are volume-mounted to ./scripts/ on the host.

execute_script(filename, args?)

Runs python3 /app/scripts/<filename> with optional arguments. Output is captured (stdout + stderr) and truncated to 4000 chars. Timeout: 120 seconds.
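
A minimal sketch of the capture, timeout, and truncation behaviour described above, using only the standard library. This is not the code from server.py: the function names, the truncation marker, and the way stdout and stderr are concatenated are all assumptions.

```python
import subprocess
import sys
from typing import Sequence

MAX_OUTPUT_CHARS = 4000   # documented truncation limit
TIMEOUT_SECONDS = 120     # documented timeout


def truncate_output(text: str, limit: int = MAX_OUTPUT_CHARS) -> str:
    """Keep at most `limit` characters and mark the cut so the reader knows output is partial."""
    if len(text) <= limit:
        return text
    return text[:limit] + "\n...[truncated]"  # marker text is an assumption


def run_script(path: str, args: Sequence[str] = (), timeout: int = TIMEOUT_SECONDS) -> str:
    """Run a Python script, capturing stdout and stderr, with a hard timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, path, *args],  # argv list, so no shell interpolation
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"Timed out after {timeout} seconds"
    return truncate_output(proc.stdout + proc.stderr)
```

Truncating after capture keeps the limit independent of how chatty a given script is, which is what protects the LLM context from large scan outputs.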

execute_shell_cmd(command)

Executes an arbitrary shell command inside the Kali container. Returns exit code, stdout, and stderr. Timeout: 120 seconds. This is the foundation for the OODA loop — the agent runs ad-hoc commands like which, --help, scanning tools, etc.

manage_packages(action, package_name)

Package management with two actions:

  • check: reports whether the tool is available by looking up its executable on PATH (uses shutil.which)
  • install: runs apt-get update && apt-get install -y <package> with a 300s timeout

Package names are validated against [a-z0-9\-] to prevent injection. All install actions are logged to /app/logs/installs.log.
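
The validation rule can be illustrated with a short sketch. The character class comes from the text above; the function names are hypothetical, and the extra rejection of a leading hyphen is an added precaution that the source does not mention (a name such as -y would otherwise pass the character check and be read by apt-get as an option).

```python
import re
from typing import List

# Character class from the documentation: lowercase letters, digits, hyphen.
_PACKAGE_RE = re.compile(r"[a-z0-9\-]+")


def is_valid_package_name(name: str) -> bool:
    """Accept only names made entirely of the allowed characters."""
    if name.startswith("-"):
        # Assumption beyond the documented rule: never let a name look like a flag.
        return False
    return _PACKAGE_RE.fullmatch(name) is not None


def build_install_argv(name: str) -> List[str]:
    """Return the apt-get install command as an argv list, or raise on a bad name."""
    if not is_valid_package_name(name):
        raise ValueError(f"invalid package name: {name!r}")
    return ["apt-get", "install", "-y", name]
```

Using fullmatch rather than search matters here: a partial match would accept strings like "wpscan; rm -rf /" because they contain a valid prefix.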

Skill Library

The agent maintains a persistent skill library in ./skills/*.md. Each skill file contains YAML frontmatter and markdown body:

---
tool_name: "wpscan"
category: "web_scanner"
tags: ["wordpress", "cms", "enumeration"]
description: "WordPress security scanner for users, plugins, and themes."
---
# WPScan
## Key Commands
- Full scan: `wpscan --url <target> --enumerate u,p,t --force`
## Anti-Patterns
- Do not use on non-WordPress sites

Skills are indexed at startup and injected into the agent's system prompt. The agent can:

  • list_skills() — view all available skills
  • read_skill_file(tool_name) — load full usage guide before using a tool
  • save_new_skill(tool_name, content) — persist new knowledge after learning a tool

Built-in Skills

| Skill | Category | Description |
|---|---|---|
| wpscan | web_scanner | WordPress security scanner for users, plugins, and themes |
| github-search | recon | Search GitHub for CVE proof-of-concept exploits via API and clone/adapt them |

Docker Persistence

Named volumes persist the apt package cache across container restarts and recreation:

volumes:
  - kali_apt_cache:/var/cache/apt    # cached .deb files
  - kali_apt_lib:/var/lib/apt        # package index

Installed packages themselves are not persisted: after the container is recreated, apt-get install has to run again, but it reuses the cached .deb files instead of re-downloading them. The agent handles re-installation automatically via the OODA loop.

Design Decisions

  • Streamable HTTP transport: chosen over stdio/SSE for future cloud deployment of the Kali container on a remote host.
  • Volume mounts: ./logs, ./scripts, and apt caches persist across container restarts.
  • network_mode: host allows the Kali container to scan the host's local network (Linux only).
  • Output truncation: prevents large tool outputs (nmap, gobuster) from blowing up LLM context.
  • Skill library: only the index (name + description) goes into the system prompt; full content is loaded on demand to avoid prompt bloat.
  • maxTurns cap: the agentic loop is hard-capped at a configurable turn limit (default 15) to prevent token burn.

Testing

The test suite validates the full OODA lifecycle with mocked API and MCP responses:

npx tsx test/test-wpscan-ooda.ts
✅ PASS: Agent checked if wpscan exists (which wpscan)
✅ PASS: Agent verified wpscan is missing (manage_packages check)
✅ PASS: Agent installed wpscan (manage_packages install)
✅ PASS: Agent learned wpscan usage (--help)
✅ PASS: Agent saved wpscan skill file (save_new_skill)
✅ PASS: Agent executed wpscan against http://192.168.1.50
✅ PASS: Final report mentions discovered user 'admin'
✅ PASS: Final report mentions WordPress version 5.8.1
✅ PASS: Completed within turn budget (used 6 of 10)
✅ PASS: wpscan.md skill file exists on disk
✅ PASS: wpscan appears in the skill index
✅ PASS: Skill file has correct category (web_scanner)
✅ PASS: Skill file has wordpress tag
✅ PASS: OODA phases happened in correct order: discover → check → install → learn → execute

RESULTS: 14 passed, 0 failed, 14 total

Example

Select option: 4
Describe the pentest task: scan ports 80, 443, 8080 on 127.0.0.1 using socket

[1/3] Generating script with Claude...
[2/3] Writing script to Kali container...
  File written: /app/scripts/poc_1738000000000.py (512 bytes)
[3/3] Executing script...

--- Execution Result ---
Exit code: 0
--- STDOUT ---
Port 80: OPEN
Port 443: CLOSED
Port 8080: CLOSED
