# Snapshot Notebook Abide Persistence
Code analysis pipeline exposing snapshot types via Model Context Protocol (MCP). Parses code, documents, data, and config files into structured DB snapshots for targeted AI retrieval.
- Quick Start
- Agent & Client Setup
- LLM Permission Model
- Agent Workflow
- Binary File Headers
- Nim Parser
- Admin CLI
- Available MCP Tools
- Snapshot Types
- Storage Architecture
- Configuration
- Logging
- Architecture
- Security
- Requirements
- Troubleshooting
## Quick Start

SNAP is distributed as a self-contained binary — no Python installation required.
Download the binary for your platform from the latest release:
| Platform | Binary |
|---|---|
| Windows (x86-64) | snap-mcp.exe |
| Linux (x86-64) | snap-mcp |
| macOS (arm64) | snap-mcp |
Place the binary in a directory of your choice, e.g. C:\Users\<username>\snap\.
On Linux/macOS, make it executable:

```bash
chmod +x snap-mcp
```

All parsers (Nim, tree-sitter, semgrep) are bundled — no additional installs required.
Create the data directories:

```bash
mkdir -p data/logs data/staging data/repos data/projects
```

Create a `.env` file in the same directory as the binary:

```ini
# SQLite is the default — no additional setup required
SNAP_DB_MODE=sqlite
SNAP_SQLITE_PATH=data/snap.db

# Optional: PostgreSQL
# SNAP_DB_MODE=postgres
# SNAP_POSTGRES_DSN=postgresql://user:pass@localhost:5432/snap

# Logging
SNAP_LOG_LEVEL=INFO
SNAP_LOG_JSON=true
```

Run the binary to verify:

```bash
# Windows
snap-mcp.exe --help

# Linux / macOS
./snap-mcp --help
```

Enterprise licensing and source access inquiries: cll.automata@outlook.com
## Agent & Client Setup

See `agent_setups.md` for setup guides covering Claude Code, Claude Desktop, GitHub Copilot Chat, HTTP+SSE, and Azure cloud deployment.
## LLM Permission Model

The LLM has strictly limited rights enforced at runtime in `app/mcp/tools.py` — not by convention or docstrings.
DB snapshot reads only. The LLM never reads raw files.
| Tool | Notes |
|---|---|
| `get_project_manifest` | Read processing stats from DB |
| `query_snapshots` | Query DB by type or file path |
| `get_system_metrics` | Read system-wide aggregated metrics |
| `list_projects` | List all projects in DB |
| `list_runs` | List processing runs for a project |
| Tool | Notes |
|---|---|
| `get_project_notebook` | Read assembled project snapshots from DB |
| `clone_to_repos` | Clones GitHub repo into `repos/` — auto-ingests in background; LLM does not read files |
| `copy_to_staging` | Copies local dir to `staging/` — auto-ingests in background; LLM does not read files |
| `upload_to_staging` | Upload file content to staging |
| `get_staging_info` | File names, sizes, timestamps only — no file content |
| `clear_staging` | Delete staging files for a project |
| `kill_task` | Cancel a stuck async tool call |
| Tool | Reason |
|---|---|
| `delete_project` | No delete rights |
| `promote_run` | No write rights |
| `process_local_project` | No ingest rights |
The LLM never: reads raw files, reads GitHub raw content, ingests files, sorts/filters files, or processes files. SNAP does all of this.
## Agent Workflow

SNAP is the ingest engine. The LLM stages content — SNAP ingests it.
GitHub:

```
LLM: clone_to_repos(repo_url, vendor_id)
  └─► Clones into repos/{project_id}/. project_id = repo name, derived by SNAP.
SNAP: auto-ingests in background thread → stores in DB → clears repos/
LLM (on request): query_snapshots / get_project_notebook
```

Local:

```
LLM: copy_to_staging(project_id, source_path)
  └─► Copies files to staging/{project_id}/. Returns immediately.
SNAP: auto-ingests in background thread → stores in DB → clears staging/
LLM (on request): query_snapshots / get_project_notebook
```
Rules:
- LLM stages ONE operation: clone trigger (GitHub) or staging copy (local)
- LLM does NOT ingest, filter, read, or process files — ever
- All filtering and ingest happens inside SNAP
- LLM reads only structured snapshot data from DB
## Binary File Headers

SNAP uses binary file headers to associate files with projects without requiring directory structure.
```
FileHeader (variable size):
    magic: "SNAPFILE" (8 bytes)
    version: uint16 (2 bytes)
    project_id_len: uint16 (2 bytes)
    project_id: utf-8 string (variable)
    snapshot_count: uint32 (4 bytes)
    [file content follows]
```
```python
from app.extraction.binary_packer import write_file_header, read_project_id_from_file

# Write file with project association
content = b"# Project Notes\n\nImplementation details..."
write_file_header("notes.md", "SNAP", content)

# Read project_id from file
project_id = read_project_id_from_file("notes.md")  # Returns "SNAP"
```

Flow:

1. File with binary header uploaded via `upload_to_staging`
2. SNAP reads header → extracts project_id
3. File placed in `staging/{project_id}/`
4. Auto-ingested into the `{project_id}` project in background
5. Staging cleared
Use Cases:
- Chat conversation logs (project_id = working project name)
- Project notes and documentation
- Context files for RAG queries
- Cross-project file sharing with explicit ownership
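For illustration, the header layout above can be sketched with Python's `struct` module. This is a minimal sketch, not the actual `binary_packer` implementation; little-endian byte order is an assumption:

```python
import struct

MAGIC = b"SNAPFILE"  # 8-byte magic from the layout above

def pack_file_header(project_id: str, snapshot_count: int = 0, version: int = 1) -> bytes:
    """Pack a FileHeader per the documented layout (illustrative sketch)."""
    pid = project_id.encode("utf-8")
    return (MAGIC
            + struct.pack("<H", version)          # version: uint16
            + struct.pack("<H", len(pid))         # project_id_len: uint16
            + pid                                 # project_id: variable
            + struct.pack("<I", snapshot_count))  # snapshot_count: uint32

def unpack_project_id(data: bytes) -> str:
    """Read the project_id back out of a packed header."""
    assert data[:8] == MAGIC, "not a SNAPFILE header"
    (pid_len,) = struct.unpack_from("<H", data, 10)  # skip magic (8) + version (2)
    return data[12:12 + pid_len].decode("utf-8")

header = pack_file_header("SNAP", snapshot_count=3)
print(unpack_project_id(header))  # SNAP
```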
## Nim Parser

High-performance native parser bundled inside the snap-mcp binary. Handles all document, data, and config formats.
| Snap Type | Formats | Output Fields |
|---|---|---|
| `text` (DocGraph) | `.md`, `.html`, `.htm`, `.docx`, `.pdf`, `.txt`, `.rtf` | `doc.*` |
| `csv` | `.csv`, `.tsv`, `.xml` (data) | `csv.*` |
| `config` | `.json`, `.jsonl`, `.xml` (config), `.yaml`, `.yml`, `.toml` | `config.*` |
XML is auto-classified at parse time: doc-like tags → text, repeated record rows → csv, everything else → config.
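The classification heuristic could look roughly like the following Python sketch. The tag list and the repeated-row threshold are assumptions for illustration; the real logic lives in the Nim parser and may differ:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed document-style tags; the actual list is internal to the Nim parser.
DOC_TAGS = {"html", "body", "article", "section", "chapter", "p", "h1"}

def classify_xml(text: str) -> str:
    """Return 'text', 'csv', or 'config' for an XML payload."""
    root = ET.fromstring(text)
    tags = {el.tag.lower() for el in root.iter()}
    if tags & DOC_TAGS:
        return "text"                      # doc-like tags → text
    child_tags = Counter(child.tag for child in root)
    if child_tags and child_tags.most_common(1)[0][1] >= 3:
        return "csv"                       # repeated record rows → csv
    return "config"                        # everything else → config

print(classify_xml("<rows><r/><r/><r/></rows>"))   # csv
print(classify_xml("<settings><db/></settings>"))  # config
```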
| Operation | Python (before) | Nim | Speedup |
|---|---|---|---|
| Parse 1MB markdown | ~450ms | ~8ms | 56x |
| Extract CSV schema | ~180ms | ~3ms | 60x |
| Parse config JSON | ~120ms | ~4ms | 30x |
Note: Tree-sitter and semgrep remain in Python (external tools, already optimized).
## Admin CLI

Human-only operations that bypass MCP entirely. Install with `pip install -e .`, then use `snap-admin`.
```bash
# List all ingested projects with snapshot and run counts
snap-admin list-projects

# Show all runs for a project (active / superseded / failed)
snap-admin runs <project_id>

# Health check and active-run summary for a project
snap-admin manifest <project_id>

# Browse snapshots — summary by type, or drill in by type or file
snap-admin snapshots <project_id>
snap-admin snapshots <project_id> --type <snapshot_type>
snap-admin snapshots <project_id> --file <source_file_path>

# Delete a project and all its data (DB, repos, staging)
snap-admin delete-project <project_id>

# Copy a local directory into staging for a project
snap-admin upload-to-staging <project_id> <source_path>

# Clone a GitHub repo directly (no LLM involved) — repos_watcher ingests
snap-admin clone-github <repo_url>
```

Also callable as `python -m app.admin <command>`.
## Available MCP Tools

| Tool | Permission | Description |
|---|---|---|
| `get_project_notebook` | Allowed | Read complete project snapshots from DB |
| `get_project_manifest` | Allowed | Read processing stats from DB |
| `query_snapshots` | Allowed | Query by snapshot type or file path |
| `get_system_metrics` | Allowed | System-wide aggregated metrics |
| `list_projects` | Allowed | List all projects with snapshot counts |
| `list_runs` | Allowed | List processing runs for a project |
| `clone_to_repos` | Approval required | Clone GitHub repo → auto-ingests in background |
| `copy_to_staging` | Approval required | Copy local directory into staging |
| `upload_to_staging` | Approval required | Upload file content to staging |
| `get_staging_info` | Approval required | List staging file names, sizes, timestamps |
| `clear_staging` | Approval required | Clear all staging files for a project |
| `kill_task` | Approval required | Cancel a stuck async tool call |
| `delete_project` | Blocked | LLM has no delete rights — use snap-admin |
| `promote_run` | Blocked | LLM has no write rights |
| `process_local_project` | Blocked | LLM has no ingest rights |
## Snapshot Types

**Code:**

| Type | Parser | Description |
|---|---|---|
| `file_metadata` | tree_sitter | Path, language, LOC, package info |
| `imports` | tree_sitter | External and internal module dependencies |
| `exports` | tree_sitter | Functions, classes, constants, types |
| `functions` | tree_sitter | Names, signatures, async status, decorators |
| `functions_core` | tree_sitter | Full function bodies, docstrings, return types, parameters |
| `classes` | tree_sitter | Names, inheritance, methods, properties |
| `connections` | tree_sitter | Dependencies, function calls, instantiations |
**Security & quality:**

| Type | Parser | Description |
|---|---|---|
| `security` | semgrep | Vulnerabilities, secrets, SQL injection, XSS |
| `quality` | semgrep | Antipatterns, code smells, TODOs, deprecated usage |
**Documents:**

| Type | Parser | Description |
|---|---|---|
| `doc_metadata` | nim_parser | Title, author, date, version, language |
| `doc_content` | nim_parser | Sections, URLs, code snippets |
| `doc_analysis` | nim_parser | Requirements, entities, references, related files |
Supported: .md, .html, .docx, .pdf, .txt, .rtf — and .xml when classified as a document.
**Tabular data:**

| Type | Parser | Description |
|---|---|---|
| `csv_schema` | nim_parser | Column names, inferred types, column count |
| `csv_data` | nim_parser | Row count, null counts, unique counts, first 5 rows |
Supported: .csv, .tsv, .xml (when classified as row data).
**Config:**

| Type | Parser | Description |
|---|---|---|
| `config_metadata` | nim_parser | Top-level keys, nested paths, env vars, DB strings, API endpoints/hosts |
Supported: .json, .jsonl, .yaml, .yml, .toml, .xml (when classified as config).
## Storage Architecture

SNAP uses a hybrid storage model with a binary snapshot format for efficient Nim integration.
| Mode | Storage | Use Case |
|---|---|---|
| `sqlite` | SQLite (default) | Single-user, embedded, zero-config |
| `postgres` | PostgreSQL | Multi-user, networked, production |
| `dual` | Both | Development, migration, redundancy |
Set via `.env`:

```ini
SNAP_DB_MODE=sqlite      # Default
SNAP_DB_MODE=postgres    # Requires SNAP_POSTGRES_DSN
SNAP_DB_MODE=dual        # Both databases
```

Snapshots are stored as binary-packed data for performance and Nim compatibility.
Snapshot structure:

```
SnapshotHeader (561 bytes):
    magic: "SNAP" (4 bytes)
    version: uint16 (2 bytes)
    snapshot_type: uint8 (1 byte)
    field_count: uint16 (2 bytes)
    content_hash: SHA-256 (32 bytes)
    simhash: uint64 (8 bytes)
    minhash: 128 × uint32 (512 bytes)

FieldDescriptor (11 bytes each):
    field_id: uint16 (2 bytes)
    data_type: uint8 (1 byte)   # 0=string, 1=int, 2=binary, 3=array
    offset: uint32 (4 bytes)
    length: uint32 (4 bytes)

Data Block (variable):
    Packed field data referenced by descriptors
```
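As a sketch of how the fixed 561-byte header unpacks with Python's `struct` module — little-endian byte order is an assumption here, and the real packer/unpacker is `binary_packer.py`:

```python
import struct

# magic (4s), version (H), snapshot_type (B), field_count (H),
# content_hash (32s), simhash (Q) — 49 bytes before the minhash block
HEADER_FMT = "<4sHBH32sQ"
HEADER_FIXED = struct.calcsize(HEADER_FMT)

def parse_snapshot_header(buf: bytes) -> dict:
    """Unpack the 561-byte SnapshotHeader described above (illustrative)."""
    magic, version, snap_type, field_count, content_hash, simhash = \
        struct.unpack_from(HEADER_FMT, buf)
    assert magic == b"SNAP", "not a SNAP snapshot header"
    minhash = struct.unpack_from("<128I", buf, HEADER_FIXED)  # 128 × uint32 = 512 bytes
    return {
        "version": version,
        "snapshot_type": snap_type,
        "field_count": field_count,
        "content_hash": content_hash.hex(),
        "simhash": simhash,
        "minhash": list(minhash),
    }
```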
Storage:

```sql
CREATE TABLE snapshot_notebooks (
    snapshot_id   TEXT PRIMARY KEY,
    run_id        TEXT NOT NULL,
    project_id    TEXT NOT NULL,
    snapshot_type TEXT NOT NULL,
    source_file   TEXT NOT NULL,
    binary_data   BYTEA NOT NULL,    -- Binary-packed snapshot
    source_hash   TEXT,
    content_hash  TEXT,              -- SHA-256 hex
    simhash       BIGINT,            -- 64-bit similarity hash
    minhash       TEXT,              -- 128 × 32-bit MinHash (CSV)
    created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

| Hash Type | Size | Purpose |
|---|---|---|
| `source_hash` | SHA-256 | File content hash (deduplication) |
| `content_hash` | SHA-256 | Extracted content hash (change detection) |
| `simhash` | 64-bit | Similarity fingerprint (near-duplicate detection) |
| `minhash` | 128 × 32-bit | Set similarity (document comparison) |
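To illustrate what a 64-bit simhash captures, here is a generic sketch of the technique, not SNAP's implementation: token hashes vote per bit, so near-identical texts land a small Hamming distance apart.

```python
import hashlib

def simhash64(text: str) -> int:
    """64-bit similarity fingerprint: similar texts get close hashes."""
    weights = [0] * 64
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

a = simhash64("the quick brown fox jumps over the lazy dog")
b = simhash64("the quick brown fox jumped over the lazy dog")
print(hamming(a, b))  # small distance: the texts are near-duplicates
```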
Versioning logic for a newly ingested file:

1. Calculate `source_hash`
2. Query DB for an existing snapshot with the same `source_file` + `source_hash`
3. If it exists → skip (deduplication)
4. If not → create a new snapshot (versioning)
5. Multiple versions coexist in DB (query by `run_id` or latest)
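The dedup check above can be sketched against an in-memory SQLite table. This is illustrative only; the real logic lives in `snapshot_repo.py` with a different schema:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snaps (source_file TEXT, source_hash TEXT, run_id TEXT)")

def ingest(source_file: str, content: bytes, run_id: str) -> bool:
    """Return True if a new snapshot row was created, False if deduplicated."""
    source_hash = hashlib.sha256(content).hexdigest()            # step 1
    row = conn.execute(
        "SELECT 1 FROM snaps WHERE source_file = ? AND source_hash = ?",
        (source_file, source_hash)).fetchone()                   # step 2
    if row:
        return False                                             # step 3: duplicate → skip
    conn.execute("INSERT INTO snaps VALUES (?, ?, ?)",
                 (source_file, source_hash, run_id))             # step 4: new version
    return True

assert ingest("a.py", b"x = 1", "run1") is True
assert ingest("a.py", b"x = 1", "run2") is False   # same content → dedup
assert ingest("a.py", b"x = 2", "run2") is True    # changed content → new version
```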
## Configuration

Environment variables use the `SNAP_` prefix.
**Database:**

| Variable | Default | Description |
|---|---|---|
| `SNAP_DB_MODE` | `sqlite` | Database mode: `sqlite`, `postgres`, or `dual` |
| `SNAP_POSTGRES_DSN` | (required for postgres/dual) | PostgreSQL connection string |
| `SNAP_SQLITE_PATH` | `data/snap.db` | SQLite database path |
**Paths:**

| Variable | Default | Description |
|---|---|---|
| `SNAP_DATA_DIR` | `data/` | Base data directory |
| `SNAP_STAGING_DIR` | `data/staging/` | File staging (auto-ingest) |
| `SNAP_REPOS_DIR` | `data/repos/` | GitHub clones (cleared after ingest) |
**Logging & Git:**

| Variable | Default | Description |
|---|---|---|
| `SNAP_LOG_LEVEL` | `INFO` | Logging level |
| `SNAP_LOG_JSON` | `true` | JSON-formatted logs |
| `SNAP_GIT_CLONE_DEPTH` | `1` | Shallow clone depth |
| `SNAP_GIT_CLONE_TIMEOUT_SECONDS` | `600` | Git clone timeout (seconds) |
**Parser limits:**

| Variable | Default | Description |
|---|---|---|
| `SNAP_PARSER_LIMITS_SOFT_CAP_LOC` | 1,500 | Code warning threshold (LOC) |
| `SNAP_PARSER_LIMITS_HARD_CAP_LOC` | 5,000 | Code reject threshold (LOC) |
| `SNAP_PARSER_LIMITS_SOFT_CAP_BYTES` | 500,000 | Text warning threshold (bytes) |
| `SNAP_PARSER_LIMITS_HARD_CAP_BYTES` | 10,000,000 | Text reject threshold (bytes) |
**Authentication:**

| Variable | Default | Description |
|---|---|---|
| `SNAP_AUTH_ENABLED` | `false` | Enable JWT/OAuth authentication |
| `SNAP_AUTH_JWT_SECRET` | (empty) | Secret for JWT signing |
| `SNAP_AUTH_GITHUB_CLIENT_ID` | (empty) | GitHub OAuth app client ID |
| `SNAP_AUTH_GITHUB_CLIENT_SECRET` | (empty) | GitHub OAuth app client secret |
Stdio mode (Claude Code) is never affected by auth settings.
## Logging

SNAP writes structured JSON logs to three destinations simultaneously.
| File | Level | Rotation | Notes |
|---|---|---|---|
| `data/logs/app.log` | WARNING+ | None | Plain FileHandler — VSCode-safe, always readable |
| `data/logs/app_debug.log` | INFO+ | 5 MB × 3 | RotatingFileHandler — full debug trail |
| stderr | all levels | — | MCP-compatible; required for stdio transport |
Controlled by `SNAP_LOG_JSON` (default `true`). Each line is a JSON object:

```json
{"ts": "2026-02-15 12:00:00,000", "level": "INFO", "name": "snap", "msg": "Snapshot created", "snapshot_id": "...", "project_id": "...", "snapshot_type": "functions", "parser": "tree_sitter", "fields_count": 12}
```

Set `SNAP_LOG_JSON=false` for human-readable output:

```
2026-02-15 12:00:00,000 INFO snap Snapshot created
```
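A minimal formatter producing that line shape: a sketch of the idea, not SNAP's `logger.py` (the `fields` attribute used to carry structured extras is an assumption):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        line = {"ts": self.formatTime(record), "level": record.levelname,
                "name": record.name, "msg": record.getMessage()}
        line.update(getattr(record, "fields", {}))   # structured extras, e.g. snapshot_id
        return json.dumps(line)

logger = logging.getLogger("snap")
handler = logging.StreamHandler(sys.stderr)          # stderr keeps stdout clean for MCP stdio
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Snapshot created", extra={"fields": {"snapshot_type": "functions"}})
```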
| Event | Level | Key Fields |
|---|---|---|
| File parsed | DEBUG | path, tag, size, language, parse_duration_ms, snapshots_created, parsers |
| Snapshot created | INFO | snapshot_id, snapshot_type, parser, fields_count |
| File categorized | INFO / WARNING / ERROR | path, size, tag, reason |
| Repo processing complete | INFO | files_processed, snapshots_created, snapshot_types_summary, parsers_summary, total_duration_ms |
| Tag | Level | Meaning |
|---|---|---|
| `normal` | INFO | Within soft cap — processed normally |
| `large` | WARNING | Exceeds SOFT_CAP_LOC / SOFT_CAP_BYTES — processed with warning |
| `potential_god` | WARNING | Suspected god file — processed with warning |
| `rejected` | ERROR | Exceeds HARD_CAP_LOC / HARD_CAP_BYTES — skipped |
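Tag assignment for code files can be sketched from the parser-limit defaults above. This is an assumed reconstruction of the cap logic; the `potential_god` heuristic is not documented here and is omitted:

```python
SOFT_CAP_LOC, HARD_CAP_LOC = 1_500, 5_000   # SNAP_PARSER_LIMITS_* defaults

def categorize_code_file(loc: int) -> str:
    """Map a file's line count to the tag logged by the categorizer."""
    if loc > HARD_CAP_LOC:
        return "rejected"    # ERROR: over the hard cap, skipped
    if loc > SOFT_CAP_LOC:
        return "large"       # WARNING: over the soft cap, processed with warning
    return "normal"          # INFO: processed normally

assert categorize_code_file(300) == "normal"
assert categorize_code_file(2_000) == "large"
assert categorize_code_file(9_000) == "rejected"
```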
## Architecture

GitHub:

```
clone_to_repos(repo_url)
    ↓ clone completes synchronously
repos/{project_id}/
    ↓ auto-ingest background thread
ingest_cloned_repo()  ← security filtering, file enumeration
    ↓
file_router → parsers → field_mapper → snapshot_builder → DB
    ↓
repos/ cleared
```

Local:

```
copy_to_staging(source_path)
    ↓ stage_directory() filters and copies
staging/{project_id}/
    ↓ auto-ingest background thread
process_project()  ← security filtering, file enumeration
    ↓
file_router → parsers → field_mapper → snapshot_builder → DB
    ↓
staging/ cleared
```
```
SNAP/
├── app/
│   ├── admin.py                 # Admin CLI (human-only: delete, upload, clone, list)
│   ├── main.py                  # Orchestration pipeline
│   ├── config/
│   │   └── settings.py
│   ├── extraction/
│   │   ├── binary_packer.py     # Binary snapshot packer/unpacker (Nim-compatible)
│   │   ├── field_mapper.py      # Maps parser output to snapshot types
│   │   └── snapshot_builder.py
│   ├── ingest/
│   │   ├── file_router.py       # Routes files to parsers by extension
│   │   ├── github_cloner.py     # Shallow clone → repos/
│   │   └── local_loader.py      # stage_directory() + staging helpers
│   ├── logging/
│   │   └── logger.py
│   ├── mcp/
│   │   ├── auth.py              # JWT + GitHub OAuth
│   │   ├── run.py               # Entry point: stdio or HTTP+SSE
│   │   ├── security.py          # Input validation, path traversal prevention
│   │   ├── server.py            # MCP server, tool registry, Starlette app
│   │   └── tools.py             # Tool handlers + permission enforcement
│   ├── parsers/
│   │   ├── nim_parser.nim       # Native parser: doc, csv, config formats (compiled to binary)
│   │   ├── nim_parser.py        # Python wrapper for Nim parser
│   │   ├── pre_converter.nim    # Pre-processing helper for Nim parser
│   │   ├── semgrep_parser.py
│   │   └── tree_sitter_parser.py
│   ├── schemas/
│   │   ├── master_notebook.yaml
│   │   └── snapshot_templates/  # JSON templates (defined and gated by master_notebook.yaml)
│   ├── security/
│   │   └── network_policy.py
│   └── storage/
│       ├── db.py
│       └── snapshot_repo.py     # CRUD, upsert, run versioning
├── data/
│   ├── logs/
│   ├── projects/                # Project manifests
│   ├── repos/                   # GitHub clones (cleared after ingest)
│   └── staging/                 # Local file staging (cleared after ingest)
├── docker/
│   └── Dockerfile
├── docker-compose.yml
├── pyproject.toml
├── run_mcp.bat
└── run_mcp.sh
```
## Security

- No raw file reads — LLM reads only structured DB snapshots
- No ingest — SNAP ingests and parses; LLM never touches files
- No delete/write rights — `delete_project`, `promote_run` raise immediately
- project_id locked — derived from repo URL on clone; LLM cannot supply or rename
- vendor_id restricted — alphanumeric + `_@.-` only, max 64 chars; blocks injection chars
- Runtime enforcement — `ALLOWED_TOOLS` / `NOT_ALLOWED_TOOLS` frozensets checked at handler entry
- Project ID: `^[a-zA-Z0-9_-]{3,64}$`
- Vendor ID: `^[a-zA-Z0-9_@.\-]{1,64}$`
- Filenames: no path traversal (`..`, `\x00`, `~`), no backslash, reserved names blocked
- Repo URLs: HTTPS GitHub URLs only
- Symlinks: rejected at staging time
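Those rules compile directly to checks like the following sketch. The patterns come from the list above; the actual enforcement lives in `app/mcp/security.py` and may check more:

```python
import re

PROJECT_ID = re.compile(r"^[a-zA-Z0-9_-]{3,64}$")
VENDOR_ID = re.compile(r"^[a-zA-Z0-9_@.\-]{1,64}$")

def safe_filename(name: str) -> bool:
    """Reject path traversal, NUL bytes, home expansion, and backslashes."""
    return not any(bad in name for bad in ("..", "\x00", "~", "\\"))

def safe_repo_url(url: str) -> bool:
    """HTTPS GitHub URLs only."""
    return url.startswith("https://github.com/")

assert PROJECT_ID.match("snap_project")
assert not PROJECT_ID.match("ab")                      # too short
assert VENDOR_ID.match("user@example.com")
assert not safe_filename("../etc/passwd")
assert safe_repo_url("https://github.com/org/repo")
assert not safe_repo_url("git@github.com:org/repo")
```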
All filtering enforced by SNAP at copy time — LLM has no role.
Pruned directories (never traversed):
node_modules · .git · .svn · .hg · __pycache__ · .venv · venv · .next · .nuxt · .expo · .gradle · build · dist · target · Pods · .terraform · vendor
Ignored file patterns:
| Category | Patterns |
|---|---|
| Secrets / credentials | *.pem, *.key, *.p12, .env, .env.*, *.token, serviceAccountKey.json |
| Cloud configs | .aws/, .azure/, .gcloud/ |
| Build artifacts | *.min.js, *.min.css, *.pyc, *.class, *.so, *.dll, *.exe |
| Coverage / logs | coverage/, *.log, *.lock |
`app/schemas/master_notebook.yaml` is the single source of truth for all snapshot types and field definitions.

- Template validation — `SnapshotBuilder` validates every template file against the master notebook at startup. Templates not registered in `snapshot_templates` are rejected and never run.
- Field validation — any field in a template not registered in `field_id_registry` causes the entire template to be rejected.
- MCP query validation — `validate_snapshot_type` reads valid types directly from the master notebook at runtime. No hardcoded lists.
- Prompt injection — 30+ patterns blocked: instruction overrides, role hijacking, jailbreak triggers, exfiltration probes
- Secret redaction — AWS keys, GitHub tokens, JWTs, API keys auto-redacted in all field values
- AST-level filtering — tree-sitter nodes scanned for imperative patterns; flagged as `[FILTERED:IMPERATIVE]`
- Content safety — high-entropy detection, base64 blocks, hex-encoded data flagged before DB insertion
## Requirements

The binary release has no install-time dependencies. All parsers and libraries are bundled.
| Requirement | Notes |
|---|---|
| OS | Windows x86-64 · Linux x86-64 · macOS |
| SQLite3 | Bundled — zero config |
| PostgreSQL | 14+ — optional, only if SNAP_DB_MODE=postgres |
Bundled in the binary: tree-sitter (all languages) · semgrep · nim_parser · all Python dependencies.
## Troubleshooting

- Logs must go to stderr (not stdout):

  ```python
  handler = logging.StreamHandler(sys.stderr)
  ```

- Use the wrapper script — Claude Code does not respect cwd:

  ```bat
  @echo off
  cd /d C:\Users\<username>\snap
  snap-mcp.exe %*
  ```

- Verify connection:

  ```bash
  claude mcp list   # snap: ... - ✓ Connected
  ```

- PostgreSQL mode: set the DSN in `.env`:

  ```ini
  SNAP_POSTGRES_DSN=postgresql://user:pass@localhost:5432/snap
  ```

SNAP auto-installs and upgrades semgrep on startup. If auto-install fails:

```
.venv\Scripts\python.exe -m pip install --upgrade semgrep
```

© CLL Automata