
Parseltongue

v0.9.6 - Test exclusion (90% token reduction) + single-binary architecture (80% disk reduction)

Parse your codebase into a queryable graph database. Get 2-5K token summaries instead of 500K+ token dumps. Query dependencies, find functions, analyze architecture, all without reading every file.

12 languages: Rust · Python · JavaScript · TypeScript · Go · Java · C · C++ · Ruby · PHP · C# · Swift


⚡ Quick Start (60 seconds)

1. Install (macOS)

curl -fsSL https://raw.githubusercontent.com/that-in-rust/parseltongue/main/parseltongue-install-v096.sh | bash

2. Index Your Codebase

./parseltongue pt01-folder-to-cozodb-streamer . --db "rocksdb:mycode.db"

What happens: Parses all code files → Extracts functions, structs, dependencies → Stores in graph database

Example output:

Running Tool 1: folder-to-cozodb-streamer
Starting directory streaming...

Streaming Summary:
Total files found: 142
Files processed: 138
Entities created: 1,247 (CODE only)
  └─ CODE entities: 1,247
  └─ TEST entities: 3,821 (excluded for optimal LLM context)
Errors encountered: 0
Duration: 2.1s

✓ Tests intentionally excluded from ingestion for optimal LLM context
✓ Indexing completed

You now have: All functions, structs, traits, and their relationships in a queryable database.

3. Get Dependency Graph (2-5K tokens)

./parseltongue pt02-level00 --where-clause "ALL" --output deps.json --db "rocksdb:mycode.db"

What you get: A pure edge list showing who calls what, perfect for understanding architecture.

Example deps.json:

{
  "dependency_count": 487,
  "dependencies": [
    {
      "caller_id": "parseltongue_core::parse_file",
      "callee_id": "tree_sitter::Parser::parse",
      "relationship_type": "calls"
    },
    {
      "caller_id": "streamer::FileStreamer::stream_directory",
      "callee_id": "parseltongue_core::parse_file",
      "relationship_type": "calls"
    }
  ]
}

Token count: ~2-5K tokens (vs 500K+ for full codebase dump)
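
Because the export is plain JSON, you can post-process it with standard tools. For example, listing every caller of a given function with jq (assuming jq is installed; the field names match the example above):

# List all callers of parseltongue_core::parse_file from the edge list
jq -r '.dependencies[] | select(.callee_id == "parseltongue_core::parse_file") | .caller_id' deps.json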


🎯 What You Just Got

Before Parseltongue:

  • Dump 50,000 lines of code as text → 500K+ tokens
  • LLM sees everything, understands nothing
  • Context window explodes

After Parseltongue:

  • Query graph database → Get exactly what you need
  • 2-5K tokens for dependency overview
  • 30K tokens for full entity signatures
  • LLMs get architectural context, not noise

Real metrics (parseltongue-core codebase):

  • Files: 29 Rust files
  • Entities found: 1,668 total (1,247 code + 421 tests)
  • Entities ingested: 1,247 (tests excluded - 75% token reduction)
  • Ingestion time: 2.1 seconds
  • Query time: < 50μs (microseconds!)

Visual: How Parseltongue Works

graph LR
    subgraph INPUT["Your Codebase"]
        C1["50,000 LOC<br/>150 files<br/>12 languages"]
    end

    subgraph PARSE["PT01: Index"]
        P1["tree-sitter<br/>parse"]
        P2["Extract<br/>entities"]
        P3["Build<br/>graph"]
        P1 --> P2 --> P3
    end

    subgraph DATABASE["CozoDB Graph"]
        D1["Functions<br/>Structs<br/>Traits"]
        D2["Caller→Callee<br/>relationships"]
        D3["Type info<br/>Signatures"]
        D1 -.-> D2 -.-> D3
    end

    subgraph QUERY["PT02: Export"]
        Q1["Level 0<br/>2-5K tokens"]
        Q2["Level 1<br/>30K tokens"]
        Q3["Level 2<br/>60K tokens"]
    end

    subgraph OUTPUT["LLM Context"]
        O1["Architecture<br/>understanding"]
        O2["Precise<br/>reasoning"]
    end

    INPUT ==> PARSE
    PARSE ==> DATABASE
    DATABASE ==> QUERY
    QUERY ==> OUTPUT

    style Q1 fill:#90EE90
    style DATABASE fill:#87CEEB

Visual flow: Parse codebase once → Query graph many times → Get exactly what you need


📊 Progressive Disclosure: Choose Your Detail Level

Get exactly the context you need: nothing more, nothing less.

Level 0: Pure Edges (2-5K tokens) - RECOMMENDED

./parseltongue pt02-level00 --where-clause "ALL" --output edges.json --db "rocksdb:mycode.db"

Best for: Architecture overview, dependency mapping, "what calls what?"

Output: Edge list (caller → callee relationships)


Level 1: Entity Signatures (30K tokens)

./parseltongue pt02-level01 --where-clause "ALL" --output entities.json --db "rocksdb:mycode.db"

Best for: Understanding interfaces, finding functions by name, analyzing API surface

Output: Function signatures, struct definitions, trait declarations

Example:

{
  "entity_id": "streamer::FileStreamer::stream_directory",
  "entity_name": "stream_directory",
  "entity_type": "function",
  "file_path": "./src/streamer.rs",
  "interface_signature": "pub async fn stream_directory(&self) -> Result<StreamingStats>"
}
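
As with the edge export, entity records are easy to slice with jq. A minimal sketch, assuming the output file is a JSON array of records shaped like the example above:

# Pull the signature of a single entity by name
jq '.[] | select(.entity_name == "stream_directory") | .interface_signature' entities.json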

Level 2: Full Type System (60K tokens)

./parseltongue pt02-level02 --where-clause "ALL" --output typed.json --db "rocksdb:mycode.db"

Best for: Deep type analysis, generic bounds, trait implementations

Output: Everything from Level 1 + type parameters, where clauses, trait bounds

Visual: Progressive Disclosure Levels

graph TB
    subgraph L0["🎯 Level 0: Pure Edges (2-5K tokens)"]
        L0A["caller_id β†’ callee_id<br/>relationship_type"]
        L0B["Example: 487 dependencies<br/>Use: Architecture overview"]
    end

    subgraph L1["πŸ“‹ Level 1: Entity Signatures (30K tokens)"]
        L1A["+ Function signatures<br/>+ Struct definitions<br/>+ File paths"]
        L1B["Example: 1,247 entities<br/>Use: Interface understanding"]
    end

    subgraph L2["πŸ”¬ Level 2: Full Type System (60K tokens)"]
        L2A["+ Type parameters<br/>+ Generic bounds<br/>+ Where clauses"]
        L2B["Example: Complete type info<br/>Use: Deep analysis"]
    end

    L0 -.->|Add signatures| L1
    L1 -.->|Add type system| L2

    style L0 fill:#90EE90
    style L1 fill:#FFE4B5
    style L2 fill:#FFB6C1

Choose based on need: Start with Level 0 (fastest, lowest token count), move up only if you need more detail.


πŸ” Common Queries

Find All Functions

./parseltongue pt02-level01 \
  --where-clause "entity_type = 'function'" \
  --output functions.json \
  --db "rocksdb:mycode.db"

Find Dependencies of Specific Module

./parseltongue pt02-level00 \
  --where-clause "caller_id CONTAINS 'streamer'" \
  --output streamer_deps.json \
  --db "rocksdb:mycode.db"

Get Public API Surface

./parseltongue pt02-level01 \
  --where-clause "interface_signature CONTAINS 'pub'" \
  --output public_api.json \
  --db "rocksdb:mycode.db"

🚀 Real-World Example: Onboarding to a New Codebase

Scenario: You just joined a Rust project. 150,000 lines of code. Where do you start?

Step 1: Index (30 seconds)

./parseltongue pt01-folder-to-cozodb-streamer . --db "rocksdb:onboard.db"

Result:

Entities created: 8,423 (CODE only)
Duration: 12.3s

Step 2: Get Architecture Overview (5K tokens)

./parseltongue pt02-level00 --where-clause "ALL" --output arch.json --db "rocksdb:onboard.db"

You now know:

  • All module dependencies
  • Critical paths (most-called functions)
  • Architectural layers

Time to value: 42 seconds. Token cost: 5K (not 500K).

Step 3: Understand Entry Points

./parseltongue pt02-level01 \
  --where-clause "entity_name = 'main'" \
  --output entry.json \
  --db "rocksdb:onboard.db"

Result: Every main() function signature and location.

Step 4: Ask LLM

Feed arch.json + entry.json to Claude/GPT:

  • "What are the main architectural layers?"
  • "Which modules handle database access?"
  • "What's the critical path for user requests?"

Context used: 8K tokens (not 500K). LLM gives accurate architectural answers.
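
One simple way to package the two exports into a single prompt file, using only standard shell (adjust the question to taste):

# Combine both exports plus a question into one prompt
{
  echo "Dependency graph and entry points of a Rust codebase follow."
  echo "=== arch.json ==="
  cat arch.json
  echo "=== entry.json ==="
  cat entry.json
  echo "Question: What are the main architectural layers?"
} > prompt.txt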

Visual: Complete 5-Phase Workflow

graph LR
    subgraph PHASE1["Phase 1: Index"]
        P1["PT01<br/>folder-to-cozodb-streamer"]
        P1A["Parse all code<br/>Build graph database"]
    end

    subgraph PHASE2["Phase 2: Export"]
        P2["PT02<br/>level00/01/02"]
        P2A["Query graph<br/>Export JSON"]
    end

    subgraph PHASE3["Phase 3: Edit"]
        P3["PT03<br/>llm-to-cozodb-writer"]
        P3A["LLM proposes changes<br/>Mark as 'future' state"]
    end

    subgraph PHASE4["Phase 4: Validate"]
        P4A["PT04: Syntax check"]
        P4B["PT05: Generate diff"]
        P4A --> P4B
    end

    subgraph PHASE5["Phase 5: Apply"]
        P5["PT06<br/>make-future-current"]
        P5A["Update ISG state<br/>Reset temporal markers"]
    end

    PHASE1 --> PHASE2
    PHASE2 --> PHASE3
    PHASE3 --> PHASE4
    PHASE4 -->|Valid| PHASE5
    PHASE4 -.->|Invalid| PHASE3

    style PHASE1 fill:#90EE90
    style PHASE2 fill:#87CEEB
    style PHASE4 fill:#FFB6C1

Typical usage: Phase 1 once → Phase 2 many times (as needed for different queries)

Advanced usage: Phase 1-5 for LLM-driven code modifications with validation


🤖 Claude Code Agent Integration

After installation, restart Claude Code and invoke:

@parseltongue-ultrathink-isg-explorer

The agent uses Parseltongue to:

  • Answer "how does X work?" queries via database, not filesystem
  • Find functions/structs by name instantly
  • Map dependencies without reading every file
  • Generate architecture diagrams from graph data

Token efficiency: 99% reduction vs grep/glob approaches (Stanford TACL 2023 research)


💡 What Makes This Different?

Traditional Approach (Broken)

# Dump everything as text
cat src/**/*.rs > context.txt

# Result:
# - 500,000+ tokens
# - No structure
# - LLM gets lost in noise
# - Context window explosion

Parseltongue Approach (Works)

# Parse into graph database
./parseltongue pt01-folder-to-cozodb-streamer . --db "rocksdb:code.db"

# Query exactly what you need
./parseltongue pt02-level00 --where-clause "entity_type = 'function'" --output funcs.json

# Result:
# - 2-60K tokens (your choice)
# - Structured data
# - LLM understands architecture
# - Queries in microseconds

The difference: Semantic graph vs dumb text dump.


📦 Installation Options

Option 1: One-Line Install (Recommended)

curl -fsSL https://raw.githubusercontent.com/that-in-rust/parseltongue/main/parseltongue-install-v096.sh | bash

Includes:

  • parseltongue binary (49MB, v0.9.6)
  • Claude Code agent (.claude/agents/)
  • Documentation (.claude/.parseltongue/README.md)

Option 2: Manual Download

cd /path/to/your/project

# Download v0.9.6 binary
curl -L https://github.com/that-in-rust/parseltongue/releases/download/v0.9.6/parseltongue -o parseltongue
chmod +x parseltongue

# Verify
./parseltongue --version
# Output: parseltongue 0.9.6

Option 3: Build from Source

git clone https://github.com/that-in-rust/parseltongue.git
cd parseltongue
cargo build --release

# Binary at: target/release/parseltongue

🔧 All Commands Reference

PT01: Index Codebase

./parseltongue pt01-folder-to-cozodb-streamer <dir> --db "rocksdb:<name>.db"

Purpose: Parse codebase into graph database
Time: ~1-15 seconds depending on size
Output: CozoDB database with all entities and relationships


PT02-Level00: Get Dependency Edges (2-5K tokens)

./parseltongue pt02-level00 --where-clause "<query>" --output <file> --db "rocksdb:<name>.db"

Purpose: Export dependency graph (who calls what)
Best for: Architecture overview, dependency analysis
Token count: 2-5K


PT02-Level01: Get Entity Signatures (30K tokens)

./parseltongue pt02-level01 --where-clause "<query>" --output <file> --db "rocksdb:<name>.db"

Purpose: Export function signatures, struct definitions
Best for: Understanding interfaces, finding specific entities
Token count: ~30K


PT02-Level02: Get Full Type System (60K tokens)

./parseltongue pt02-level02 --where-clause "<query>" --output <file> --db "rocksdb:<name>.db"

Purpose: Export complete type information
Best for: Deep type analysis, generic bounds
Token count: ~60K


PT07: Visual Analytics

./parseltongue pt07 --db "rocksdb:<name>.db"

Purpose: Terminal visualizations (entity counts, dependency cycles)
Output: Unicode bar charts, cycle warnings


🎓 Understanding Where Clauses

Where clauses filter what you export from the database.

Common Patterns

Get everything:

--where-clause "ALL"

Filter by entity type:

--where-clause "entity_type = 'function'"
--where-clause "entity_type = 'struct'"
--where-clause "entity_type = 'trait'"

Filter by name:

--where-clause "entity_name = 'parse_file'"
--where-clause "entity_name CONTAINS 'stream'"

Filter by file path:

--where-clause "file_path CONTAINS 'src/streamer'"

Combine filters (CozoScript syntax):

--where-clause "entity_type = 'function' && interface_signature CONTAINS 'pub'"

📈 Performance Metrics

Tokio async runtime (real-world test):

  • Files: 150+
  • Entities: 2,576
  • Ingestion time: 0.24s
  • Query time: < 50μs

Query speed: Graph database queries are 10-100× faster than grep/find.
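
A rough way to sanity-check that claim on your own machine (unscientific; the CONTAINS filter on callee_id is assumed to behave like the caller_id examples above):

# Text search over the working tree
time grep -rn "parse_file" src/ > /dev/null

# Graph query over the indexed database
time ./parseltongue pt02-level00 --where-clause "callee_id CONTAINS 'parse_file'" --output /tmp/q.json --db "rocksdb:mycode.db"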

Token efficiency:

  • Full text dump: ~500K tokens
  • Level 0 (edges): ~2-5K tokens (99% reduction)
  • Level 1 (signatures): ~30K tokens (94% reduction)
  • Level 2 (types): ~60K tokens (88% reduction)

🧠 How It Works: Interface Signature Graphs (ISG)

What is an ISG?

A semantic representation of your codebase as a graph:

graph LR
    subgraph PROBLEM["Context Window Problem"]
        P1["50,000 LOC<br/>codebase"]
        P2["Dump as<br/>text"]
        P3["500K+<br/>tokens"]
        P1 --> P2 --> P3
    end

    subgraph ISG["Interface Signature Graph"]
        I1["Parse codebase<br/>into graph"]
        I2["Unique IDs<br/>Dependencies<br/>Metadata<br/>Relationships"]
        I1 --> I2
    end

    subgraph LEVELS["Progressive Disclosure"]
        L0["Level 0<br/>Pure Edges<br/>2-5K tokens<br/>RECOMMENDED"]
        L1["Level 1<br/>Signatures<br/>~30K tokens"]
        L2["Level 2<br/>+ Type System<br/>~60K tokens"]
        L0 -.-> L1 -.-> L2
    end

    subgraph OUTCOME["LLM-Friendly Context"]
        O1["2-60K tokens<br/>(vs 500K+)"]
        O2["Architectural<br/>reasoning"]
        O3["Precise code<br/>changes"]
        O1 --> O2 --> O3
    end

    PROBLEM ==> ISG
    ISG ==> LEVELS
    LEVELS ==> OUTCOME

ISG Components

  1. Unique Interface Identifiers: Every function, struct, trait gets a stable ID
  2. Dependency Relationships: Explicit caller → callee mappings
  3. Rich Metadata: Type signatures, file paths, line numbers
  4. Queryable Graph: CozoDB (RocksDB-backed) for microsecond queries; a query sketch follows below
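
CozoDB speaks CozoScript, a Datalog dialect. A purely illustrative sketch of what an edge query could look like; the relation and column names here are hypothetical, not Parseltongue's actual schema:

?[caller_id, callee_id] := *dependencies{caller_id, callee_id}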

Why It Works

Traditional approach (broken):

  • Dump all code as text → 500K tokens
  • No structure, no queryability
  • LLM drowns in noise

ISG approach (works):

  • Parse code semantically → Graph database
  • Query exactly what you need → 2-60K tokens
  • LLM sees architecture, not implementation details

🎯 Use Cases

Visual: Pick Your Use Case

graph TB
    START{What do you<br/>need to do?}

    START -->|Understand new codebase| UC1[Level 0 Export<br/>2-5K tokens<br/>Architecture overview]
    START -->|Find specific function| UC2[Level 1 Export<br/>Filter by name<br/>Get signature]
    START -->|What calls this?| UC3[Level 0 Export<br/>Filter caller_id<br/>Reverse dependencies]
    START -->|What's the public API?| UC4[Level 1 Export<br/>Filter 'pub'<br/>API surface]
    START -->|Deep type analysis| UC5[Level 2 Export<br/>Full type system<br/>Generic bounds]
    START -->|Visualize dependencies| UC6[PT07<br/>Terminal analytics<br/>Bar charts]

    UC1 -->|Time| T1[42 seconds]
    UC2 -->|Time| T2[< 1 second]
    UC3 -->|Time| T3[< 1 second]
    UC4 -->|Time| T4[< 1 second]
    UC5 -->|Time| T5[< 1 second]
    UC6 -->|Time| T6[< 1 second]

    style UC1 fill:#90EE90
    style T1 fill:#FFE4B5

Quick decision: If unsure, start with Level 0 (2-5K tokens, architecture overview).

1. New Codebase Onboarding

Problem: 150K lines of unfamiliar code
Solution: Index → Export Level 0 → Feed to LLM → Get architecture summary
Time: 60 seconds, 5K tokens

2. Dependency Analysis

Problem: "What depends on this function?" Solution: Query graph for reverse dependencies Time: < 1 second

3. API Surface Analysis

Problem: "What's our public API?" Solution: --where-clause "interface_signature CONTAINS 'pub'" Time: < 1 second

4. Refactoring Planning

Problem: "What breaks if I change this?" Solution: Query dependency graph, find blast radius Time: < 1 second

5. LLM-Assisted Development

Problem: Need architectural context for code changes
Solution: Export Level 1, feed to LLM with prompt
Result: Accurate answers with 94% fewer tokens


🔬 Technical Details

Multi-Language Support

Query-based entity extraction using tree-sitter .scm query files:

Supported:

  • Rust, Python, JavaScript, TypeScript
  • Go, Java, C, C++
  • Ruby, PHP, C#, Swift

How it works:

  1. Parse file with tree-sitter (gets AST)
  2. Run .scm queries to find entities (functions, structs, etc.; a sample query is sketched below)
  3. Extract metadata (name, signature, location)
  4. Store in graph database with relationships
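
To make step 2 concrete, here is a minimal .scm query of the kind tree-sitter accepts for Rust; the capture names are illustrative conventions, not necessarily the ones Parseltongue ships:

; Match every Rust function item and capture its name
(function_item
  name: (identifier) @function.name) @function.definition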

Dual-Format Export

Every export generates two files automatically:

JSON (tool compatibility):

{
  "entity_id": "module::function",
  "entity_type": "function"
}

TOON (30-40% token reduction):

entity_id	entity_type
module::function	function

TOON (Tab-Oriented Object Notation) uses tabs instead of JSON syntax: same data, fewer tokens.
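
The TOON file is generated automatically, but the idea is easy to reproduce. A sketch with jq's built-in @tsv filter, again assuming an array of entity records as in the Level 1 example:

# Emit tab-separated rows: same data as the JSON, fewer tokens
jq -r '.[] | [.entity_id, .entity_type] | @tsv' entities.json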

Test Exclusion (v0.9.6)

Tests are detected but excluded from database:

Detection heuristics (the path check is sketched below):

  • File paths: tests/, *_test.rs, test_*.py
  • Attributes: #[test], #[tokio::test], @pytest.mark
  • Content: unittest, describe(, it(
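
A rough shell equivalent of the path check alone (the attribute and content checks happen inside the parser and are not reproduced here):

# Files the path heuristic would flag as tests
find . -path '*/tests/*' -o -name '*_test.rs' -o -name 'test_*.py'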

Why exclude:

  • Tests pollute architectural context
  • 75-90% token reduction
  • CODE entities only = cleaner LLM context

Example output:

Entities created: 1,247 (CODE only)
  └─ CODE entities: 1,247
  └─ TEST entities: 3,821 (excluded for optimal LLM context)

πŸ› Troubleshooting

Binary Won't Download

# Check you're in a git repository
ls .git
# If not: git init

# Try manual download
curl -L https://github.com/that-in-rust/parseltongue/releases/download/v0.9.6/parseltongue -o parseltongue
chmod +x parseltongue

Database Access Errors

# Delete and recreate database
rm -rf mycode.db/
./parseltongue pt01-folder-to-cozodb-streamer . --db "rocksdb:mycode.db"

No Entities Found

# Check file extensions are supported
./parseltongue pt01-folder-to-cozodb-streamer . --db "rocksdb:test.db" --verbose

# Verify .rs, .py, .js, .ts, etc. files exist in the directory
find . -name '*.rs' | head

📚 Additional Resources

Release notes: https://github.com/that-in-rust/parseltongue/releases/tag/v0.9.6
Issues: https://github.com/that-in-rust/parseltongue/issues
Source code: https://github.com/that-in-rust/parseltongue


🎉 What's New in v0.9.6

Test Exclusion (90% token reduction)

Tests are detected but excluded from ingestion:

  • Before: 483 total entities (code + tests mixed)
  • After: 57 CODE entities (tests excluded)
  • Result: 88% entity reduction, 33.6K token savings

Single Binary Architecture (80% disk reduction)

  • Before: 10 binaries (240MB)
  • After: 1 binary (49MB)
  • All pt01-pt07 commands: Accessible via single parseltongue binary

Zero Breaking Changes

All existing commands work identically. Update with confidence.


πŸ“ License

MIT OR Apache-2.0
