From 35b6a27d0ca8c580c0c92187452dbe0972a63521 Mon Sep 17 00:00:00 2001 From: Nous Date: Sat, 4 Apr 2026 18:18:43 +0000 Subject: [PATCH] =?UTF-8?q?F035.4:=20Context=20Visibility=20=E2=80=94=20wh?= =?UTF-8?q?at=20the=20model=20actually=20sees?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit New observability sub-spec for full LLM context window transparency: - ContextLogEntry with token breakdown by system prompt section - Full payload ring buffer (optional, bounded) - REST endpoints: /context/log, /context/diff - Telegram /context command - Dashboard Context Inspector panel - Hooks into _build_api_payload() single choke point Updated F035 umbrella spec: - Added Layer 4 description - Added F035.4 to sub-specs table - Updated sequencing rationale (F035.4 is independent, parallelizable) - Added success criterion #6 for context visibility --- docs/features/F035-observability.md | 8 +- docs/features/F035.4-context-visibility.md | 493 +++++++++++++++++++++ 2 files changed, 499 insertions(+), 2 deletions(-) create mode 100644 docs/features/F035.4-context-visibility.md diff --git a/docs/features/F035-observability.md b/docs/features/F035-observability.md index 3b1c2cd..db04c18 100644 --- a/docs/features/F035-observability.md +++ b/docs/features/F035-observability.md @@ -32,7 +32,7 @@ Minsky's Chapter 6 ("Self-Knowledge is Dangerous") warns that unrestricted self- --- -## Architecture — Three Layers +## Architecture — Four Layers ``` ┌─────────────────────────────────────────────────────────┐ @@ -62,6 +62,8 @@ Minsky's Chapter 6 ("Self-Knowledge is Dangerous") warns that unrestricted self- **Layer 3 — Behavioral Drift Detection (F035.3):** Periodic snapshots of key metrics with trend analysis. Catches slow drift that individual events don't reveal. "Is the system changing in ways nobody intended?" +**Layer 4 — Context Visibility (F035.4):** Full transparency into what the LLM actually sees on every API call. 
Token breakdown by section, memory loading inventory, turn-over-turn diffs. "What did the model see when it made that choice?" + --- ## Sub-Specs @@ -71,8 +73,9 @@ Minsky's Chapter 6 ("Self-Knowledge is Dangerous") warns that unrestricted self- | F035.1 | Event Bus Observability | P1 | F006 | | F035.2 | Autonomous Action Audit Trail | P1 | F035.1 | | F035.3 | Behavioral Drift Detection | P2 | F035.2 | +| F035.4 | Context Visibility | P1 | F035 (umbrella) | -**Sequencing rationale:** F035.1 gives us the infrastructure (stats collection, endpoints). F035.2 adds causal metadata to events. F035.3 builds on both to detect trends. Each is independently useful. +**Sequencing rationale:** F035.1 gives us the infrastructure (stats collection, endpoints). F035.2 adds causal metadata to events. F035.3 builds on both to detect trends. F035.4 is independent — it hooks into the runner pipeline, not the event bus — and can be built in parallel with F035.1-3. Each sub-spec is independently useful. --- @@ -89,3 +92,4 @@ Minsky's Chapter 6 ("Self-Knowledge is Dangerous") warns that unrestricted self- 3. Behavioral trends (fact growth rate, censor changes, check frequency drift) are tracked and anomalies flagged 4. The dashboard has an "Autonomous Activity" panel showing recent self-modifications 5. None of this adds measurable latency to the hot path (event processing) +6. 
For any API call, you can see the full token breakdown by context section and which memory items were loaded diff --git a/docs/features/F035.4-context-visibility.md b/docs/features/F035.4-context-visibility.md new file mode 100644 index 0000000..67cf625 --- /dev/null +++ b/docs/features/F035.4-context-visibility.md @@ -0,0 +1,493 @@ +# F035.4 — Context Visibility (What the Model Actually Sees) + +**Status:** PROPOSED +**Priority:** P1 (critical for debugging prompt assembly) +**Depends on:** F035 (Observability umbrella) +**Author:** Nous + Tim +**Date:** 2026-04-04 + +--- + +## Problem + +Nous assembles a complex context window for every LLM API call. The system prompt alone is a composite of: + +- Static identity & character +- User profile (facts about Tim) +- Active censors +- Cognitive frame + frame-specific instructions +- Working memory (current task, loaded context items) +- Related decisions +- Relevant facts +- Recent conversations +- Execution ledger (prior tool calls this session) +- Diagnostic nudges (from F024 Critic) +- Pending corrections +- Platform-specific formatting rules (Telegram) + +On top of that, the **messages array** contains the full conversation history for the session — potentially including tool calls, tool results, and multi-turn context. + +Today there is **zero visibility** into what gets assembled. When something goes wrong — Nous doesn't use a fact it should know, repeats a tool call, or behaves oddly — debugging requires reading code and mentally reconstructing what the context must have looked like. There's no way to answer: + +- "What was in the system prompt on turn 3?" +- "How many tokens did the tools definition consume?" +- "Did the relevant fact about X actually get loaded?" +- "What changed between turn 5 and turn 6?" + +This is the difference between F035.1-3 (what the *system* did) and F035.4 (what the *model* was told). The other layers trace autonomous actions and system health. 
This layer traces the **information supply chain** to the LLM itself. + +--- + +## Design Philosophy + +The model's behavior is fully determined by its context window. If you can see exactly what went in, you can explain exactly what came out. This is the **most direct debugging tool possible** — not inference about what might have happened, but the actual artifact. + +The key tension is **storage cost vs diagnostic value**. Full prompt payloads are large (50K-200K tokens per call). We solve this with tiered retention: structured metadata always, full payloads on-demand. + +--- + +## Architecture + +``` +┌─────────────────────────────────────────────────────────┐ +│ Dashboard / API / Telegram │ +│ Browse by session → turn → API call │ +│ Token breakdown charts, context diff view │ +└───────────────┬───────────────────────────────────────────┘ + │ + ┌───────────▼────────────┐ + │ ContextLog Service │ + │ │ + │ • Intercepts payload │ + │ at _build_api_ │ + │ payload() exit │ + │ • Extracts metadata │ + │ • Stores structured │ + │ summary + optional │ + │ full payload │ + └───────────┬────────────┘ + │ + ┌───────────▼────────────┐ ┌─────────────────────┐ + │ context_log table │ │ Full Payload Store │ + │ (always written) │ │ (ring buffer, last │ + │ │ │ N calls per │ + │ • session_id │ │ session) │ + │ • turn_number │ │ │ + │ • timestamp │ │ JSON files or DB │ + │ • model │ │ blob, auto-pruned │ + │ • token breakdown │ └─────────────────────┘ + │ • sections present │ + │ • tool count │ + │ • messages count │ + │ • frame_id │ + │ • trace_id (F035.2) │ + └────────────────────────┘ +``` + +--- + +## Design + +### 1. ContextLogEntry — Structured Metadata (Always Captured) + +Every API call gets a lightweight metadata record. This is cheap to store and covers 90% of debugging needs. 
+ +```python +@dataclass +class ContextLogEntry: + id: str # UUID + session_id: str + turn_number: int + timestamp: str # ISO 8601 + call_type: str # "chat", "reflection", "compaction", "subtask", "critic" + model: str # e.g. "claude-sonnet-4-20250514" + frame_id: str # "task", "conversation", "decision", etc. + trace_id: str | None # F035.2 causal chain link + + # Token breakdown by section + token_breakdown: dict[str, int] # See section below + + # What was included + sections_present: list[str] # ["identity", "user_profile", "censors", "frame_instructions", ...] + tools_count: int # Number of tool definitions sent + tool_names: list[str] # Which tools were included + messages_count: int # Length of messages array + message_roles: dict[str, int] # {"user": 3, "assistant": 3, "tool": 2} + + # Working memory specifics + loaded_facts_count: int + loaded_decisions_count: int + loaded_procedures_count: int + loaded_episodes_count: int + recent_conversations_count: int + + # Context budget + total_tokens_estimated: int # Sum of all sections + context_window_size: int # Model's max context + utilization_pct: float # total / window size + + # Response metadata (filled after response) + input_tokens_actual: int | None # From API usage response + output_tokens: int | None + cache_creation_tokens: int | None + cache_read_tokens: int | None + duration_ms: float | None + stop_reason: str | None +``` + +### 2. Token Breakdown + +The most valuable single diagnostic: where are the tokens going? 
+ +```python +token_breakdown = { + "identity": 850, # Static identity + character + "user_profile": 420, # Facts about Tim + "censors": 380, # Active censor rules + "frame_instructions": 250, # Frame-specific tool guidance + "working_memory": 1200, # Current task + loaded context + "related_decisions": 600, # Past decisions + "relevant_facts": 450, # Memory facts + "recent_conversations": 300, # Episode summaries + "execution_ledger": 800, # Prior tool calls this session + "diagnostic_nudges": 150, # F024 Critic feedback + "corrections": 0, # Pending corrections + "telegram_format": 200, # Platform formatting rules + "tools_definition": 3500, # Tool JSON schemas + "messages": 12000, # Conversation history + "thinking_config": 0, # Thinking params (not tokens, but tracked) +} +``` + +Token estimation uses `len(text) / 4` as a fast approximation (no tokenizer dependency). The `input_tokens_actual` field from the API response provides the ground truth for calibration. + +### 3. Full Payload Capture (Optional, Ring Buffer) + +For deep debugging, optionally capture the complete API payload. This is large, so it's stored in a ring buffer — last N calls per session, auto-pruned. + +```python +class FullPayloadStore: + """Ring buffer for full API payloads. In-memory + optional disk spill.""" + + def __init__(self, max_per_session: int = 10, max_total: int = 50): + self._store: dict[str, deque[ContextPayload]] = {} + self._max_per_session = max_per_session + self._max_total = max_total + + def capture(self, session_id: str, entry_id: str, payload: dict) -> None: + """Store a full payload. Oldest auto-evicted when limit reached.""" + ... + + def get(self, entry_id: str) -> dict | None: + """Retrieve a full payload by context log entry ID.""" + ... + + def get_session(self, session_id: str) -> list[ContextPayload]: + """Get all captured payloads for a session (newest first).""" + ... 
+``` + +Configuration: +- `CONTEXT_LOG_FULL_PAYLOAD`: `true`/`false` (default: `false`) +- `CONTEXT_LOG_RING_SIZE`: max payloads per session (default: `10`) +- `CONTEXT_LOG_MAX_TOTAL`: global cap (default: `50`) + +### 4. Hooking Into the Pipeline + +The `_build_api_payload()` method in `runner.py` is the single choke point where all API calls are assembled. The hook goes here: + +```python +def _build_api_payload(self, system_prompt, messages, tools=None, ...): + # ... existing payload assembly ... + payload = { ... } + + # F035.4: Log context metadata + if self._context_logger: + self._context_logger.log( + session_id=self._current_session_id, + turn_number=self._current_turn, + call_type=call_type, + system_prompt=system_prompt, + messages=messages, + tools=tools, + payload=payload, + frame_id=self._current_frame_id, + ) + + return payload +``` + +The `ContextLogger.log()` method: +1. Parses the system prompt into sections (splits on known headers like `## Identity`, `## Active Censors`, etc.) +2. Estimates tokens per section +3. Creates a `ContextLogEntry` with structured metadata +4. Optionally captures the full payload in the ring buffer +5. Writes the entry to the DB (async, non-blocking — must not add latency to the API call path) + +After the API response returns, the entry is updated with actual usage tokens and duration. + +### 5. System Prompt Section Parser + +The system prompt is assembled from named sections. 
The parser identifies them for the token breakdown: + +```python +SECTION_MARKERS = { + "## Current Date/Time": "datetime", + "## Identity": "identity", + "## User Profile": "user_profile", + "## Context Safety": "context_safety", + "## Active Censors": "censors", + "## Current Frame": "frame", + "## Working Memory": "working_memory", + "## Related Decisions": "related_decisions", + "## Relevant Facts": "relevant_facts", + "## Recent Conversations":"recent_conversations", + "## Tool Instructions": "frame_instructions", + "[Execution Ledger]": "execution_ledger", + "[Previous Turn Corrections]": "corrections", + "## Output Formatting": "telegram_format", +} + +def parse_system_sections(system_prompt: str) -> dict[str, str]: + """Split the system prompt into named sections with their text content.""" + ... +``` + +### 6. Database Schema + +```sql +CREATE TABLE IF NOT EXISTS nous_system.context_log ( + id UUID PRIMARY KEY, + session_id TEXT NOT NULL, + turn_number INTEGER NOT NULL, + timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(), + call_type TEXT NOT NULL, -- chat, reflection, compaction, subtask, critic + model TEXT NOT NULL, + frame_id TEXT, + trace_id TEXT, -- F035.2 link + + -- Token breakdown (JSONB for flexibility) + token_breakdown JSONB NOT NULL, + total_tokens_est INTEGER NOT NULL, + context_window_size INTEGER NOT NULL, + utilization_pct REAL NOT NULL, + + -- Content inventory + sections_present TEXT[] NOT NULL, + tools_count INTEGER NOT NULL DEFAULT 0, + tool_names TEXT[], + messages_count INTEGER NOT NULL DEFAULT 0, + message_roles JSONB, + + -- Memory loading + loaded_facts INTEGER NOT NULL DEFAULT 0, + loaded_decisions INTEGER NOT NULL DEFAULT 0, + loaded_procedures INTEGER NOT NULL DEFAULT 0, + loaded_episodes INTEGER NOT NULL DEFAULT 0, + recent_conversations INTEGER NOT NULL DEFAULT 0, + + -- Response (updated after API call completes) + input_tokens_actual INTEGER, + output_tokens INTEGER, + cache_creation INTEGER, + cache_read INTEGER, + duration_ms 
REAL,
+    stop_reason         TEXT
+);
+
+CREATE INDEX idx_context_log_session ON nous_system.context_log (session_id, turn_number);
+CREATE INDEX idx_context_log_time ON nous_system.context_log (timestamp DESC);
+CREATE INDEX idx_context_log_model ON nous_system.context_log (model);
+```
+
+### 7. REST API Endpoints
+
+**`GET /context/log?session_id=X&limit=20`**
+List context log entries for a session (newest first). Returns metadata only (no full payloads).
+
+```json
+{
+  "entries": [
+    {
+      "id": "abc-123",
+      "turn_number": 5,
+      "timestamp": "2026-04-04T18:00:00Z",
+      "call_type": "chat",
+      "model": "claude-sonnet-4-20250514",
+      "frame_id": "task",
+      "total_tokens_est": 18500,
+      "utilization_pct": 9.2,
+      "input_tokens_actual": 19200,
+      "output_tokens": 1500,
+      "duration_ms": 8500,
+      "tools_count": 16,
+      "messages_count": 8,
+      "loaded_facts": 5,
+      "loaded_procedures": 3
+    }
+  ]
+}
+```
+
+**`GET /context/log/:id`**
+Full metadata for a single entry, including the complete token breakdown.
+
+**`GET /context/log/:id/payload`**
+Full API payload (if captured in ring buffer). Returns 404 if pruned.
+
+**`GET /context/log/:id/sections`**
+Parsed system prompt sections with individual token counts. Useful for understanding what's eating context.
+
+**`GET /context/diff?a=<entry_id>&b=<entry_id>`**
+Diff between two context log entries. Shows:
+- Sections added/removed
+- Token changes per section
+- Tools added/removed
+- Messages added/removed
+- Memory items that appeared/disappeared
+
+```json
+{
+  "a": "entry-id-1",
+  "b": "entry-id-2",
+  "token_delta": {
+    "total": 2400,
+    "by_section": {
+      "execution_ledger": 800,
+      "messages": 1200,
+      "working_memory": 400
+    }
+  },
+  "sections_added": ["corrections"],
+  "sections_removed": [],
+  "tools_added": [],
+  "tools_removed": ["web_search"],
+  "facts_delta": {
+    "added": ["Tim prefers dark mode"],
+    "removed": []
+  }
+}
+```
+
+### 8. 
Telegram Integration
+
+New `/context` command:
+
+```
+📋 Last API Call (Turn 5)
+   Model: claude-sonnet-4-20250514
+   Frame: task
+   Tokens: ~18.5K est / 19.2K actual (9.2% of window)
+   Duration: 8.5s
+
+   📊 Token Breakdown
+   • messages: 11,000 (59%)
+   • tools: 3,500 (19%)
+   • working_memory: 1,200 (6%)
+   • identity: 850 (5%)
+   • execution_ledger: 800 (4%)
+   • related_decisions: 600 (3%)
+   • relevant_facts: 450 (2%)
+   • other: 100 (1%)
+
+   📦 Loaded Memory
+   • 5 facts, 3 procedures, 2 decisions, 4 episodes
+
+   🔧 Tools: 16 (bash, read_file, write_file, ...)
+```
+
+Also extend `/status` with a compact context summary:
+
+```
+📋 Context (last call)
+   ~18.5K tokens (9.2% window) | 16 tools | 5 facts loaded
+```
+
+### 9. Dashboard Panel
+
+New "Context Inspector" panel in the Memory Dashboard (F021):
+
+- **Timeline view** — token usage over turns, stacked by section
+- **Section breakdown** — bar chart of token allocation per section
+- **Diff view** — select any two turns, see what changed
+- **Payload inspector** — raw JSON view of full payload (when captured)
+- **Memory loading** — which facts/decisions/procedures were loaded per turn
+- **Utilization trend** — context window fill % over time
+
+---
+
+## Configuration
+
+| Setting | Default | Description |
+|---------|---------|-------------|
+| `CONTEXT_LOG_ENABLED` | `true` | Enable context logging (metadata always) |
+| `CONTEXT_LOG_FULL_PAYLOAD` | `false` | Also capture full API payloads |
+| `CONTEXT_LOG_RING_SIZE` | `10` | Max full payloads per session |
+| `CONTEXT_LOG_MAX_TOTAL` | `50` | Global cap on stored payloads |
+| `CONTEXT_LOG_RETENTION_DAYS` | `30` | Auto-prune metadata older than this |
+
+---
+
+## Files Changed
+
+| File | Change | ~Lines |
+|------|--------|--------|
+| `nous/observability/context_logger.py` | **NEW** — `ContextLogger`, `ContextLogEntry`, `FullPayloadStore`, section parser | +250 |
+| `nous/api/runner.py` | Hook in `_build_api_payload()` and response handler | +40 |
+| 
`nous/db/migrations/` | context_log table + indexes | +30 |
+| `nous/api/rest.py` | `/context/log`, `/context/log/:id`, `/context/diff` endpoints | +120 |
+| `nous/telegram_bot.py` | `/context` command + `/status` extension | +50 |
+| `nous/config.py` | New settings | +10 |
+| `tests/test_context_logger.py` | **NEW** — unit + integration tests | +300 |
+| **Total** | | ~800 |
+
+---
+
+## Design Decisions
+
+| # | Decision | Rationale |
+|---|----------|-----------|
+| D1 | Metadata always, full payload optional | Full payloads are large (roughly 200-800KB at 50K-200K tokens, by the `len/4` heuristic); metadata is ~500 bytes. Default to cheap. |
+| D2 | `len(text)/4` for token estimation | Fast, no tokenizer dependency. Actual tokens from API response for calibration. |
+| D3 | Ring buffer for full payloads | Bounded memory. Most debugging needs only recent calls. |
+| D4 | Hook at `_build_api_payload()` | Single choke point — all API calls (chat, reflection, compaction, subtask) flow through here. |
+| D5 | Async DB write | Context logging must not add latency to the API call path. Fire-and-forget insert. |
+| D6 | Section parser based on known markers | System prompt structure is controlled by `_build_system_prompt()`. Markers are stable. |
+| D7 | Diff endpoint as first-class API | "What changed between turns" is the #1 debugging question. Make it zero-effort. |
+| D8 | 30-day retention default | Context logs are diagnostic, not archival. Auto-prune keeps DB manageable. |
+
+---
+
+## Interaction with Other F035 Specs
+
+- **F035.1 (Event Bus Stats):** Context log entries can be correlated with event bus activity by timestamp and session_id
+- **F035.2 (Causal Chain Tracing):** Context log entries carry `trace_id` for linking to causal chains — "what context did the model see when it made autonomous decision X?"
+- **F035.3 (Behavioral Drift Detection):** Context utilization trends feed into drift detection — "is the system prompt growing over time? Are more facts being loaded per turn?"
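Both the `/context/diff` endpoint and the F035.3 drift correlation reduce to the same primitive: a per-section token delta between two entries. A minimal sketch of that computation, operating on the `token_breakdown` dicts from section 2 (`token_delta` is a hypothetical helper name, not part of the spec's interface):

```python
def token_delta(a: dict[str, int], b: dict[str, int]) -> dict:
    """Per-section token difference between two context log entries (b minus a)."""
    changed = {
        s: b.get(s, 0) - a.get(s, 0)
        for s in set(a) | set(b)
        if b.get(s, 0) != a.get(s, 0)
    }
    return {
        "total": sum(b.values()) - sum(a.values()),
        "by_section": changed,
        "sections_added": sorted(set(b) - set(a)),
        "sections_removed": sorted(set(a) - set(b)),
    }
```

The result mirrors the `/context/diff` response shape: signed per-section deltas plus added/removed section lists.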
+ +--- + +## Acceptance Criteria + +- [ ] Every API call through `_build_api_payload()` produces a `ContextLogEntry` +- [ ] Token breakdown correctly identifies and measures all system prompt sections +- [ ] Actual API usage tokens (input/output/cache) are recorded after response +- [ ] `GET /context/log` returns entries filtered by session, paginated +- [ ] `GET /context/log/:id` returns full metadata including token breakdown +- [ ] `GET /context/log/:id/payload` returns full payload when captured (404 otherwise) +- [ ] `GET /context/diff` produces meaningful diff between two entries +- [ ] `/context` Telegram command shows last-call summary with token breakdown +- [ ] `/status` includes compact context summary line +- [ ] Full payload capture is off by default, toggleable via config +- [ ] Ring buffer correctly evicts oldest payloads when limits are reached +- [ ] Context logging adds < 1ms latency to API call path (async write) +- [ ] Retention auto-prune runs daily, removes entries older than configured days +- [ ] All tests pass (unit + integration) + +--- + +## Future Extensions + +- **Token budget optimizer** — use historical context logs to recommend which sections to trim when approaching context limits +- **Prompt A/B testing** — compare context configurations and their effect on response quality +- **Anomaly detection** — flag turns where context utilization spikes or drops unexpectedly (feeds into F035.3) +- **Cost attribution** — token breakdown × pricing = cost per section per turn
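As a back-of-envelope illustration of the last bullet, cost attribution is one multiplication per section over the existing token breakdown. A sketch assuming a hypothetical per-token input price (`cost_by_section` and the example rate are illustrative, not part of this spec):

```python
def cost_by_section(
    token_breakdown: dict[str, int],
    usd_per_input_token: float,  # hypothetical rate, e.g. 3e-6 for $3 per million tokens
) -> dict[str, float]:
    """Attribute estimated input cost to each context section for one call."""
    return {
        section: round(tokens * usd_per_input_token, 6)
        for section, tokens in token_breakdown.items()
    }
```

Summing the result over a session's entries gives cost per section per turn, which is exactly the attribution the bullet describes.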