From 35b6a27d0ca8c580c0c92187452dbe0972a63521 Mon Sep 17 00:00:00 2001 From: Nous Date: Sat, 4 Apr 2026 18:18:43 +0000 Subject: [PATCH] =?UTF-8?q?F035.4:=20Context=20Visibility=20=E2=80=94=20wh?= =?UTF-8?q?at=20the=20model=20actually=20sees?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit New observability sub-spec for full LLM context window transparency: - ContextLogEntry with token breakdown by system prompt section - Full payload ring buffer (optional, bounded) - REST endpoints: /context/log, /context/diff - Telegram /context command - Dashboard Context Inspector panel - Hooks into _build_api_payload() single choke point Updated F035 umbrella spec: - Added Layer 4 description - Added F035.4 to sub-specs table - Updated sequencing rationale (F035.4 is independent, parallelizable) - Added success criterion #6 for context visibility --- docs/features/F035-observability.md | 8 +- docs/features/F035.4-context-visibility.md | 493 +++++++++++++++++++++ 2 files changed, 499 insertions(+), 2 deletions(-) create mode 100644 docs/features/F035.4-context-visibility.md diff --git a/docs/features/F035-observability.md b/docs/features/F035-observability.md index 3b1c2cd..db04c18 100644 --- a/docs/features/F035-observability.md +++ b/docs/features/F035-observability.md @@ -32,7 +32,7 @@ Minsky's Chapter 6 ("Self-Knowledge is Dangerous") warns that unrestricted self- --- -## Architecture — Three Layers +## Architecture — Four Layers ``` ┌─────────────────────────────────────────────────────────┐ @@ -62,6 +62,8 @@ Minsky's Chapter 6 ("Self-Knowledge is Dangerous") warns that unrestricted self- **Layer 3 — Behavioral Drift Detection (F035.3):** Periodic snapshots of key metrics with trend analysis. Catches slow drift that individual events don't reveal. "Is the system changing in ways nobody intended?" +**Layer 4 — Context Visibility (F035.4):** Full transparency into what the LLM actually sees on every API call. 
Token breakdown by section, memory loading inventory, turn-over-turn diffs. "What did the model see when it made that choice?" + --- ## Sub-Specs @@ -71,8 +73,9 @@ Minsky's Chapter 6 ("Self-Knowledge is Dangerous") warns that unrestricted self- | F035.1 | Event Bus Observability | P1 | F006 | | F035.2 | Autonomous Action Audit Trail | P1 | F035.1 | | F035.3 | Behavioral Drift Detection | P2 | F035.2 | +| F035.4 | Context Visibility | P1 | F035 (umbrella) | -**Sequencing rationale:** F035.1 gives us the infrastructure (stats collection, endpoints). F035.2 adds causal metadata to events. F035.3 builds on both to detect trends. Each is independently useful. +**Sequencing rationale:** F035.1 gives us the infrastructure (stats collection, endpoints). F035.2 adds causal metadata to events. F035.3 builds on both to detect trends. F035.4 is independent — it hooks into the runner pipeline, not the event bus — and can be built in parallel with F035.1-3. Each sub-spec is independently useful. --- @@ -89,3 +92,4 @@ Minsky's Chapter 6 ("Self-Knowledge is Dangerous") warns that unrestricted self- 3. Behavioral trends (fact growth rate, censor changes, check frequency drift) are tracked and anomalies flagged 4. The dashboard has an "Autonomous Activity" panel showing recent self-modifications 5. None of this adds measurable latency to the hot path (event processing) +6. 
For any API call, you can see the full token breakdown by context section and which memory items were loaded diff --git a/docs/features/F035.4-context-visibility.md b/docs/features/F035.4-context-visibility.md new file mode 100644 index 0000000..67cf625 --- /dev/null +++ b/docs/features/F035.4-context-visibility.md @@ -0,0 +1,493 @@ +# F035.4 — Context Visibility (What the Model Actually Sees) + +**Status:** PROPOSED +**Priority:** P1 (critical for debugging prompt assembly) +**Depends on:** F035 (Observability umbrella) +**Author:** Nous + Tim +**Date:** 2026-04-04 + +--- + +## Problem + +Nous assembles a complex context window for every LLM API call. The system prompt alone is a composite of: + +- Static identity & character +- User profile (facts about Tim) +- Active censors +- Cognitive frame + frame-specific instructions +- Working memory (current task, loaded context items) +- Related decisions +- Relevant facts +- Recent conversations +- Execution ledger (prior tool calls this session) +- Diagnostic nudges (from F024 Critic) +- Pending corrections +- Platform-specific formatting rules (Telegram) + +On top of that, the **messages array** contains the full conversation history for the session — potentially including tool calls, tool results, and multi-turn context. + +Today there is **zero visibility** into what gets assembled. When something goes wrong — Nous doesn't use a fact it should know, repeats a tool call, or behaves oddly — debugging requires reading code and mentally reconstructing what the context must have looked like. There's no way to answer: + +- "What was in the system prompt on turn 3?" +- "How many tokens did the tools definition consume?" +- "Did the relevant fact about X actually get loaded?" +- "What changed between turn 5 and turn 6?" + +This is the difference between F035.1-3 (what the *system* did) and F035.4 (what the *model* was told). The other layers trace autonomous actions and system health. 
This layer traces the **information supply chain** to the LLM itself. + +--- + +## Design Philosophy + +The model's behavior is fully determined by its context window. If you can see exactly what went in, you can explain exactly what came out. This is the **most direct debugging tool possible** — not inference about what might have happened, but the actual artifact. + +The key tension is **storage cost vs diagnostic value**. Full prompt payloads are large (50K-200K tokens per call). We solve this with tiered retention: structured metadata always, full payloads on-demand. + +--- + +## Architecture + +``` +┌─────────────────────────────────────────────────────────┐ +│ Dashboard / API / Telegram │ +│ Browse by session → turn → API call │ +│ Token breakdown charts, context diff view │ +└───────────────┬───────────────────────────────────────────┘ + │ + ┌───────────▼────────────┐ + │ ContextLog Service │ + │ │ + │ • Intercepts payload │ + │ at _build_api_ │ + │ payload() exit │ + │ • Extracts metadata │ + │ • Stores structured │ + │ summary + optional │ + │ full payload │ + └───────────┬────────────┘ + │ + ┌───────────▼────────────┐ ┌─────────────────────┐ + │ context_log table │ │ Full Payload Store │ + │ (always written) │ │ (ring buffer, last │ + │ │ │ N calls per │ + │ • session_id │ │ session) │ + │ • turn_number │ │ │ + │ • timestamp │ │ JSON files or DB │ + │ • model │ │ blob, auto-pruned │ + │ • token breakdown │ └─────────────────────┘ + │ • sections present │ + │ • tool count │ + │ • messages count │ + │ • frame_id │ + │ • trace_id (F035.2) │ + └────────────────────────┘ +``` + +--- + +## Design + +### 1. ContextLogEntry — Structured Metadata (Always Captured) + +Every API call gets a lightweight metadata record. This is cheap to store and covers 90% of debugging needs. 
+ +```python +@dataclass +class ContextLogEntry: + id: str # UUID + session_id: str + turn_number: int + timestamp: str # ISO 8601 + call_type: str # "chat", "reflection", "compaction", "subtask", "critic" + model: str # e.g. "claude-sonnet-4-20250514" + frame_id: str # "task", "conversation", "decision", etc. + trace_id: str | None # F035.2 causal chain link + + # Token breakdown by section + token_breakdown: dict[str, int] # See section below + + # What was included + sections_present: list[str] # ["identity", "user_profile", "censors", "frame_instructions", ...] + tools_count: int # Number of tool definitions sent + tool_names: list[str] # Which tools were included + messages_count: int # Length of messages array + message_roles: dict[str, int] # {"user": 3, "assistant": 3, "tool": 2} + + # Working memory specifics + loaded_facts_count: int + loaded_decisions_count: int + loaded_procedures_count: int + loaded_episodes_count: int + recent_conversations_count: int + + # Context budget + total_tokens_estimated: int # Sum of all sections + context_window_size: int # Model's max context + utilization_pct: float # total / window size + + # Response metadata (filled after response) + input_tokens_actual: int | None # From API usage response + output_tokens: int | None + cache_creation_tokens: int | None + cache_read_tokens: int | None + duration_ms: float | None + stop_reason: str | None +``` + +### 2. Token Breakdown + +The most valuable single diagnostic: where are the tokens going? 
+ +```python +token_breakdown = { + "identity": 850, # Static identity + character + "user_profile": 420, # Facts about Tim + "censors": 380, # Active censor rules + "frame_instructions": 250, # Frame-specific tool guidance + "working_memory": 1200, # Current task + loaded context + "related_decisions": 600, # Past decisions + "relevant_facts": 450, # Memory facts + "recent_conversations": 300, # Episode summaries + "execution_ledger": 800, # Prior tool calls this session + "diagnostic_nudges": 150, # F024 Critic feedback + "corrections": 0, # Pending corrections + "telegram_format": 200, # Platform formatting rules + "tools_definition": 3500, # Tool JSON schemas + "messages": 12000, # Conversation history + "thinking_config": 0, # Thinking params (not tokens, but tracked) +} +``` + +Token estimation uses `len(text) / 4` as a fast approximation (no tokenizer dependency). The `input_tokens_actual` field from the API response provides the ground truth for calibration. + +### 3. Full Payload Capture (Optional, Ring Buffer) + +For deep debugging, optionally capture the complete API payload. This is large, so it's stored in a ring buffer — last N calls per session, auto-pruned. + +```python +class FullPayloadStore: + """Ring buffer for full API payloads. In-memory + optional disk spill.""" + + def __init__(self, max_per_session: int = 10, max_total: int = 50): + self._store: dict[str, deque[ContextPayload]] = {} + self._max_per_session = max_per_session + self._max_total = max_total + + def capture(self, session_id: str, entry_id: str, payload: dict) -> None: + """Store a full payload. Oldest auto-evicted when limit reached.""" + ... + + def get(self, entry_id: str) -> dict | None: + """Retrieve a full payload by context log entry ID.""" + ... + + def get_session(self, session_id: str) -> list[ContextPayload]: + """Get all captured payloads for a session (newest first).""" + ... 
+``` + +Configuration: +- `CONTEXT_LOG_FULL_PAYLOAD`: `true`/`false` (default: `false`) +- `CONTEXT_LOG_RING_SIZE`: max payloads per session (default: `10`) +- `CONTEXT_LOG_MAX_TOTAL`: global cap (default: `50`) + +### 4. Hooking Into the Pipeline + +The `_build_api_payload()` method in `runner.py` is the single choke point where all API calls are assembled. The hook goes here: + +```python +def _build_api_payload(self, system_prompt, messages, tools=None, ...): + # ... existing payload assembly ... + payload = { ... } + + # F035.4: Log context metadata + if self._context_logger: + self._context_logger.log( + session_id=self._current_session_id, + turn_number=self._current_turn, + call_type=call_type, + system_prompt=system_prompt, + messages=messages, + tools=tools, + payload=payload, + frame_id=self._current_frame_id, + ) + + return payload +``` + +The `ContextLogger.log()` method: +1. Parses the system prompt into sections (splits on known headers like `## Identity`, `## Active Censors`, etc.) +2. Estimates tokens per section +3. Creates a `ContextLogEntry` with structured metadata +4. Optionally captures the full payload in the ring buffer +5. Writes the entry to the DB (async, non-blocking — must not add latency to the API call path) + +After the API response returns, the entry is updated with actual usage tokens and duration. + +### 5. System Prompt Section Parser + +The system prompt is assembled from named sections. 
The parser identifies them for the token breakdown: + +```python +SECTION_MARKERS = { + "## Current Date/Time": "datetime", + "## Identity": "identity", + "## User Profile": "user_profile", + "## Context Safety": "context_safety", + "## Active Censors": "censors", + "## Current Frame": "frame", + "## Working Memory": "working_memory", + "## Related Decisions": "related_decisions", + "## Relevant Facts": "relevant_facts", + "## Recent Conversations":"recent_conversations", + "## Tool Instructions": "frame_instructions", + "[Execution Ledger]": "execution_ledger", + "[Previous Turn Corrections]": "corrections", + "## Output Formatting": "telegram_format", +} + +def parse_system_sections(system_prompt: str) -> dict[str, str]: + """Split the system prompt into named sections with their text content.""" + ... +``` + +### 6. Database Schema + +```sql +CREATE TABLE IF NOT EXISTS nous_system.context_log ( + id UUID PRIMARY KEY, + session_id TEXT NOT NULL, + turn_number INTEGER NOT NULL, + timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(), + call_type TEXT NOT NULL, -- chat, reflection, compaction, subtask, critic + model TEXT NOT NULL, + frame_id TEXT, + trace_id TEXT, -- F035.2 link + + -- Token breakdown (JSONB for flexibility) + token_breakdown JSONB NOT NULL, + total_tokens_est INTEGER NOT NULL, + context_window_size INTEGER NOT NULL, + utilization_pct REAL NOT NULL, + + -- Content inventory + sections_present TEXT[] NOT NULL, + tools_count INTEGER NOT NULL DEFAULT 0, + tool_names TEXT[], + messages_count INTEGER NOT NULL DEFAULT 0, + message_roles JSONB, + + -- Memory loading + loaded_facts INTEGER NOT NULL DEFAULT 0, + loaded_decisions INTEGER NOT NULL DEFAULT 0, + loaded_procedures INTEGER NOT NULL DEFAULT 0, + loaded_episodes INTEGER NOT NULL DEFAULT 0, + recent_conversations INTEGER NOT NULL DEFAULT 0, + + -- Response (updated after API call completes) + input_tokens_actual INTEGER, + output_tokens INTEGER, + cache_creation INTEGER, + cache_read INTEGER, + duration_ms 
REAL,
+    stop_reason         TEXT
+);
+
+CREATE INDEX idx_context_log_session ON nous_system.context_log (session_id, turn_number);
+CREATE INDEX idx_context_log_time ON nous_system.context_log (timestamp DESC);
+CREATE INDEX idx_context_log_model ON nous_system.context_log (model);
+```
+
+### 7. REST API Endpoints
+
+**`GET /context/log?session_id=X&limit=20`**
+List context log entries for a session (newest first). Returns metadata only (no full payloads).
+
+```json
+{
+  "entries": [
+    {
+      "id": "abc-123",
+      "turn_number": 5,
+      "timestamp": "2026-04-04T18:00:00Z",
+      "call_type": "chat",
+      "model": "claude-sonnet-4-20250514",
+      "frame_id": "task",
+      "total_tokens_est": 18500,
+      "utilization_pct": 9.2,
+      "input_tokens_actual": 19200,
+      "output_tokens": 1500,
+      "duration_ms": 8500,
+      "tools_count": 16,
+      "messages_count": 8,
+      "loaded_facts": 5,
+      "loaded_procedures": 3
+    }
+  ]
+}
+```
+
+**`GET /context/log/:id`**
+Full metadata for a single entry, including the complete token breakdown.
+
+**`GET /context/log/:id/payload`**
+Full API payload (if captured in ring buffer). Returns 404 if pruned.
+
+**`GET /context/log/:id/sections`**
+Parsed system prompt sections with individual token counts. Useful for understanding what's eating context.
+
+**`GET /context/diff?a=<entry_id>&b=<entry_id>`**
+Diff between two context log entries. Shows:
+- Sections added/removed
+- Token changes per section
+- Tools added/removed
+- Messages added/removed
+- Memory items that appeared/disappeared
+
+```json
+{
+  "a": "entry-id-1",
+  "b": "entry-id-2",
+  "token_delta": {
+    "total": 2400,
+    "by_section": {
+      "execution_ledger": 800,
+      "messages": 1200,
+      "working_memory": 400
+    }
+  },
+  "sections_added": ["corrections"],
+  "sections_removed": [],
+  "tools_added": [],
+  "tools_removed": ["web_search"],
+  "facts_delta": {
+    "added": ["Tim prefers dark mode"],
+    "removed": []
+  }
+}
+```
+
+### 8. 
Telegram Integration
+
+New `/context` command:
+
+```
+📋 Last API Call (Turn 5)
+   Model: claude-sonnet-4-20250514
+   Frame: task
+   Tokens: ~18.5K est / 19.2K actual (9.2% of window)
+   Duration: 8.5s
+
+   📊 Token Breakdown
+   • messages: 11,000 (59%)
+   • tools: 3,500 (19%)
+   • working_memory: 1,200 (6%)
+   • identity: 850 (5%)
+   • execution_ledger: 800 (4%)
+   • related_decisions: 600 (3%)
+   • relevant_facts: 450 (2%)
+   • other: 100 (1%)
+
+   📦 Loaded Memory
+   • 5 facts, 3 procedures, 2 decisions, 4 episodes
+
+   🔧 Tools: 16 (bash, read_file, write_file, ...)
+```
+
+Also extend `/status` with a compact context summary:
+
+```
+📋 Context (last call)
+   ~18.5K tokens (9.2% window) | 16 tools | 5 facts loaded
+```
+
+### 9. Dashboard Panel
+
+New "Context Inspector" panel in the Memory Dashboard (F021):
+
+- **Timeline view** — token usage over turns, stacked by section
+- **Section breakdown** — bar chart of token allocation per section
+- **Diff view** — select any two turns, see what changed
+- **Payload inspector** — raw JSON view of full payload (when captured)
+- **Memory loading** — which facts/decisions/procedures were loaded per turn
+- **Utilization trend** — context window fill % over time
+
+---
+
+## Configuration
+
+| Setting | Default | Description |
+|---------|---------|-------------|
+| `CONTEXT_LOG_ENABLED` | `true` | Enable context logging (metadata always) |
+| `CONTEXT_LOG_FULL_PAYLOAD` | `false` | Also capture full API payloads |
+| `CONTEXT_LOG_RING_SIZE` | `10` | Max full payloads per session |
+| `CONTEXT_LOG_MAX_TOTAL` | `50` | Global cap on stored payloads |
+| `CONTEXT_LOG_RETENTION_DAYS` | `30` | Auto-prune metadata older than this |
+
+---
+
+## Files Changed
+
+| File | Change | ~Lines |
+|------|--------|--------|
+| `nous/observability/context_logger.py` | **NEW** — `ContextLogger`, `ContextLogEntry`, `FullPayloadStore`, section parser | +250 |
+| `nous/api/runner.py` | Hook in `_build_api_payload()` and response handler | +40 |
+| 
`nous/db/migrations/` | context_log table + indexes | +30 |
+| `nous/api/rest.py` | `/context/log`, `/context/log/:id`, `/context/diff` endpoints | +120 |
+| `nous/telegram_bot.py` | `/context` command + `/status` extension | +50 |
+| `nous/config.py` | New settings | +10 |
+| `tests/test_context_logger.py` | **NEW** — unit + integration tests | +300 |
+| **Total** | | ~800 |
+
+---
+
+## Design Decisions
+
+| # | Decision | Rationale |
+|---|----------|-----------|
+| D1 | Metadata always, full payload optional | Full payloads are large (roughly 200-800KB at 50K-200K tokens, by the `len/4` heuristic); metadata is ~500 bytes. Default to cheap. |
+| D2 | `len(text)/4` for token estimation | Fast, no tokenizer dependency. Actual tokens from API response for calibration. |
+| D3 | Ring buffer for full payloads | Bounded memory. Most debugging needs only recent calls. |
+| D4 | Hook at `_build_api_payload()` | Single choke point — all API calls (chat, reflection, compaction, subtask) flow through here. |
+| D5 | Async DB write | Context logging must not add latency to the API call path. Fire-and-forget insert. |
+| D6 | Section parser based on known markers | System prompt structure is controlled by `_build_system_prompt()`. Markers are stable. |
+| D7 | Diff endpoint as first-class API | "What changed between turns" is the #1 debugging question. Make it zero-effort. |
+| D8 | 30-day retention default | Context logs are diagnostic, not archival. Auto-prune keeps DB manageable. |
+
+---
+
+## Interaction with Other F035 Specs
+
+- **F035.1 (Event Bus Stats):** Context log entries can be correlated with event bus activity by timestamp and session_id
+- **F035.2 (Causal Chain Tracing):** Context log entries carry `trace_id` for linking to causal chains — "what context did the model see when it made autonomous decision X?"
+- **F035.3 (Behavioral Drift Detection):** Context utilization trends feed into drift detection — "is the system prompt growing over time? Are more facts being loaded per turn?"
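Both the `/context/diff` endpoint and the F035.3 drift correlation reduce to the same primitive: a per-section token delta between two entries. A minimal sketch of that computation, operating on the `token_breakdown` dicts from section 2 (`token_delta` is a hypothetical helper name, not part of the spec's interface):

```python
def token_delta(a: dict[str, int], b: dict[str, int]) -> dict:
    """Per-section token difference between two context log entries (b minus a)."""
    changed = {
        s: b.get(s, 0) - a.get(s, 0)
        for s in set(a) | set(b)
        if b.get(s, 0) != a.get(s, 0)
    }
    return {
        "total": sum(b.values()) - sum(a.values()),
        "by_section": changed,
        "sections_added": sorted(set(b) - set(a)),
        "sections_removed": sorted(set(a) - set(b)),
    }
```

The result mirrors the `/context/diff` response shape: signed per-section deltas plus added/removed section lists.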
+ +--- + +## Acceptance Criteria + +- [ ] Every API call through `_build_api_payload()` produces a `ContextLogEntry` +- [ ] Token breakdown correctly identifies and measures all system prompt sections +- [ ] Actual API usage tokens (input/output/cache) are recorded after response +- [ ] `GET /context/log` returns entries filtered by session, paginated +- [ ] `GET /context/log/:id` returns full metadata including token breakdown +- [ ] `GET /context/log/:id/payload` returns full payload when captured (404 otherwise) +- [ ] `GET /context/diff` produces meaningful diff between two entries +- [ ] `/context` Telegram command shows last-call summary with token breakdown +- [ ] `/status` includes compact context summary line +- [ ] Full payload capture is off by default, toggleable via config +- [ ] Ring buffer correctly evicts oldest payloads when limits are reached +- [ ] Context logging adds < 1ms latency to API call path (async write) +- [ ] Retention auto-prune runs daily, removes entries older than configured days +- [ ] All tests pass (unit + integration) + +--- + +## Future Extensions + +- **Token budget optimizer** — use historical context logs to recommend which sections to trim when approaching context limits +- **Prompt A/B testing** — compare context configurations and their effect on response quality +- **Anomaly detection** — flag turns where context utilization spikes or drops unexpectedly (feeds into F035.3) +- **Cost attribution** — token breakdown × pricing = cost per section per turn
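As a back-of-envelope illustration of the last bullet, cost attribution is one multiplication per section over the existing token breakdown. A sketch assuming a hypothetical per-token input price (`cost_by_section` and the example rate are illustrative, not part of this spec):

```python
def cost_by_section(
    token_breakdown: dict[str, int],
    usd_per_input_token: float,  # hypothetical rate, e.g. 3e-6 for $3 per million tokens
) -> dict[str, float]:
    """Attribute estimated input cost to each context section for one call."""
    return {
        section: round(tokens * usd_per_input_token, 6)
        for section, tokens in token_breakdown.items()
    }
```

Summing the result over a session's entries gives cost per section per turn, which is exactly the attribution the bullet describes.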