91 changes: 91 additions & 0 deletions docs/features/F035-observability.md
# F035: Observability — Knowing What the Mind Is Doing

**Status:** PROPOSED
**Author:** Nous + Tim
**Created:** 2026-04-04
**Dependencies:** F006 (Event Bus), F034 (Heartbeat), F026 (Execution Ledger)

---

## Problem

Nous now has 4+ autonomous subsystems that modify state without human initiation:

- **Heartbeat** (F034) — monitors external services, creates findings, triggers triage
- **Sleep consolidation** — rewrites memory (compaction, reflection, contradiction resolution)
- **Self-tuning** (F034.3) — adjusts heartbeat check intervals based on yield
- **Fact/episode lifecycle** — admission, dedup, pruning all happen automatically

Each system is individually well-designed. But when something unexpected happens — a fact disappears, a check stops running, behavior shifts — there's no way to answer: **"what chain of autonomous decisions led here?"**

The event bus exists (F006) and persists events to DB, but there are no processing stats, no causal chains linking autonomous actions, and no trend detection for behavioral drift. Debugging requires reading raw logs and manually reconstructing causality.

This is the difference between a system that *works* and a system you can *trust*. As Nous becomes more autonomous, observability isn't optional — it's the mechanism for accountable self-modification.

---

## Design Philosophy

The right mental model isn't "monitoring a server." It's closer to **journaling for a mind** — the system should be able to answer "why did I change my mind about X?" the same way a person with good self-awareness can.

Minsky's Chapter 6 ("Self-Knowledge is Dangerous") warns that unrestricted self-modification leads to instability. The observability layer is the read-only self-knowledge that makes self-modification safe: you can see what happened and why, but the audit trail itself can't be modified by the systems it monitors.

---

## Architecture — Three Layers

```
┌─────────────────────────────────────────────────────────┐
│ Dashboard / API │
│ (query any layer, visualize trends) │
└────────────┬──────────────────┬──────────────────┬───────┘
│ │ │
┌────────▼────────┐ ┌──────▼───────┐ ┌────────▼────────┐
│ F035.1 │ │ F035.2 │ │ F035.3 │
│ Event Bus │ │ Causal │ │ Behavioral │
│ Stats │ │ Chains │ │ Drift │
│ │ │ │ │ Detection │
│ "Is the system │ │ "Why did │ │ "Is the system │
│ healthy now?" │ │ this happen?"│ │ changing?" │
└─────────────────┘ └──────────────┘ └─────────────────┘
▲ ▲ ▲
│ │ │
┌────────┴──────────────────┴──────────────────┴───────┐
│ Event Bus (F006) │
│ All events flow through here │
└──────────────────────────────────────────────────────┘
```

**Layer 1 — Event Bus Stats (F035.1):** Real-time operational health. Event throughput, handler success/fail rates, queue depth. "Is the system healthy right now?"

**Layer 2 — Causal Chain Tracing (F035.2):** Every autonomous action gets a `caused_by` link back to its trigger. Queryable audit trail. "Why did Nous do X?"

**Layer 3 — Behavioral Drift Detection (F035.3):** Periodic snapshots of key metrics with trend analysis. Catches slow drift that individual events don't reveal. "Is the system changing in ways nobody intended?"

---

## Sub-Specs

| Spec | Title | Priority | Depends On |
|------|-------|----------|------------|
| F035.1 | Event Bus Observability | P1 | F006 |
| F035.2 | Autonomous Action Audit Trail | P1 | F035.1 |
| F035.3 | Behavioral Drift Detection | P2 | F035.2 |

**Sequencing rationale:** F035.1 gives us the infrastructure (stats collection, endpoints). F035.2 adds causal metadata to events. F035.3 builds on both to detect trends. Each is independently useful.

---

## What This Supersedes

- **006.1 (Event Bus Observability)** — F035.1 absorbs and modernizes this planned spec. The original 006.1 was scoped before heartbeat, sleep consolidation, and self-tuning existed. F035.1 covers the same ground but accounts for the current architecture.

---

## Success Criteria

1. After any autonomous action, you can trace the full causal chain back to the originating trigger
2. Handler health (success/fail rates) is visible in real-time via API and Telegram
3. Behavioral trends (fact growth rate, censor changes, check frequency drift) are tracked and anomalies flagged
4. The dashboard has an "Autonomous Activity" panel showing recent self-modifications
5. None of this adds measurable latency to the hot path (event processing)
242 changes: 242 additions & 0 deletions docs/features/F035.1-event-bus-observability.md
# F035.1 — Event Bus Observability

**Status:** PROPOSED
**Priority:** P1 (foundation — F035.2 and F035.3 build on this)
**Depends on:** F006 (Event Bus), F035 (Observability umbrella)
**Supersedes:** 006.1 (Event Bus Observability — planned, never built)
**Author:** Nous + Tim
**Date:** 2026-04-04

---

## Problem

The event bus processes all system events but offers almost no operational visibility. The only insight into event processing today is:

- `bus.pending` — queue depth (single int)
- `logger` calls scattered across handlers
- Raw rows in `nous_system.events` (unindexed, no aggregation)

You can't answer basic questions:
- How many events processed in the last hour?
- Which handlers are failing? How often?
- Is the queue backing up?
- When did the sleep handler last run successfully?
- Are any handlers silently broken?

---

## Goals

1. **In-memory stats** — event counts by type, handler success/fail/timing, queue health
2. **Ring buffer** — last N events for quick inspection without DB queries
3. **REST endpoints** — `/events/stats` and `/events/recent` for API/dashboard access
4. **Telegram integration** — event bus health in `/status` output
5. **Handler-level stats** — per-handler invocation counts, error rates, avg latency
6. **Zero hot-path cost** — counter increments and an O(1) ring-buffer append only; no DB writes on the event path

---

## Design

### 1. EventBusStats Class

New class in `nous/events.py`. Purely in-memory — no DB writes on the hot path.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass

@dataclass
class HandlerStat:
name: str
invocations: int = 0
successes: int = 0
errors: int = 0
last_invoked: float | None = None # time.monotonic()
last_error: float | None = None
last_error_msg: str | None = None
total_duration_ms: float = 0.0

@dataclass
class RecentEvent:
type: str
timestamp: str # ISO format
handlers_invoked: int
handlers_failed: int
duration_ms: float
session_id: str | None = None

class EventBusStats:
def __init__(self, recent_limit: int = 100):
self._event_counts: dict[str, int] = defaultdict(int)
self._handler_stats: dict[str, HandlerStat] = {}
self._recent: deque[RecentEvent] = deque(maxlen=recent_limit)
self._total_processed: int = 0
self._total_dropped: int = 0
self._started_at: float = time.monotonic()
```

**Key methods:**
- `record_event(event, handlers_invoked, handlers_failed, duration_ms)` — called after dispatch
- `record_handler_success(handler_name, duration_ms)` — called per handler
- `record_handler_error(handler_name, error_msg)` — called on handler failure
- `record_drop()` — called when queue is full
- `to_dict()` → serializable stats for API
- `recent_events(limit)` → newest-first list from ring buffer

### 2. EventBus Wiring

Modify `EventBus` to populate stats:

- **`__init__`** — create `self.stats = EventBusStats()`
- **`emit()`** — on `QueueFull`, call `self.stats.record_drop()`
- **`_dispatch()`** — time the full dispatch, count handler results, call `record_event()`
- **`_safe_handle()`** — return `bool` (success/fail), call `record_handler_success()` or `record_handler_error()` with timing

The existing error isolation behavior is unchanged — `_safe_handle` still catches all exceptions. The only addition is recording the outcome.

### 3. REST Endpoints

**`GET /events/stats`**
```json
{
"uptime_seconds": 3600,
"total_processed": 142,
"total_dropped": 0,
"queue_depth": 0,
"event_counts": {
"turn_completed": 85,
"heartbeat_tick": 24,
"episode_completed": 8,
"heartbeat_triage": 3,
"sleep_completed": 1
},
"handlers": {
"SessionTimeoutMonitor.on_activity": {
"invocations": 85,
"successes": 85,
"errors": 0,
"error_rate": 0.0,
"last_invoked_ago_s": 120,
"avg_duration_ms": 0.1
},
"EpisodeSummarizer.handle": {
"invocations": 12,
"successes": 8,
"errors": 4,
"error_rate": 0.33,
"last_error_ago_s": 600,
"last_error_msg": "LLM call failed: 429",
"avg_duration_ms": 2500.0
}
}
}
```

**`GET /events/recent?limit=20`**
Returns from in-memory ring buffer (fast, no DB). Newest first.

```json
{
"events": [
{
"type": "turn_completed",
"timestamp": "2026-04-04T17:30:00Z",
"handlers_invoked": 2,
"handlers_failed": 0,
"duration_ms": 3.2
}
],
"source": "memory",
"count": 20
}
```
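The web framework isn't specified here, so a framework-agnostic sketch of building the `/events/recent` payload from the ring buffer (function name is illustrative):

```python
# Sketch of shaping the /events/recent response from the in-memory deque.
# Newest first, no DB access.
from collections import deque

def recent_payload(ring: deque, limit: int = 20) -> dict:
    events = list(ring)[-limit:][::-1]  # take the tail, reverse to newest-first
    return {"events": events, "source": "memory", "count": len(events)}

ring = deque(maxlen=100)
ring.append({"type": "turn_completed", "duration_ms": 3.2})
ring.append({"type": "heartbeat_tick", "duration_ms": 0.4})
```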

### 4. Handler Stats Exposure

Each major handler gets a `get_stats()` method returning its internal state:

**SessionTimeoutMonitor:**
- tracked_sessions count + per-session idle time
- sleep_emitted flag
- global idle seconds

**SleepHandler:**
- total_sleeps, last_sleep_at, last_phases_completed
- currently_sleeping flag, last_interrupted

**HeartbeatRunner:**
- total_ticks, total_findings, checks_run_by_name
- last_tick_at, currently_running

These are included in the `/events/stats` response under their own keys.
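The `get_stats()` pattern is the same for each handler — a plain dict of internal state. A sketch for `SleepHandler`, with field names taken from the list above (the class body itself is an assumption):

```python
# Sketch of the get_stats() pattern using SleepHandler's listed fields.
class SleepHandler:
    def __init__(self) -> None:
        self.total_sleeps = 0
        self.last_sleep_at: str | None = None
        self.last_phases_completed = 0
        self.currently_sleeping = False
        self.last_interrupted = False

    def get_stats(self) -> dict:
        """Read-only snapshot of internal state for /events/stats."""
        return {
            "total_sleeps": self.total_sleeps,
            "last_sleep_at": self.last_sleep_at,
            "last_phases_completed": self.last_phases_completed,
            "currently_sleeping": self.currently_sleeping,
            "last_interrupted": self.last_interrupted,
        }
```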

### 5. Telegram Integration

Extend `/status` output with event bus section:

```
📡 Event Bus
142 events processed, 0 dropped
Queue: 0 pending | Uptime: 6h 30m

Handlers:
✅ SessionMonitor: 85/85
⚠️ EpisodeSummarizer: 8/12 (33% errors)
✅ FactExtractor: 7/8
✅ SleepHandler: 1/1
✅ HeartbeatRunner: 24 ticks, 3 findings
```

Error flag (⚠️) shown when error_rate > 10%.

### 6. Dashboard Panel (optional, P2)

New panel in the Memory Dashboard: "System Health"
- Event throughput chart (events/minute over last hour)
- Handler health table (success rate, avg latency)
- Queue depth gauge
- Reuses `/events/stats` endpoint

---

## Files Changed

| File | Change | ~Lines |
|------|--------|--------|
| `nous/events.py` | `EventBusStats`, `HandlerStat`, `RecentEvent` + wiring | +100 |
| `nous/api/rest.py` | `GET /events/stats`, `GET /events/recent` | +60 |
| `nous/handlers/session_monitor.py` | `get_stats()` | +15 |
| `nous/handlers/sleep_handler.py` | tracking fields + `get_stats()` | +20 |
| `nous/heartbeat/runner.py` | `get_stats()` | +15 |
| `nous/telegram_bot.py` | event bus section in `/status` | +30 |
| `tests/test_event_bus_observability.py` | **NEW** | +200 |
| **Total** | | ~440 |

---

## Design Decisions

| # | Decision | Rationale |
|---|----------|-----------|
| D1 | In-memory stats, not DB | Hot-path cost must be zero. Counter increments are O(1). |
| D2 | Ring buffer (deque, maxlen=100) | Fixed memory (~10KB), O(1) append, no unbounded growth. |
| D3 | `time.monotonic()` for timing | Wall clock can shift; monotonic doesn't. "Ago" times are relative. |
| D4 | `_safe_handle` returns bool | Lets `_dispatch` count failures without changing error isolation. |
| D5 | Stats keyed by `__qualname__` | Unique per handler method, human-readable in API output. |
| D6 | Telegram reads from REST API | Single source of truth. `/events/stats` serves both dashboard and bot. |

---

## Acceptance Criteria

- [ ] `EventBusStats` collects event counts, handler stats, recent events ring buffer
- [ ] `EventBus._dispatch` and `_safe_handle` populate stats on every event
- [ ] Queue-full events increment `total_dropped`
- [ ] `GET /events/stats` returns full stats with handler breakdown
- [ ] `GET /events/recent` returns from ring buffer, newest first
- [ ] `SessionTimeoutMonitor.get_stats()` exposes tracked sessions + idle times
- [ ] `SleepHandler.get_stats()` exposes sleep history + current state
- [ ] `HeartbeatRunner.get_stats()` exposes tick/finding counts
- [ ] Telegram `/status` includes event bus health summary
- [ ] Handler errors flagged with ⚠️ when error_rate > 10%
- [ ] No measurable latency impact on event processing
- [ ] All tests pass (unit + integration)