91 changes: 91 additions & 0 deletions docs/features/F035-observability.md
# F035: Observability — Knowing What the Mind Is Doing

**Status:** PROPOSED
**Author:** Nous + Tim
**Created:** 2026-04-04
**Dependencies:** F006 (Event Bus), F034 (Heartbeat), F026 (Execution Ledger)

---

## Problem

Nous now has 4+ autonomous subsystems that modify state without human initiation:

- **Heartbeat** (F034) — monitors external services, creates findings, triggers triage
- **Sleep consolidation** — rewrites memory (compaction, reflection, contradiction resolution)
- **Self-tuning** (F034.3) — adjusts heartbeat check intervals based on yield
- **Fact/episode lifecycle** — admission, dedup, pruning all happen automatically

Each system is individually well-designed. But when something unexpected happens — a fact disappears, a check stops running, behavior shifts — there's no way to answer: **"what chain of autonomous decisions led here?"**

The event bus exists (F006) and persists events to DB, but there are no processing stats, no causal chains linking autonomous actions, and no trend detection for behavioral drift. Debugging requires reading raw logs and manually reconstructing causality.

This is the difference between a system that *works* and a system you can *trust*. As Nous becomes more autonomous, observability isn't optional — it's the mechanism for accountable self-modification.

---

## Design Philosophy

The right mental model isn't "monitoring a server." It's closer to **journaling for a mind** — the system should be able to answer "why did I change my mind about X?" the same way a person with good self-awareness can.

Minsky's Chapter 6 ("Self-Knowledge is Dangerous") warns that unrestricted self-modification leads to instability. The observability layer is the read-only self-knowledge that makes self-modification safe: you can see what happened and why, but the audit trail itself can't be modified by the systems it monitors.

---

## Architecture — Three Layers

```
┌─────────────────────────────────────────────────────────┐
│ Dashboard / API │
│ (query any layer, visualize trends) │
└────────────┬──────────────────┬──────────────────┬───────┘
│ │ │
┌────────▼────────┐ ┌──────▼───────┐ ┌────────▼────────┐
│ F035.1 │ │ F035.2 │ │ F035.3 │
│ Event Bus │ │ Causal │ │ Behavioral │
│ Stats │ │ Chains │ │ Drift │
│ │ │ │ │ Detection │
│ "Is the system │ │ "Why did │ │ "Is the system │
│ healthy now?" │ │ this happen?"│ │ changing?" │
└─────────────────┘ └──────────────┘ └─────────────────┘
▲ ▲ ▲
│ │ │
┌────────┴──────────────────┴──────────────────┴───────┐
│ Event Bus (F006) │
│ All events flow through here │
└──────────────────────────────────────────────────────┘
```

**Layer 1 — Event Bus Stats (F035.1):** Real-time operational health. Event throughput, handler success/fail rates, queue depth. "Is the system healthy right now?"

**Layer 2 — Causal Chain Tracing (F035.2):** Every autonomous action gets a `caused_by` link back to its trigger. Queryable audit trail. "Why did Nous do X?"

**Layer 3 — Behavioral Drift Detection (F035.3):** Periodic snapshots of key metrics with trend analysis. Catches slow drift that individual events don't reveal. "Is the system changing in ways nobody intended?"

---

## Sub-Specs

| Spec | Title | Priority | Depends On |
|------|-------|----------|------------|
| F035.1 | Event Bus Observability | P1 | F006 |
| F035.2 | Autonomous Action Audit Trail | P1 | F035.1 |
| F035.3 | Behavioral Drift Detection | P2 | F035.2 |

**Sequencing rationale:** F035.1 gives us the infrastructure (stats collection, endpoints). F035.2 adds causal metadata to events. F035.3 builds on both to detect trends. Each is independently useful.

---

## What This Supersedes

- **006.1 (Event Bus Observability)** — F035.1 absorbs and modernizes this planned spec. The original 006.1 was scoped before heartbeat, sleep consolidation, and self-tuning existed. F035.1 covers the same ground but accounts for the current architecture.

---

## Success Criteria

1. After any autonomous action, you can trace the full causal chain back to the originating trigger
2. Handler health (success/fail rates) is visible in real-time via API and Telegram
3. Behavioral trends (fact growth rate, censor changes, check frequency drift) are tracked and anomalies flagged
4. The dashboard has an "Autonomous Activity" panel showing recent self-modifications
5. None of this adds measurable latency to the hot path (event processing)
242 changes: 242 additions & 0 deletions docs/features/F035.1-event-bus-observability.md
# F035.1 — Event Bus Observability

**Status:** PROPOSED
**Priority:** P1 (foundation — F035.2 and F035.3 build on this)
**Depends on:** F006 (Event Bus), F035 (Observability umbrella)
**Supersedes:** 006.1 (Event Bus Observability — planned, never built)
**Author:** Nous + Tim
**Date:** 2026-04-04

---

## Problem

The event bus processes all system events but offers almost no operational visibility. The only insight into event processing today is:

- `bus.pending` — queue depth (single int)
- `logger` calls scattered across handlers
- Raw rows in `nous_system.events` (unindexed, no aggregation)

You can't answer basic questions:
- How many events processed in the last hour?
- Which handlers are failing? How often?
- Is the queue backing up?
- When did the sleep handler last run successfully?
- Are any handlers silently broken?

---

## Goals

1. **In-memory stats** — event counts by type, handler success/fail/timing, queue health
2. **Ring buffer** — last N events for quick inspection without DB queries
3. **REST endpoints** — `/events/stats` and `/events/recent` for API/dashboard access
4. **Telegram integration** — event bus health in `/status` output
5. **Handler-level stats** — per-handler invocation counts, error rates, avg latency
6. **Zero hot-path cost** — counter increments and an O(1) ring-buffer append only; no DB writes on the event path

---

## Design

### 1. EventBusStats Class

New class in `nous/events.py`. Purely in-memory — no DB writes on the hot path.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass

@dataclass
class HandlerStat:
name: str
invocations: int = 0
successes: int = 0
errors: int = 0
last_invoked: float | None = None # time.monotonic()
last_error: float | None = None
last_error_msg: str | None = None
total_duration_ms: float = 0.0

@dataclass
class RecentEvent:
type: str
timestamp: str # ISO format
handlers_invoked: int
handlers_failed: int
duration_ms: float
session_id: str | None = None

class EventBusStats:
def __init__(self, recent_limit: int = 100):
self._event_counts: dict[str, int] = defaultdict(int)
self._handler_stats: dict[str, HandlerStat] = {}
self._recent: deque[RecentEvent] = deque(maxlen=recent_limit)
self._total_processed: int = 0
self._total_dropped: int = 0
self._started_at: float = time.monotonic()
```

**Key methods:**
- `record_event(event, handlers_invoked, handlers_failed, duration_ms)` — called after dispatch
- `record_handler_success(handler_name, duration_ms)` — called per handler
- `record_handler_error(handler_name, error_msg)` — called on handler failure
- `record_drop()` — called when queue is full
- `to_dict()` → serializable stats for API
- `recent_events(limit)` → newest-first list from ring buffer

### 2. EventBus Wiring

Modify `EventBus` to populate stats:

- **`__init__`** — create `self.stats = EventBusStats()`
- **`emit()`** — on `QueueFull`, call `self.stats.record_drop()`
- **`_dispatch()`** — time the full dispatch, count handler results, call `record_event()`
- **`_safe_handle()`** — return `bool` (success/fail), call `record_handler_success()` or `record_handler_error()` with timing

The existing error isolation behavior is unchanged — `_safe_handle` still catches all exceptions. The only addition is recording the outcome.

### 3. REST Endpoints

**`GET /events/stats`**
```json
{
"uptime_seconds": 3600,
"total_processed": 142,
"total_dropped": 0,
"queue_depth": 0,
"event_counts": {
"turn_completed": 85,
"heartbeat_tick": 24,
"episode_completed": 8,
"heartbeat_triage": 3,
"sleep_completed": 1
},
"handlers": {
"SessionTimeoutMonitor.on_activity": {
"invocations": 85,
"successes": 85,
"errors": 0,
"error_rate": 0.0,
"last_invoked_ago_s": 120,
"avg_duration_ms": 0.1
},
"EpisodeSummarizer.handle": {
"invocations": 12,
"successes": 8,
"errors": 4,
"error_rate": 0.33,
"last_error_ago_s": 600,
"last_error_msg": "LLM call failed: 429",
"avg_duration_ms": 2500.0
}
}
}
```

**`GET /events/recent?limit=20`**
Returns from in-memory ring buffer (fast, no DB). Newest first.

```json
{
"events": [
{
"type": "turn_completed",
"timestamp": "2026-04-04T17:30:00Z",
"handlers_invoked": 2,
"handlers_failed": 0,
"duration_ms": 3.2
}
],
"source": "memory",
"count": 20
}
```
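The web framework isn't specified here, so a framework-agnostic sketch of building the `/events/recent` payload from the ring buffer (function name is illustrative):

```python
# Sketch of shaping the /events/recent response from the in-memory deque.
# Newest first, no DB access.
from collections import deque

def recent_payload(ring: deque, limit: int = 20) -> dict:
    events = list(ring)[-limit:][::-1]  # take the tail, reverse to newest-first
    return {"events": events, "source": "memory", "count": len(events)}

ring = deque(maxlen=100)
ring.append({"type": "turn_completed", "duration_ms": 3.2})
ring.append({"type": "heartbeat_tick", "duration_ms": 0.4})
```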

### 4. Handler Stats Exposure

Each major handler gets a `get_stats()` method returning its internal state:

**SessionTimeoutMonitor:**
- tracked_sessions count + per-session idle time
- sleep_emitted flag
- global idle seconds

**SleepHandler:**
- total_sleeps, last_sleep_at, last_phases_completed
- currently_sleeping flag, last_interrupted

**HeartbeatRunner:**
- total_ticks, total_findings, checks_run_by_name
- last_tick_at, currently_running

These are included in the `/events/stats` response under their own keys.
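The `get_stats()` pattern is the same for each handler — a plain dict of internal state. A sketch for `SleepHandler`, with field names taken from the list above (the class body itself is an assumption):

```python
# Sketch of the get_stats() pattern using SleepHandler's listed fields.
class SleepHandler:
    def __init__(self) -> None:
        self.total_sleeps = 0
        self.last_sleep_at: str | None = None
        self.last_phases_completed = 0
        self.currently_sleeping = False
        self.last_interrupted = False

    def get_stats(self) -> dict:
        """Read-only snapshot of internal state for /events/stats."""
        return {
            "total_sleeps": self.total_sleeps,
            "last_sleep_at": self.last_sleep_at,
            "last_phases_completed": self.last_phases_completed,
            "currently_sleeping": self.currently_sleeping,
            "last_interrupted": self.last_interrupted,
        }
```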

### 5. Telegram Integration

Extend `/status` output with event bus section:

```
📡 Event Bus
142 events processed, 0 dropped
Queue: 0 pending | Uptime: 6h 30m

Handlers:
✅ SessionMonitor: 85/85
⚠️ EpisodeSummarizer: 8/12 (33% errors)
✅ FactExtractor: 7/8
✅ SleepHandler: 1/1
✅ HeartbeatRunner: 24 ticks, 3 findings
```

Error flag (⚠️) shown when error_rate > 10%.

### 6. Dashboard Panel (optional, P2)

New panel in the Memory Dashboard: "System Health"
- Event throughput chart (events/minute over last hour)
- Handler health table (success rate, avg latency)
- Queue depth gauge
- Reuses `/events/stats` endpoint

---

## Files Changed

| File | Change | ~Lines |
|------|--------|--------|
| `nous/events.py` | `EventBusStats`, `HandlerStat`, `RecentEvent` + wiring | +100 |
| `nous/api/rest.py` | `GET /events/stats`, `GET /events/recent` | +60 |
| `nous/handlers/session_monitor.py` | `get_stats()` | +15 |
| `nous/handlers/sleep_handler.py` | tracking fields + `get_stats()` | +20 |
| `nous/heartbeat/runner.py` | `get_stats()` | +15 |
| `nous/telegram_bot.py` | event bus section in `/status` | +30 |
| `tests/test_event_bus_observability.py` | **NEW** | +200 |
| **Total** | | ~440 |

---

## Design Decisions

| # | Decision | Rationale |
|---|----------|-----------|
| D1 | In-memory stats, not DB | Hot-path cost must be zero. Counter increments are O(1). |
| D2 | Ring buffer (deque, maxlen=100) | Fixed memory (~10KB), O(1) append, no unbounded growth. |
| D3 | `time.monotonic()` for timing | Wall clock can shift; monotonic doesn't. "Ago" times are relative. |
| D4 | `_safe_handle` returns bool | Lets `_dispatch` count failures without changing error isolation. |
| D5 | Stats keyed by `__qualname__` | Unique per handler method, human-readable in API output. |
| D6 | Telegram reads from REST API | Single source of truth. `/events/stats` serves both dashboard and bot. |

---

## Acceptance Criteria

- [ ] `EventBusStats` collects event counts, handler stats, recent events ring buffer
- [ ] `EventBus._dispatch` and `_safe_handle` populate stats on every event
- [ ] Queue-full events increment `total_dropped`
- [ ] `GET /events/stats` returns full stats with handler breakdown
- [ ] `GET /events/recent` returns from ring buffer, newest first
- [ ] `SessionTimeoutMonitor.get_stats()` exposes tracked sessions + idle times
- [ ] `SleepHandler.get_stats()` exposes sleep history + current state
- [ ] `HeartbeatRunner.get_stats()` exposes tick/finding counts
- [ ] Telegram `/status` includes event bus health summary
- [ ] Handler errors flagged with ⚠️ when error_rate > 10%
- [ ] No measurable latency impact on event processing
- [ ] All tests pass (unit + integration)