Bug: Working memory never cleaned up + sleep still not firing after PR #161 #166

@tfatykhov

Problem

Sleep has still never fired (0 sleep_reflection facts), even though PR #161 was merged and the container was restarted.

Root Cause Analysis

Issue 1: Working Memory Accumulates Forever

heart.working_memory had 211 entries dating back to Feb 24 — test sessions, subtasks, debug sessions, backfill jobs. These are never cleaned up.

While the session monitor uses an in-memory dict (not the DB table) for tracking, the /status endpoint reads from this table and reports all entries as "active sessions." This is misleading and makes debugging harder.

Fix needed: TTL-based cleanup. Delete working_memory entries older than N hours (e.g. 24h). Could be:

  • Periodic task in session_monitor
  • Part of the sleep handler
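  • DB-level: a scheduled cleanup job, or ON DELETE CASCADE if entries reference a parent row that is deleted when the session ends (CASCADE alone does not expire rows by age)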
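
A minimal sketch of the TTL cleanup, for illustration only. It assumes the table has an updated_at column (as the workaround query below suggests) and uses an in-memory SQLite table as a stand-in for heart.working_memory so the example runs on its own; the session_id column and the row values are hypothetical. The real method would run the equivalent DELETE through whatever database layer the project already uses.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def cleanup_stale_working_memory(conn, max_age_hours=24, now=None):
    """Delete rows whose updated_at is older than the cutoff; return the count."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(hours=max_age_hours)).isoformat()
    cur = conn.execute(
        "DELETE FROM working_memory WHERE updated_at < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount


# Stand-in table; the real one is heart.working_memory in the project's DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE working_memory (session_id TEXT, updated_at TEXT)")

# Arbitrary fixed timestamp so the demo is deterministic.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
rows = [
    ("old-test-session", (now - timedelta(days=5)).isoformat()),
    ("old-subtask", (now - timedelta(hours=30)).isoformat()),
    ("recent-session", (now - timedelta(hours=1)).isoformat()),
]
conn.executemany("INSERT INTO working_memory VALUES (?, ?)", rows)

deleted = cleanup_stale_working_memory(conn, max_age_hours=24, now=now)
remaining = [r[0] for r in conn.execute("SELECT session_id FROM working_memory")]
```

Passing now in explicitly keeps the function testable. Returning the deleted row count gives the caller something to log on each cleanup pass.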

Issue 2: Session Lifecycle - No Explicit Close

Sessions are never explicitly ended. The timeout path works like this:

  1. User sends message → message_received event → monitor tracks in _last_activity
  2. User stops talking
  3. After 30 min (session_idle_timeout): monitor calls cognitive.end_session() → removes from _last_activity
  4. After 2h global idle (sleep_timeout): monitor emits sleep_started

But: The monitor only tracks sessions it has seen events for since the last restart. On restart, _last_activity is empty and _global_last_activity is set to time.monotonic(), i.e. the monotonic clock reading at startup. If no messages come in, global_idle therefore starts counting from the restart time, not from the last real activity.

So after restart, sleep should fire after 2h of silence. But Tim reports 8+ hours of inactivity with no sleep.
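
The expected behavior can be modeled with a small sketch. This is not the actual session monitor: the IdleMonitor class, its method names, the injected clock, and the _asleep flag are all hypothetical, and only the two timeouts and the _last_activity / _global_last_activity names come from the description above. It shows that a freshly restarted monitor with no events should still emit sleep_started exactly 2h after startup.

```python
import time


class IdleMonitor:
    """Hypothetical stand-in for the session monitor's timeout bookkeeping."""

    def __init__(self, session_idle_timeout=30 * 60, sleep_timeout=2 * 60 * 60,
                 clock=time.monotonic):
        self._clock = clock
        self._session_idle_timeout = session_idle_timeout
        self._sleep_timeout = sleep_timeout
        self._last_activity = {}              # empty after every restart
        self._global_last_activity = clock()  # idle clock starts at startup
        self._asleep = False                  # assumption: fire sleep only once

    def on_event(self, session_id):
        """Record activity (e.g. message_received); resets both timers."""
        now = self._clock()
        self._last_activity[session_id] = now
        self._global_last_activity = now
        self._asleep = False

    def check_timeouts(self):
        """Return the actions that are due as (action, session_id) tuples."""
        now = self._clock()
        actions = []
        for session_id, last in list(self._last_activity.items()):
            if now - last >= self._session_idle_timeout:
                del self._last_activity[session_id]
                actions.append(("end_session", session_id))
        idle = now - self._global_last_activity
        if not self._asleep and idle >= self._sleep_timeout:
            self._asleep = True
            actions.append(("sleep_started", None))
        return actions


class FakeClock:
    """Manually advanced clock so the demo does not have to wait."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
monitor = IdleMonitor(clock=clock)   # simulates a restart with no events
clock.now = 2 * 60 * 60 - 1
before = monitor.check_timeouts()    # one second short of the sleep timeout
clock.now = 2 * 60 * 60
after = monitor.check_timeouts()     # sleep timeout reached
```

If the real monitor follows this model and sleep still does not fire after 8+ hours, something is either resetting _global_last_activity or failing after sleep_started is emitted, which is what the causes below explore.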

Possible Remaining Causes

  1. Container not running the fix — No version endpoint to verify. The fix is on main (ba4def0) but the deployed container may be on an older commit.
  2. Something emitting events internally — If any internal process (e.g. subtask scheduler, background task) emits turn_completed or message_received, it resets the timers.
  3. Crash in sleep handler — If sleep_started fires but the handler crashes, sleep_reflection facts are never created and it looks like sleep never happened.
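
To tell cause 3 apart from the other two, the handler can be wrapped so that entry, exit, and crashes all leave a log line. The sketch below is a generic decorator for async handlers. The logged_handler and on_sleep_started names, the "sleep" logger name, and the event shape are assumptions for illustration, not the project's actual handler.

```python
import asyncio
import functools
import logging

logger = logging.getLogger("sleep")


def logged_handler(func):
    """Log entry, exit, and any exception of an async event handler."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        logger.info("%s: entered", func.__name__)
        try:
            result = await func(*args, **kwargs)
        except Exception:
            # logger.exception records the traceback, then we re-raise so
            # the caller's own error handling is unchanged.
            logger.exception("%s: crashed", func.__name__)
            raise
        logger.info("%s: completed", func.__name__)
        return result

    return wrapper


@logged_handler
async def on_sleep_started(event):
    # Hypothetical handler body; the real one would write sleep_reflection facts.
    if event.get("fail"):
        raise RuntimeError("reflection failed")
    return "reflected"


result = asyncio.run(on_sleep_started({}))
```

With this in place, an "entered" line without a matching "completed" line points at a crash in the handler, and no "entered" line at all means sleep_started never reached it.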

Recommended Fixes

  1. Add working_memory TTL cleanup to session monitor _check_timeouts():

    # Clean working_memory entries older than 24h
    await self._heart.cleanup_stale_working_memory(max_age_hours=24)
  2. Add /status version info — expose git SHA or version so we can verify deployments:

    {"version": "0.2.0", "commit": "ba4def0", "started_at": "..."}
  3. Add sleep handler logging — ensure sleep_started handler logs entry/exit so we can verify it ran.

  4. Consider hydrating _last_activity from DB on startup — query working_memory updated_at to pre-populate with real timestamps instead of starting empty.
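
For fix 4, one wrinkle is that updated_at is a wall-clock timestamp while the monitor compares time.monotonic() readings, so the stored timestamps need converting to an age first. A minimal sketch, assuming rows arrive as (session_id, updated_at) pairs with timezone-aware datetimes; the hydrate_last_activity function and the sample rows are hypothetical.

```python
import time
from datetime import datetime, timedelta, timezone


def hydrate_last_activity(rows, now_wall=None, now_mono=None):
    """Map (session_id, updated_at) rows onto the monotonic clock.

    Returns (last_activity, global_last_activity). Values may be negative
    when a row is older than the process's monotonic clock reading; that is
    fine because only differences between readings are ever compared.
    """
    now_wall = now_wall or datetime.now(timezone.utc)
    now_mono = time.monotonic() if now_mono is None else now_mono
    last_activity = {}
    for session_id, updated_at in rows:
        # Clamp to zero in case of clock skew putting updated_at in the future.
        age = max((now_wall - updated_at).total_seconds(), 0.0)
        last_activity[session_id] = now_mono - age
    # With no rows, fall back to the current behavior (idle starts at startup).
    global_last = max(last_activity.values(), default=now_mono)
    return last_activity, global_last


# Deterministic demo with arbitrary values.
now_wall = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
rows = [
    ("s1", now_wall - timedelta(hours=3)),
    ("s2", now_wall - timedelta(minutes=10)),
]
last_activity, global_last = hydrate_last_activity(
    rows, now_wall=now_wall, now_mono=10_000.0
)
```

Sessions hydrated with an age beyond session_idle_timeout would be ended on the first timeout check after startup. This approach also depends on the TTL cleanup from fix 1, since otherwise every stale row would be loaded as a session.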

Manual Workaround Applied

Cleaned up 209 stale working_memory entries via SQL:

DELETE FROM heart.working_memory WHERE updated_at < NOW() - INTERVAL '24 hours';

Related: #160, PR #161
