
Episode Memories: custom eval for episode-level retrieval #191

@abbudjoe


Context

Episode Memories (Issue #190) introduces write-time narrative synthesis: it detects episodes across sessions and generates progressive summaries that surface via standard vector search.

Task

As part of Phase 4 (MCP Tools & Polish), create a custom evaluation suite for episode-level retrieval:

Scenarios to test:

  1. House-hunting scenario (from design doc) — 5 sessions, cross-session activity, aggregation query ("How many properties before Brookside?")
  2. Job search scenario — applications, interviews, offers across sessions
  3. Debugging episode — production incident spanning multiple sessions with different services
  4. Trip planning — flights, hotels, activities discussed across sessions
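One way to keep the four scenarios uniform is to describe each as a small fixture. The sketch below is an assumption, not the project's actual schema: `Session`, `EpisodeScenario`, `validate_scenario`, and the id `episode-house-hunting` are hypothetical names, and the session messages are placeholders rather than the design doc's real transcripts.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """One conversation session that contributes memories to an episode."""
    session_id: str
    messages: list[str]


@dataclass
class EpisodeScenario:
    """Hypothetical fixture describing a single eval scenario."""
    name: str
    sessions: list[Session]
    query: str                 # aggregation query issued against recall
    expected_episode_id: str   # assumed id of the episode summary memory
    constituent_ids: set[str] = field(default_factory=set)


def validate_scenario(scenario: EpisodeScenario) -> list[str]:
    """Return a list of problems; an empty list means the fixture is usable."""
    problems = []
    if len(scenario.sessions) < 2:
        problems.append("episode must span at least two sessions")
    if not scenario.query.strip():
        problems.append("query is empty")
    ids = [s.session_id for s in scenario.sessions]
    if len(ids) != len(set(ids)):
        problems.append("duplicate session ids")
    return problems


# House-hunting scenario: 5 sessions, one aggregation query.
# Message text is a placeholder, not real eval data.
house_hunting = EpisodeScenario(
    name="house-hunting",
    sessions=[
        Session(f"s{i}", [f"placeholder message for session {i}"])
        for i in range(1, 6)
    ],
    query="How many properties before Brookside?",
    expected_episode_id="episode-house-hunting",
)
```

The other three scenarios would follow the same shape, differing only in sessions, query, and expected ids.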

Metrics:

  • Episode summary surfaces in top-3 recall results for aggregation queries
  • Constituent memories are reachable via graph expansion from episode summary
  • Queries that don't embed near individual memories DO embed near episode summaries
  • Progressive summary accuracy vs full regeneration (drift detection)
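The first three metrics can be sketched as pure functions over recall output, so they stay independent of the storage backend. Everything here is an assumption for illustration: the helper names are hypothetical, the graph is modeled as a plain adjacency dict, embeddings are plain lists of floats, and the 2-hop expansion limit is an arbitrary default.

```python
import math
from collections import deque


def surfaces_in_top_k(ranked_ids, episode_id, k=3):
    """True if the episode summary appears in the first k recall results."""
    return episode_id in ranked_ids[:k]


def reachable_from(graph, start, max_hops=2):
    """Breadth-first expansion: ids reachable from start within max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(start)
    return seen


def constituents_reachable(graph, episode_id, constituent_ids, max_hops=2):
    """True if every constituent memory is reachable from the episode summary."""
    return set(constituent_ids) <= reachable_from(graph, episode_id, max_hops)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prefers_episode(query_vec, episode_vec, memory_vecs):
    """True if the query embeds closer to the episode summary than to any single memory."""
    best_memory = max((cosine(query_vec, m) for m in memory_vecs), default=-1.0)
    return cosine(query_vec, episode_vec) > best_memory
```

Drift detection is left out because it depends on how summary accuracy is scored, which the issue does not specify.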

Success criteria:

  • All scenarios pass with episode summaries surfacing correctly
  • No regression on existing LoCoMo/LongMemEval benchmarks
  • Latency: remember() adds less than 10 ms of synchronous overhead (asynchronous episode detection is not bounded by this budget)
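The latency criterion could be checked by timing the same payloads through a baseline path and an episode-enabled path, then comparing medians. This is only a sketch: `baseline_fn` and `episode_fn` stand in for whatever callables wrap `remember()` with episode detection off and on, and their single-argument signature is an assumption. Using the median rather than the mean is a design choice to damp outliers from GC pauses or scheduler noise.

```python
import statistics
import time


def median_latency_ms(fn, payloads):
    """Median wall-clock time of fn(payload) in milliseconds."""
    samples = []
    for payload in payloads:
        start = time.perf_counter()
        fn(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def added_overhead_ms(baseline_fn, episode_fn, payloads):
    """Extra synchronous latency of episode_fn over baseline_fn, in milliseconds."""
    return median_latency_ms(episode_fn, payloads) - median_latency_ms(baseline_fn, payloads)


def within_budget(baseline_fn, episode_fn, payloads, budget_ms=10.0):
    """True if the added synchronous overhead stays under the budget."""
    return added_overhead_ms(baseline_fn, episode_fn, payloads) < budget_ms
```

Only the synchronous portion of the call is measured this way; background detection work would need separate instrumentation.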

Part of

Episode Memories epic (#190), Phase 4


Labels: enhancement
