
# NIMA Benchmark Suite

Four benchmarks covering different aspects of NIMA's cognitive architecture.

## Benchmarks

### 1. `synthetic/` — R@K Retrieval Quality (MemPalace-style)

- **What it tests:** Pure retrieval — "is the correct memory in the top K results?"
- **Metrics:** Recall@1, Recall@5, Recall@10, Recall@20, MRR
- **Independent of:** LLM synthesis, emotional context
- **Why it matters:** Foundation layer — if retrieval fails, nothing else works.

**Run:**

```bash
cd ~/.openclaw/workspace/PROJECTS/nima-bench
python3 -c "from synthetic.benchmark import TESTS, run_synthetic_benchmark; run_synthetic_benchmark()"
```
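For reference, Recall@K and MRR are simple to compute. The sketch below is independent of NIMA's actual benchmark code; `ranked_ids` and `correct_id` are hypothetical names standing in for a retriever's ranked output and the ground-truth memory.

```python
def recall_at_k(ranked_ids, correct_id, k):
    """1.0 if the correct memory appears in the top-k results, else 0.0."""
    return 1.0 if correct_id in ranked_ids[:k] else 0.0

def mrr(ranked_ids, correct_id):
    """Reciprocal rank of the correct memory (0.0 if it is absent)."""
    for rank, mem_id in enumerate(ranked_ids, start=1):
        if mem_id == correct_id:
            return 1.0 / rank
    return 0.0

# Toy example: the correct memory "m3" is ranked second.
ranked = ["m7", "m3", "m1", "m9"]
print(recall_at_k(ranked, "m3", 1))  # 0.0
print(recall_at_k(ranked, "m3", 5))  # 1.0
print(mrr(ranked, "m3"))             # 0.5
```

Averaging these per-query scores over the whole test set gives the aggregate Recall@K and MRR figures the benchmark reports.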

### 2. `speed_benchmark/` — SQLite `vec` Extension Performance

- **What it tests:** How fast NIMA retrieves at scale
- **Metrics:** Latency (ms) at 1K, 10K, and 100K memories
- **Tests:** Vector-only, vector + graph, full RRF pipeline
- **Why it matters:** Personal AI needs to be fast: 100 ms or less per query.

**Run:**

```bash
python3 speed_benchmark/benchmark.py --scales 1000 10000 100000
```
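The core of a latency benchmark like this is repeated wall-clock timing with `time.perf_counter`. A minimal sketch — not NIMA's actual benchmark code, and `fake_query` is a hypothetical stand-in for the real retrieval call:

```python
import statistics
import time

def median_latency_ms(run_query, queries, repeats=3):
    """Median wall-clock latency (ms) over repeated runs of each query."""
    samples = []
    for q in queries:
        for _ in range(repeats):
            t0 = time.perf_counter()
            run_query(q)  # the retrieval call under test
            samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)

# Hypothetical stand-in for a vector-search call.
def fake_query(q):
    return [i * i for i in range(1000)]

ms = median_latency_ms(fake_query, ["q1", "q2", "q3"])
print(f"median latency: {ms:.2f} ms")
```

Reporting the median (or a high percentile) rather than the mean keeps one slow outlier from distorting the per-scale numbers.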

### 3. `emotional_synthesis/` — Emotional Context Understanding

- **What it tests:** Does NIMA understand the *emotional* context of memories?
- **Metrics:** Human-judged 1–5 ratings on emotional context, factual accuracy, and synthesis depth
- **What makes NIMA different:** No other memory system measures this
- **Why it matters:** Raw retrieval wins on synthetic benchmarks; emotional synthesis is NIMA's edge.

**Run (interactive):**

```bash
python3 emotional_synthesis/benchmark.py
```
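Because scoring is human-judged, the harness mostly needs to validate and aggregate 1–5 ratings per dimension. A minimal sketch — the dimension names are assumed from the metrics above, not taken from NIMA's code:

```python
# Assumed rating dimensions (from the metrics listed above).
DIMENSIONS = ("emotional_context", "factual_accuracy", "synthesis_depth")

def mean_rating(ratings):
    """Validate 1-5 human ratings and return their mean for one answer."""
    for dim in DIMENSIONS:
        if not 1 <= ratings[dim] <= 5:
            raise ValueError(f"{dim} must be rated 1-5, got {ratings[dim]}")
    return sum(ratings[dim] for dim in DIMENSIONS) / len(DIMENSIONS)

score = mean_rating(
    {"emotional_context": 4, "factual_accuracy": 5, "synthesis_depth": 3}
)
print(score)  # 4.0
```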

### 4. `personal_continuity/` — Coherent Model of David

- **What it tests:** Does NIMA maintain a consistent, evolving model of David?
- **Metrics:** Human-judged ratings on consistency, accuracy, and drift detection
- **What it probes:** Relationships, belief evolution, stale understanding
- **Why it matters:** A good memory system remembers who you *are*, not just what you said.

**Run (interactive):**

```bash
python3 personal_continuity/benchmark.py
```
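Drift detection can be illustrated by diffing successive snapshots of the user model. This is only an illustration — the snapshot shape (a flat dict of attributes) is invented for the example, not NIMA's actual data model:

```python
def drift(prev, curr):
    """List attributes whose value changed between two model snapshots."""
    return [
        (key, prev[key], curr[key])
        for key in sorted(prev.keys() & curr.keys())
        if prev[key] != curr[key]
    ]

# Two hypothetical snapshots of the model taken weeks apart.
old = {"role": "engineer", "city": "Berlin"}
new = {"role": "founder", "city": "Berlin"}
print(drift(old, new))  # [('role', 'engineer', 'founder')]
```

A flagged change is not necessarily an error: "belief evolution" is legitimate drift, while "stale understanding" is the system failing to drift when David has.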

## Unified Runner

```bash
# List all benchmarks
python3 runner.py --list

# Run everything
python3 runner.py --all

# Run specific benchmarks
python3 runner.py --speed --synthetic
python3 runner.py --emotional --continuity

# Save results
python3 runner.py --all --output ./results/2026-04-10
```
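The flag handling above could be built with `argparse`. This is a sketch assuming the flag names shown in the examples; the real `runner.py` may be structured differently:

```python
import argparse

# Benchmark names assumed from the CLI examples above.
BENCHMARKS = ("speed", "synthetic", "emotional", "continuity")

def parse_args(argv=None):
    p = argparse.ArgumentParser(description="NIMA benchmark runner (sketch)")
    p.add_argument("--list", action="store_true", help="list available benchmarks")
    p.add_argument("--all", action="store_true", help="run every benchmark")
    for name in BENCHMARKS:
        p.add_argument(f"--{name}", action="store_true",
                       help=f"run the {name} benchmark")
    p.add_argument("--output", help="directory to write results into")
    return p.parse_args(argv)

def selected(args):
    """Resolve which benchmarks the flags select, in suite order."""
    if args.all:
        return list(BENCHMARKS)
    return [name for name in BENCHMARKS if getattr(args, name)]

args = parse_args(["--speed", "--synthetic"])
print(selected(args))  # ['speed', 'synthetic']
```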

## Key Insight

MemPalace proved that raw verbatim storage plus embeddings beats LLM-extracted structured memories on synthetic benchmarks.

NIMA's cognitive layers (affect, synthesis) aren't tested by R@K benchmarks. They're tested by the `emotional_synthesis` and `personal_continuity` benchmarks, which require human judgment.

This is NIMA's frontier. Push there.