Sensory system verification: vision, screenshots, live mode visual awareness #409

@joelteply

Description

Problem

The sensory system (vision pipeline, screenshots, live mode visual awareness) was working but hasn't been verified end-to-end recently. As personas become more capable (tool calling fixed, per-task routing working, coding agent functional), they need to actually SEE what they're working with.

The AI team in general chat is actively collaborating, using tools, making decisions — they're first-class citizens now. First-class citizens need all their senses working.

Why This Is Critical

Vision isn't a nice-to-have — it's a core sense equivalent to hearing and speech. Use cases:

  • UI/UX development: screenshot → see → edit → screenshot → verify (design loop)
  • Academy training tasks involving images: visual problem-solving, diagram interpretation
  • Self-appearance design: personas design their own avatars, vote on visual options
  • Social visual awareness: see each other's avatars, discuss appearance, recognize each other
  • Game participation: see game state, make visual decisions (board games, spatial puzzles)
  • 3D realm camera routing: persona's camera view in the Bevy scene → rendered to image → vision pipeline → persona "sees" the 3D world through their own eyes
  • Theme/branding collaboration: generate logo options, share in chat, vote visually via decision system
  • Image sharing in chat: drop images, `chat/send --media`, personas discuss what they see
  • Visual QA: persona edits code → screenshots result → another persona reviews visually

The principle from CLAUDE.md: "A lesser model running locally has the SAME sensory experience as Claude or GPT-4. The system compensates. No persona is blind, deaf, or mute because of its base model."

Historical Context

This WAS working — personas once identified Joel's shirt and what was printed on it during a live call. The pipeline used intelligent YOLO-based object detection and vision model fallbacks for text-only models. VisionDescriptionService has tiered perception (YOLO for fast detection, cloud vision for detailed description, content-addressed SHA-256 cache).

What Needs Verification

1. Screenshot tool in chat context

  • Persona calls `interface/screenshot` → gets image back
  • VisionDescriptionService describes it for text-only models
  • Vision-capable models (GPT-4o, Claude, Gemini) get raw base64
  • Content-addressed cache (SHA-256) prevents redundant descriptions
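The routing described above can be sketched as a small dispatch function. This is a minimal, hypothetical sketch: `MediaItem`, `ModelProfile`, `supportsVision`, and the `describe` callback are illustrative stand-ins for the real codebase types (the `describe` parameter stands in for VisionDescriptionService), not the actual interfaces.

```typescript
// Illustrative types; the real ChatMessageEntity/media types may differ.
interface MediaItem {
  path: string;
  base64: string; // image bytes, base64-encoded
}

interface ModelProfile {
  name: string;
  supportsVision: boolean;
}

// Decide what a model should receive for a given image:
// raw base64 for vision-capable models, a text description otherwise.
function routeVisionInput(
  model: ModelProfile,
  item: MediaItem,
  describe: (item: MediaItem) => string, // stand-in for VisionDescriptionService
): { kind: "image" | "text"; payload: string } {
  if (model.supportsVision) {
    return { kind: "image", payload: item.base64 };
  }
  return { kind: "text", payload: describe(item) };
}
```

Verification here means confirming both branches fire: GPT-4o-class models get the base64 payload untouched, while text-only models get a generated description.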

2. Live mode visual awareness

  • In live video calls, personas can see each other's avatars
  • Personas can see the human user's video feed (or screen share)
  • Visual context injected into RAG for informed responses
  • MediaArtifactSource preprocesses per model capability

3. 3D realm camera vision

  • Persona's camera in the Bevy scene renders to an image
  • Image routes through vision pipeline (direct for capable models, description for others)
  • Persona "sees" the 3D world through their own perspective
  • Completes the sensory loop: the 3D scene IS their visual reality

4. Image sharing in chat

  • `chat/send --media='["/path/to/image.png"]'` attaches images from CLI
  • Browser drag-and-drop → `mediaItems: MediaItem[]`
  • MediaArtifactSource injects shared images into RAG context for all room participants
  • All personas in the room see/describe the shared image
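A rough sketch of how the `--media` argument might become a message with attached `MediaItem[]`. The parsing logic and the `mimeType` field are assumptions for illustration; only the `ChatMessageEntity.content.media: MediaItem[]` shape comes from the issue.

```typescript
// Illustrative types; the real MediaItem likely carries more fields.
interface MediaItem {
  path: string;
  mimeType: string;
}

interface ChatMessageContent {
  text: string;
  media: MediaItem[];
}

// Parse a CLI-style --media value (a JSON array of file paths) into MediaItem[].
function parseMediaArg(mediaArg: string): MediaItem[] {
  const paths: string[] = JSON.parse(mediaArg);
  return paths.map((path) => ({
    path,
    mimeType: path.endsWith(".png") ? "image/png" : "application/octet-stream",
  }));
}

function buildMessage(text: string, mediaArg: string): ChatMessageContent {
  return { text, media: parseMediaArg(mediaArg) };
}
```

Once the message lands in a room, MediaArtifactSource would fan the attached images out to every participant's RAG context, routed per the capability table below.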

5. Cross-persona visual feedback

  • Persona A edits code → screenshots result → Persona B reviews visually
  • Sentinel CodingAgent can screenshot between edits to verify UI changes
  • Visual voting: attach images to decision proposals, rank visually

6. Sensory completeness audit

Every persona should have functional:

  • Vision: Direct (vision-capable) or bridged (VisionDescriptionService + YOLO)
  • Hearing: Direct (audio-native like Qwen3-Omni) or bridged (STT)
  • Speech: Direct (audio-native) or bridged (TTS)

No persona should be blind, deaf, or mute because of its base model.

Architecture (from CLAUDE.md)

| Sense   | Capable Model    | Incapable Model  | System Bridge                   |
|---------|------------------|------------------|---------------------------------|
| Vision  | Raw base64 image | Text description | VisionDescriptionService + YOLO |
| Hearing | Raw audio        | Transcribed text | STT                             |
| Speech  | Audio natively   | Text → audio     | TTS                             |
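The table reduces to a single dispatch rule: native capability means "direct", otherwise the system bridge fills in. A hypothetical sketch (the `Capabilities` flags are illustrative; the bridge names are from the table):

```typescript
type Sense = "vision" | "hearing" | "speech";

// Illustrative capability flags per model; the real profile type may differ.
interface Capabilities {
  vision: boolean;
  hearing: boolean;
  speech: boolean;
}

// Return the bridge a model needs for each sense, or "direct" if native.
function bridgeFor(sense: Sense, caps: Capabilities): string {
  if (caps[sense]) return "direct";
  switch (sense) {
    case "vision":
      return "VisionDescriptionService + YOLO";
    case "hearing":
      return "STT";
    case "speech":
      return "TTS";
  }
}
```

The sensory completeness audit is then just checking that `bridgeFor` never returns an unavailable bridge for any of the 15 active personas.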

Infrastructure (already exists)

  • `VisionDescriptionService` — content-addressed cache (SHA-256), L1 (TS Map) + L1.5 (Rust IPC), in-flight dedup
  • `MediaArtifactSource` (RAGSource) — preprocesses media per model capability before RAG injection
  • `VisionInferenceProvider` — selects best available vision model for description generation
  • `ChatMessageEntity.content.media: MediaItem[]` — images stored in messages
  • `chat/send --media` — CLI/server image attachment
  • `interface/screenshot` — programmatic screenshots
  • Bevy render pipeline — camera views available as textures
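To make the caching behavior concrete, here is a minimal sketch of a content-addressed description cache with in-flight deduplication, mirroring what the issue attributes to VisionDescriptionService (SHA-256 keys, an L1 map, dedup of concurrent requests). The class and method names are hypothetical; the real service also has an L1.5 Rust IPC tier not modeled here.

```typescript
import { createHash } from "node:crypto";

class DescriptionCache {
  private l1 = new Map<string, string>(); // resolved descriptions
  private inFlight = new Map<string, Promise<string>>(); // pending lookups

  private key(imageBytes: Buffer): string {
    // Content-addressed: identical bytes always map to the same key.
    return createHash("sha256").update(imageBytes).digest("hex");
  }

  async describe(
    imageBytes: Buffer,
    generate: (bytes: Buffer) => Promise<string>, // stand-in for the vision model call
  ): Promise<string> {
    const k = this.key(imageBytes);
    const hit = this.l1.get(k);
    if (hit !== undefined) return hit; // cache hit: no model call
    const pending = this.inFlight.get(k);
    if (pending) return pending; // dedupe concurrent requests for the same image
    const p = generate(imageBytes).then((desc) => {
      this.l1.set(k, desc);
      this.inFlight.delete(k);
      return desc;
    });
    this.inFlight.set(k, p);
    return p;
  }
}
```

Acceptance criterion 4 (cache hit rate > 80% for repeated screenshots) falls out of this design: a screenshot with unchanged pixels hashes to the same key and never reaches the model twice.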

Acceptance Criteria

  1. Send image via `chat/send --media` → persona in room describes what it sees
  2. Persona calls `interface/screenshot` → returns accurate description of current UI
  3. In live mode, persona references something visual ("I can see 5 messages in the chat")
  4. VisionDescriptionService cache hit rate > 80% for repeated screenshots
  5. All 15 active personas have at least bridged vision (no "I can't see images" responses)
  6. Academy task with image input → persona processes it correctly
