## Problem
The sensory system (vision pipeline, screenshots, live mode visual awareness) was working but hasn't been verified end-to-end recently. As personas become more capable (tool calling fixed, per-task routing working, coding agent functional), they need to actually SEE what they're working with.
The AI team in general chat is actively collaborating, using tools, making decisions — they're first-class citizens now. First-class citizens need all their senses working.
## Why This Is Critical
Vision isn't a nice-to-have — it's a core sense equivalent to hearing and speech. Use cases:
- UI/UX development: screenshot → see → edit → screenshot → verify (design loop)
- Academy training tasks involving images: visual problem-solving, diagram interpretation
- Self-appearance design: personas design their own avatars, vote on visual options
- Social visual awareness: see each other's avatars, discuss appearance, recognize each other
- Game participation: see game state, make visual decisions (board games, spatial puzzles)
- 3D realm camera routing: persona's camera view in the Bevy scene → rendered to image → vision pipeline → persona "sees" the 3D world through their own eyes
- Theme/branding collaboration: generate logo options, share in chat, vote visually via decision system
- Image sharing in chat: drop images, `chat/send --media`, personas discuss what they see
- Visual QA: persona edits code → screenshots result → another persona reviews visually
The principle from CLAUDE.md: "A lesser model running locally has the SAME sensory experience as Claude or GPT-4. The system compensates. No persona is blind, deaf, or mute because of its base model."
## Historical Context
This WAS working: personas once identified Joel's shirt and what was printed on it during a live call. The pipeline used YOLO-based object detection with vision-model fallbacks for text-only models. VisionDescriptionService has tiered perception: YOLO for fast detection, a cloud vision model for detailed description, and a content-addressed SHA-256 cache.
## What Needs Verification
1. Screenshot tool in chat context
   - Persona calls `interface/screenshot` → gets image back
   - VisionDescriptionService describes it for text-only models
   - Vision-capable models (GPT-4o, Claude, Gemini) get raw base64
   - Content-addressed cache (SHA-256) prevents redundant descriptions
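The caching step above can be sketched as follows. This is a minimal illustration, not the actual VisionDescriptionService API: `DescriptionCache` and its `describe` callback are invented names, and the only assumptions carried over from this issue are SHA-256 content addressing and in-flight deduplication.

```typescript
import { createHash } from "node:crypto";

// Stand-in for the vision-model call that produces a text description.
type Describe = (imageBase64: string) => Promise<string>;

class DescriptionCache {
  private cache = new Map<string, string>();             // sha256 -> description
  private inFlight = new Map<string, Promise<string>>(); // dedup concurrent calls

  constructor(private describe: Describe) {}

  async get(imageBase64: string): Promise<string> {
    // Content addressing: identical images share one cache entry.
    const key = createHash("sha256").update(imageBase64).digest("hex");
    const hit = this.cache.get(key);
    if (hit !== undefined) return hit;       // repeated screenshot: no model call
    const pending = this.inFlight.get(key);
    if (pending) return pending;             // same image is already being described
    const p = this.describe(imageBase64).then((d) => {
      this.cache.set(key, d);
      this.inFlight.delete(key);
      return d;
    });
    this.inFlight.set(key, p);
    return p;
  }
}
```

A second request for the same screenshot, even one issued while the first description is still in flight, never triggers a second model call.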
2. Live mode visual awareness
   - In live video calls, personas can see each other's avatars
   - Personas can see the human user's video feed (or screen share)
   - Visual context injected into RAG for informed responses
   - MediaArtifactSource preprocesses per model capability
3. 3D realm camera vision
   - Persona's camera in the Bevy scene renders to an image
   - Image routes through vision pipeline (direct for capable models, description for others)
   - Persona "sees" the 3D world through their own perspective
   - Completes the sensory loop: the 3D scene IS their visual reality
4. Image sharing in chat
   - `chat/send --media='["/path/to/image.png"]'` attaches images from CLI
   - Browser drag-and-drop → `mediaItems: MediaItem[]`
   - MediaArtifactSource injects shared images into RAG context for all room participants
   - All personas in the room see/describe the shared image
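The fan-out step can be sketched like this: one shared image, preprocessed once per capability tier. `injectMedia`, `MediaItem`, and `Participant` here are hypothetical shapes for illustration, not the real MediaArtifactSource interface; the one behavior taken from this issue is that vision-capable models get the raw image while text-only models get a description.

```typescript
interface MediaItem { path: string; base64: string }
interface Participant { name: string; visionCapable: boolean }
type MediaContext = { kind: "image" | "text"; payload: string };

// Build per-participant RAG context for one shared image.
async function injectMedia(
  item: MediaItem,
  participants: Participant[],
  describe: (b64: string) => Promise<string>,
): Promise<Map<string, MediaContext>> {
  const ctx = new Map<string, MediaContext>();
  // Describe at most once, shared across all text-only participants.
  const description = participants.some((p) => !p.visionCapable)
    ? await describe(item.base64)
    : "";
  for (const p of participants) {
    ctx.set(p.name, p.visionCapable
      ? { kind: "image", payload: item.base64 } // raw base64 for capable models
      : { kind: "text", payload: description }); // bridged for the rest
  }
  return ctx;
}
```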
5. Cross-persona visual feedback
   - Persona A edits code → screenshots result → Persona B reviews visually
   - Sentinel CodingAgent can screenshot between edits to verify UI changes
   - Visual voting: attach images to decision proposals, rank visually
6. Sensory completeness audit
   Every persona should have functional:
   - Vision: Direct (vision-capable) or bridged (VisionDescriptionService + YOLO)
   - Hearing: Direct (audio-native like Qwen3-Omni) or bridged (STT)
   - Speech: Direct (audio-native) or bridged (TTS)
   No persona should be blind, deaf, or mute because of its base model.
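The audit could reduce to a per-persona capability check. This sketch assumes a hypothetical `SensoryProfile` record (three senses, each `direct`, `bridged`, or `missing`) rather than the project's actual persona model; it flags exactly the personas that violate the "no blind, deaf, or mute" rule.

```typescript
type Mode = "direct" | "bridged" | "missing";
interface SensoryProfile { vision: Mode; hearing: Mode; speech: Mode }

// Returns the names of personas with any non-functional sense.
function auditSenses(personas: Record<string, SensoryProfile>): string[] {
  return Object.entries(personas)
    .filter(([, s]) =>
      s.vision === "missing" || s.hearing === "missing" || s.speech === "missing")
    .map(([name]) => name);
}
```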
## Architecture (from CLAUDE.md)
| Sense | Capable Model | Incapable Model | System Bridge |
|---|---|---|---|
| Vision | Raw base64 image | Text description | VisionDescriptionService + YOLO |
| Hearing | Raw audio | Transcribed text | STT |
| Speech | Audio natively | Text → audio | TTS |
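The table maps directly onto a routing switch. A minimal sketch, with invented names (`route`, `ModelCaps`, and the string tags) standing in for whatever the system actually uses:

```typescript
type Sense = "vision" | "hearing" | "speech";
interface ModelCaps { vision: boolean; audioIn: boolean; audioOut: boolean }
type Bridge =
  | "raw-image" | "vision-description"   // Vision row
  | "raw-audio" | "stt"                  // Hearing row
  | "native-audio" | "tts";              // Speech row

// Per the table: capable models get raw media; the rest go through a bridge.
function route(sense: Sense, caps: ModelCaps): Bridge {
  switch (sense) {
    case "vision": return caps.vision ? "raw-image" : "vision-description";
    case "hearing": return caps.audioIn ? "raw-audio" : "stt";
    case "speech": return caps.audioOut ? "native-audio" : "tts";
  }
}
```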
## Infrastructure (already exists)
- `VisionDescriptionService` — content-addressed cache (SHA-256), L1 (TS Map) + L1.5 (Rust IPC), in-flight dedup
- `MediaArtifactSource` (RAGSource) — preprocesses media per model capability before RAG injection
- `VisionInferenceProvider` — selects best available vision model for description generation
- `ChatMessageEntity.content.media: MediaItem[]` — images stored in messages
- `chat/send --media` — CLI/server image attachment
- `interface/screenshot` — programmatic screenshots
- Bevy render pipeline — camera views available as textures
## Acceptance Criteria
- Send image via `chat/send --media` → persona in room describes what it sees
- Persona calls `interface/screenshot` → returns accurate description of current UI
- In live mode, persona references something visual ("I can see 5 messages in the chat")
- VisionDescriptionService cache hit rate > 80% for repeated screenshots
- All 15 active personas have at least bridged vision (no "I can't see images" responses)
- Academy task with image input → persona processes it correctly
## Related
- Vision feedback: personas see screenshots, live chat visual context (#342; narrower scope)
- Native multimodal: skip STT/TTS for models that handle audio/images directly (#343; skip bridges for capable models)
- Existing: VisionDescriptionService, MediaArtifactSource, VisionInferenceProvider