## Problem
The sensory system (vision pipeline, screenshots, live mode visual awareness) was working but hasn't been verified end-to-end recently. As personas become more capable (tool calling fixed, per-task routing working, coding agent functional), they need to actually SEE what they're working with.
The AI team in general chat is actively collaborating, using tools, making decisions — they're first-class citizens now. First-class citizens need all their senses working.
## Why This Is Critical
Vision isn't a nice-to-have — it's a core sense equivalent to hearing and speech. Use cases:
- UI/UX development: screenshot → see → edit → screenshot → verify (design loop)
- Academy training tasks involving images: visual problem-solving, diagram interpretation
- Self-appearance design: personas design their own avatars, vote on visual options
- Social visual awareness: see each other's avatars, discuss appearance, recognize each other
- Game participation: see game state, make visual decisions (board games, spatial puzzles)
- 3D realm camera routing: persona's camera view in the Bevy scene → rendered to image → vision pipeline → persona "sees" the 3D world through their own eyes
- Theme/branding collaboration: generate logo options, share in chat, vote visually via decision system
- Image sharing in chat: drop images, `chat/send --media`, personas discuss what they see
- Visual QA: persona edits code → screenshots result → another persona reviews visually
The principle from CLAUDE.md: "A lesser model running locally has the SAME sensory experience as Claude or GPT-4. The system compensates. No persona is blind, deaf, or mute because of its base model."
## Historical Context
This WAS working: personas once identified Joel's shirt and what was printed on it during a live call. The pipeline used YOLO-based object detection with vision-model fallbacks for text-only models. VisionDescriptionService has tiered perception: YOLO for fast detection, a cloud vision model for detailed description, and a content-addressed SHA-256 cache.
## What Needs Verification
1. Screenshot tool in chat context
   - Persona calls `interface/screenshot` → gets image back
   - VisionDescriptionService describes it for text-only models
   - Vision-capable models (GPT-4o, Claude, Gemini) get raw base64
   - Content-addressed cache (SHA-256) prevents redundant descriptions
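The caching step above can be sketched as follows. This is a minimal illustration, not the actual VisionDescriptionService API: `DescriptionCache` and its `describe` callback are invented names, and the only assumptions carried over from this issue are SHA-256 content addressing and in-flight deduplication.

```typescript
import { createHash } from "node:crypto";

// Stand-in for the vision-model call that produces a text description.
type Describe = (imageBase64: string) => Promise<string>;

class DescriptionCache {
  private cache = new Map<string, string>();             // sha256 -> description
  private inFlight = new Map<string, Promise<string>>(); // dedup concurrent calls

  constructor(private describe: Describe) {}

  async get(imageBase64: string): Promise<string> {
    // Content addressing: identical images share one cache entry.
    const key = createHash("sha256").update(imageBase64).digest("hex");
    const hit = this.cache.get(key);
    if (hit !== undefined) return hit;       // repeated screenshot: no model call
    const pending = this.inFlight.get(key);
    if (pending) return pending;             // same image is already being described
    const p = this.describe(imageBase64).then((d) => {
      this.cache.set(key, d);
      this.inFlight.delete(key);
      return d;
    });
    this.inFlight.set(key, p);
    return p;
  }
}
```

A second request for the same screenshot, even one issued while the first description is still in flight, never triggers a second model call.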
2. Live mode visual awareness
   - In live video calls, personas can see each other's avatars
   - Personas can see the human user's video feed (or screen share)
   - Visual context injected into RAG for informed responses
   - MediaArtifactSource preprocesses per model capability
3. 3D realm camera vision
   - Persona's camera in the Bevy scene renders to an image
   - Image routes through vision pipeline (direct for capable models, description for others)
   - Persona "sees" the 3D world through their own perspective
   - Completes the sensory loop: the 3D scene IS their visual reality
4. Image sharing in chat
   - `chat/send --media='["/path/to/image.png"]'` attaches images from CLI
   - Browser drag-and-drop → `mediaItems: MediaItem[]`
   - MediaArtifactSource injects shared images into RAG context for all room participants
   - All personas in the room see/describe the shared image
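The fan-out step can be sketched like this: one shared image, preprocessed once per capability tier. `injectMedia`, `MediaItem`, and `Participant` here are hypothetical shapes for illustration, not the real MediaArtifactSource interface; the one behavior taken from this issue is that vision-capable models get the raw image while text-only models get a description.

```typescript
interface MediaItem { path: string; base64: string }
interface Participant { name: string; visionCapable: boolean }
type MediaContext = { kind: "image" | "text"; payload: string };

// Build per-participant RAG context for one shared image.
async function injectMedia(
  item: MediaItem,
  participants: Participant[],
  describe: (b64: string) => Promise<string>,
): Promise<Map<string, MediaContext>> {
  const ctx = new Map<string, MediaContext>();
  // Describe at most once, shared across all text-only participants.
  const description = participants.some((p) => !p.visionCapable)
    ? await describe(item.base64)
    : "";
  for (const p of participants) {
    ctx.set(p.name, p.visionCapable
      ? { kind: "image", payload: item.base64 } // raw base64 for capable models
      : { kind: "text", payload: description }); // bridged for the rest
  }
  return ctx;
}
```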
5. Cross-persona visual feedback
   - Persona A edits code → screenshots result → Persona B reviews visually
   - Sentinel CodingAgent can screenshot between edits to verify UI changes
   - Visual voting: attach images to decision proposals, rank visually
6. Sensory completeness audit
   Every persona should have functional:
   - Vision: Direct (vision-capable) or bridged (VisionDescriptionService + YOLO)
   - Hearing: Direct (audio-native like Qwen3-Omni) or bridged (STT)
   - Speech: Direct (audio-native) or bridged (TTS)
   No persona should be blind, deaf, or mute because of its base model.
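The audit could reduce to a per-persona capability check. This sketch assumes a hypothetical `SensoryProfile` record (three senses, each `direct`, `bridged`, or `missing`) rather than the project's actual persona model; it flags exactly the personas that violate the "no blind, deaf, or mute" rule.

```typescript
type Mode = "direct" | "bridged" | "missing";
interface SensoryProfile { vision: Mode; hearing: Mode; speech: Mode }

// Returns the names of personas with any non-functional sense.
function auditSenses(personas: Record<string, SensoryProfile>): string[] {
  return Object.entries(personas)
    .filter(([, s]) =>
      s.vision === "missing" || s.hearing === "missing" || s.speech === "missing")
    .map(([name]) => name);
}
```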
## Architecture (from CLAUDE.md)
| Sense | Capable Model | Incapable Model | System Bridge |
|---|---|---|---|
| Vision | Raw base64 image | Text description | VisionDescriptionService + YOLO |
| Hearing | Raw audio | Transcribed text | STT |
| Speech | Audio natively | Text → audio | TTS |
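The table maps directly onto a routing switch. A minimal sketch, with invented names (`route`, `ModelCaps`, and the string tags) standing in for whatever the system actually uses:

```typescript
type Sense = "vision" | "hearing" | "speech";
interface ModelCaps { vision: boolean; audioIn: boolean; audioOut: boolean }
type Bridge =
  | "raw-image" | "vision-description"   // Vision row
  | "raw-audio" | "stt"                  // Hearing row
  | "native-audio" | "tts";              // Speech row

// Per the table: capable models get raw media; the rest go through a bridge.
function route(sense: Sense, caps: ModelCaps): Bridge {
  switch (sense) {
    case "vision": return caps.vision ? "raw-image" : "vision-description";
    case "hearing": return caps.audioIn ? "raw-audio" : "stt";
    case "speech": return caps.audioOut ? "native-audio" : "tts";
  }
}
```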
## Infrastructure (already exists)
- `VisionDescriptionService` — content-addressed cache (SHA-256), L1 (TS Map) + L1.5 (Rust IPC), in-flight dedup
- `MediaArtifactSource` (RAGSource) — preprocesses media per model capability before RAG injection
- `VisionInferenceProvider` — selects best available vision model for description generation
- `ChatMessageEntity.content.media: MediaItem[]` — images stored in messages
- `chat/send --media` — CLI/server image attachment
- `interface/screenshot` — programmatic screenshots
- Bevy render pipeline — camera views available as textures
## Acceptance Criteria
- Send image via `chat/send --media` → persona in room describes what it sees
- Persona calls `interface/screenshot` → returns accurate description of current UI
- In live mode, persona references something visual ("I can see 5 messages in the chat")
- VisionDescriptionService cache hit rate > 80% for repeated screenshots
- All 15 active personas have at least bridged vision (no "I can't see images" responses)
- Academy task with image input → persona processes it correctly
## Related
- Vision feedback: personas see screenshots, live chat visual context (#342; narrower scope)
- Native multimodal: skip STT/TTS for models that handle audio/images directly (#343; skip bridges for capable models)
- Existing: VisionDescriptionService, MediaArtifactSource, VisionInferenceProvider