Any media type in, any media type out, for ANY AI. The system is a multimodal bridge that gives every model — from a blind 0.8B text model to a cloud multimodal giant — equal access to every sense. Like accessibility aids for the visually impaired: the infrastructure provides what the model lacks.
No model is truly blind, deaf, or mute in Continuum. The system provides universal sensory adapters that translate between modalities. A text-only model can see a game, hear a conversation, and speak aloud — because the infrastructure handles every translation step.
   ANY INPUT                   THE BRIDGE                    ANY OUTPUT
   ─────────                   ──────────                    ──────────
Webcam video    ──┐                                  ┌── Synthesized speech (TTS)
Screen share    ──┤     ┌─────────────────────┐      ├── Written text
Game capture    ──┤     │  ML Adapter Layer   │      ├── Tool actions (clicks, keys)
3D scene render ──┤     │                     │      ├── Code edits
iPhone/ADB      ──┼────▶│  YOLO / Classifier  │──────├── Avatar animation
Microphone      ──┤     │  STT (Whisper)      │      ├── Game controls
Chat messages   ──┤     │  TTS                │      ├── API calls
Code files      ──┤     │  VisionDescription  │      ├── Canvas drawing
Documents       ──┘     │  Semantic Segmenter │      └── File writes
                        └─────────────────────┘
                                   │
                            Recipe controls
                             which senses
                              are active
The ML adapter layer is not overhead — it's the accessibility layer. It makes the entire model ecosystem usable. A $0 local 0.8B model has the same sensory capabilities as a $200/mo cloud multimodal model. The difference is reasoning quality, not perception.
Every sensory input goes through the same tiered pipeline, parallel across modalities:
| Tier | Audio | Visual | Cost | Who Consumes |
|---|---|---|---|---|
| T0 (detect) | VAD (voice activity) | YOLO / semantic classifier | Cheap, every frame | Triggers + structured metadata for ALL models |
| T1 (transcribe) | STT → text transcript | VisionDescriptionService → text description | Medium, on trigger | Text-only ("blind/deaf") models |
| T2 (raw) | PCM audio stream | Raw JPEG frame | Full, by choice | Natively capable models |
Key properties:
- Lazy: Higher tiers only trigger when a consumer needs them
- Cached: One inference serves all consumers (content-addressed dedup)
- Shared: 14 personas in a room = 1 YOLO call, 1 VisionDescription call
- Recipe-driven: Each persona's recipe declares which tiers to activate
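The recipes in this document set `visualTier` to values such as "t0+t1" and "raw". As a minimal sketch of how such a string might be turned into a set of active tiers, the following uses a hypothetical `Tier` enum and `parse_visual_tier` helper. Treating "raw" as an alias for T2 is an assumption based on the recipe comment that labels it T2; the accepted spellings are not confirmed by the source.

```rust
/// Perception tiers from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tier {
    T0Detect,
    T1Transcribe,
    T2Raw,
}

/// Parse a recipe `visualTier` string such as "t0+t1" or "raw".
/// Hypothetical helper: the accepted spellings are assumptions.
pub fn parse_visual_tier(spec: &str) -> Result<Vec<Tier>, String> {
    let mut tiers = Vec::new();
    for part in spec.split('+') {
        let tier = match part.trim().to_ascii_lowercase().as_str() {
            "t0" => Tier::T0Detect,
            "t1" => Tier::T1Transcribe,
            "t2" | "raw" => Tier::T2Raw,
            other => return Err(format!("unknown tier: {other:?}")),
        };
        // Ignore duplicates so "t0+t0" activates T0 once.
        if !tiers.contains(&tier) {
            tiers.push(tier);
        }
    }
    tiers.sort();
    Ok(tiers)
}
```

Rejecting unknown tier names outright, rather than ignoring them, keeps a typo in a recipe from silently disabling a sense.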
T0 runs in Rust on every frame/sample. It produces structured metadata that ANY model can consume — no vision or audio capability required.
// Visual T0: YOLO + semantic classification
pub struct SceneMetadata {
    pub objects: Vec<DetectedObject>,   // [{type: "person", bbox, confidence}]
    pub scene_type: String,             // "video_game", "ide", "video_call"
    pub semantic_labels: Vec<String>,   // ["combat", "health_bar_low", "forest"]
    pub change_magnitude: f32,          // How much changed since last frame
    pub timestamp_ms: u64,
}

// Audio T0: VAD + speaker identification
pub struct AudioMetadata {
    pub is_speech: bool,
    pub speaker_id: Option<String>,
    pub energy_db: f32,
    pub timestamp_ms: u64,
}

A text-only DeepSeek receiving SceneMetadata knows you're in a boss fight with low health
just as well as Claude with raw vision. T0 is the great equalizer.
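To make the "great equalizer" claim concrete, here is a minimal sketch of how T0 metadata could be flattened into one line of text for a text-only model's prompt. `describe_scene` and the `DetectedObject` fields are hypothetical: the document only names `type`, `bbox`, and `confidence`, and because `type` is a Rust keyword the sketch uses `label` instead and omits `bbox`.

```rust
use std::collections::BTreeMap;

/// Hypothetical detection shape (assumption; `bbox` omitted for brevity).
pub struct DetectedObject {
    pub label: String,
    pub confidence: f32,
}

pub struct SceneMetadata {
    pub objects: Vec<DetectedObject>,
    pub scene_type: String,
    pub semantic_labels: Vec<String>,
    pub change_magnitude: f32,
    pub timestamp_ms: u64,
}

/// Render T0 metadata as a single line of plain text.
/// Detections below `min_confidence` are dropped.
pub fn describe_scene(meta: &SceneMetadata, min_confidence: f32) -> String {
    // BTreeMap keeps labels in a stable alphabetical order.
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for obj in meta.objects.iter().filter(|o| o.confidence >= min_confidence) {
        *counts.entry(obj.label.as_str()).or_insert(0) += 1;
    }
    let objects = if counts.is_empty() {
        "no objects detected".to_string()
    } else {
        counts
            .iter()
            .map(|(label, n)| format!("{n} {label}"))
            .collect::<Vec<_>>()
            .join(", ")
    };
    format!(
        "scene_type: {}; objects: {}; labels: {}",
        meta.scene_type,
        objects,
        meta.semantic_labels.join(", ")
    )
}
```

A scene with two confident "person" detections, scene type "video_game", and labels "combat" and "health_bar_low" would render as `scene_type: video_game; objects: 2 person; labels: combat, health_bar_low`.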
But T0 isn't just for the blind; it empowers vision models too. A model that can already see raw frames also gets YOLO bounding boxes, semantic labels, surface normals, depth maps, and heatmaps overlaid on them. That's super vision: seeing more than the raw pixels, like giving a sighted person infrared or X-ray:
- Raw frame + YOLO boxes = "I see the screen AND I know exactly where every UI element is"
- Raw frame + depth map = "I see the scene AND I know the 3D geometry"
- Raw frame + attention heatmap = "I see the game AND I know where the player is looking"
- Raw frame + semantic segmentation = "I see the photo AND I know which pixels are sky/road/car"
The adapter layer isn't just accessibility — it's augmentation. Every model gets capabilities beyond its native training. Blind models can see. Sighted models get superhuman perception. The system elevates everyone.
T1 generates natural language descriptions. Triggered by T0 (scene change detected) or by consumer request. Cached and shared across all consumers.
T0 detects scene change → triggers T1
  VisionDescriptionService.describeBase64(frame)
    → "Player character in a dark forest clearing, fighting a red dragon.
       Health bar shows 60%. Two potions in inventory. UI shows quest tracker."
  Cached by content hash. 14 personas = 1 API call.
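The "14 personas = 1 API call" property follows from keying the cache on the frame's content rather than on the requester. A minimal sketch, assuming a hypothetical `DescriptionCache` type; it uses the standard library's 64-bit `DefaultHasher` for brevity, whereas a real content-addressed store would more likely use a cryptographic hash to rule out collisions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Content-addressed cache: identical frame bytes map to one description.
pub struct DescriptionCache {
    entries: HashMap<u64, String>,
    /// Number of times the expensive describe function actually ran.
    pub inference_calls: usize,
}

impl DescriptionCache {
    pub fn new() -> Self {
        Self { entries: HashMap::new(), inference_calls: 0 }
    }

    /// Return the cached description for `frame`, calling `describe` only on a miss.
    pub fn get_or_describe<F>(&mut self, frame: &[u8], describe: F) -> String
    where
        F: FnOnce(&[u8]) -> String,
    {
        let mut hasher = DefaultHasher::new();
        frame.hash(&mut hasher);
        let key = hasher.finish();

        if let Some(hit) = self.entries.get(&key) {
            return hit.clone();
        }
        self.inference_calls += 1;
        let text = describe(frame);
        self.entries.insert(key, text.clone());
        text
    }
}
```

Fourteen consumers asking about the same frame would leave `inference_calls` at 1; only a frame with different bytes triggers a second call.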
Vision-capable models can opt into raw frames. Audio-native models can process PCM. This is the highest-fidelity tier — but also the most expensive (context window tokens).
Recipe controls T2 activation:
{
  "visualSources": ["livekit-feed"],
  "visualTier": "raw",            // T2: raw frames in context
  "visualBudgetPercent": 15       // Max 15% of context window for images
}

Visual perception is a RAG source type, following the same pattern as text, code, and conversation. Each adapter implements the same interface. Recipes compose them. They're traded like LoRA layers and personas.
| Adapter | Source | Use Case |
|---|---|---|
| `LiveKitFeedAdapter` | LiveKit video tracks | Video calls: webcam, screen share, AI avatars |
| `BevySceneAdapter` | Bevy 3D render | Avatar scenes, 3D environments |
| `ScreenCaptureAdapter` | Desktop/window capture | IDE, browser, any application |
| `GameCaptureAdapter` | Game window | Playing/testing games |
| `ADBDeviceAdapter` | Android Debug Bridge | Mobile app testing |
| `iOSSimulatorAdapter` | Xcode Simulator | iOS app testing |
Each adapter produces RAGArtifact with type image. The tiered perception pipeline
processes them identically regardless of source. The recipe declares which adapters
are active.
LiveKitFeedAdapter is the first working adapter. It captures video frames from LiveKit room participants at 1fps.
LiveKit Room
│
├─ Human webcam tracks
├─ Human screen share tracks
├─ AI persona avatar renders (Bevy → LiveKit)
│
▼
VideoFrameCapture (Rust singleton)
│
├─ I420 → RGB → JPEG conversion
├─ Content-addressed dedup (skip identical frames)
├─ 1fps rate limit
├─ Grid composition (all participants → single image)
│
▼
IPC Commands
├─ voice/snapshot-room → composite grid JPEG
└─ voice/snapshot-participant → individual JPEG
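The dedup and rate-limit stages of the pipeline above can be sketched as a small gate that decides whether a captured frame is worth encoding and forwarding. `FrameGate` is a hypothetical name, not the actual `VideoFrameCapture` code, and checking the rate limit before hashing is a design choice made here to avoid hashing frames that would be dropped anyway.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Drops frames that arrive too soon or are byte-identical to the last emitted one.
pub struct FrameGate {
    min_interval_ms: u64,
    last_emit_ms: Option<u64>,
    last_hash: Option<u64>,
}

impl FrameGate {
    pub fn new(max_fps: u64) -> Self {
        Self {
            min_interval_ms: 1000 / max_fps.max(1),
            last_emit_ms: None,
            last_hash: None,
        }
    }

    /// Returns true if this frame should be encoded and forwarded.
    pub fn should_emit(&mut self, now_ms: u64, frame: &[u8]) -> bool {
        // Rate limit: at most one frame per interval.
        if let Some(last) = self.last_emit_ms {
            if now_ms.saturating_sub(last) < self.min_interval_ms {
                return false;
            }
        }
        // Dedup: skip a frame whose content matches the last emitted frame.
        let mut hasher = DefaultHasher::new();
        frame.hash(&mut hasher);
        let hash = hasher.finish();
        if self.last_hash == Some(hash) {
            return false;
        }
        self.last_emit_ms = Some(now_ms);
        self.last_hash = Some(hash);
        true
    }
}
```

With `FrameGate::new(1)`, a static screen share costs one frame in total, and a busy webcam costs at most one frame per second.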
Audio sources follow the same adapter pattern, and most of it is already implemented via the STT/TTS pipeline:
| Adapter | Source | Output |
|---|---|---|
| `LiveKitAudioAdapter` | LiveKit audio tracks | PCM → Whisper STT → transcript |
| `MicrophoneAdapter` | Local mic | Direct audio capture |
| `SystemAudioAdapter` | Desktop audio | Game sounds, music, notifications |
Every sense has a corresponding action. The recipe controls both sides:
| Sense (Input) | Action (Output) | Domain |
|---|---|---|
| See video/screen | Click, type, navigate | UI interaction |
| Hear speech | Speak via TTS | Voice |
| Read chat | Send message | Chat |
| See code | Edit code | Development |
| See game | Press buttons, move | Gaming |
| See mobile app | Tap, swipe, type | Mobile testing |
Recipes are the control plane for sensory capabilities:
{
"name": "game-tester",
"ragTemplate": {
"sources": [
"conversation-history",
"persona-identity",
"game-perception"
],
"visualSources": ["game-capture"],
"visualTier": "t0+t1",
"enableYolo": true,
"yoloModel": "yolov8-gaming",
"sceneChangeThreshold": 0.2
},
"tools": [
"interface/click",
"interface/type",
"interface/key-press",
"interface/screenshot"
]
}

{
"name": "mobile-qa",
"ragTemplate": {
"sources": [
"conversation-history",
"persona-identity",
"device-perception"
],
"visualSources": ["adb-device"],
"visualTier": "t0+t1",
"enableYolo": true,
"yoloModel": "yolov8-ui-elements"
},
"tools": [
"device/tap",
"device/swipe",
"device/type",
"device/screenshot"
]
}

{
"name": "live-call-observer",
"ragTemplate": {
"sources": [
"conversation-history",
"persona-identity",
"voice-conversation",
"live-perception"
],
"visualSources": ["livekit-feed"],
"visualTier": "raw",
"audioSources": ["livekit-audio"]
},
"tools": [
"voice/synthesize",
"voice/snapshot-room",
"voice/snapshot-participant"
]
}

The persona is the central brain: high-level reasoning, planning, intent. Sentinels are the arms, each with enough local neural matter (a cheap LoRA-trained model) to execute independently on general instructions. Like an octopus: the brain says "get food from that crevice," and the arm figures out the motor control on its own.
Central Brain (persona — large model, slow, thoughtful)
"Navigate to the castle and find the key"
│
├── Vision Sentinel (0.8B + YOLO LoRA, 50ms response)
│ Captures frames, classifies objects, reports scene changes
│ "Door ahead. Locked. Keyhole visible."
│
├── Navigation Sentinel (0.8B + pathfinding LoRA, 50ms response)
│ Handles movement, obstacle avoidance, pathfinding
│ Executes WASD/mouse inputs, gets to the door
│
├── Interaction Sentinel (0.8B + UI LoRA, 50ms response)
│ Clicks buttons, opens menus, uses inventory items
│ "Opened inventory. Selected lockpick. Used on door."
│
└── Audio Sentinel (0.8B + STT LoRA, 100ms response)
    Monitors dialogue, filters noise, alerts brain
    "NPC said: 'The key is hidden upstairs in the library.'"
The brain doesn't micromanage every keystroke or frame — it delegates intent. The arms have enough intelligence to handle execution details. This is the peripheral nervous system: cheap, fast, specialized, semi-autonomous.
A tool call is synchronous and stateless: "take a screenshot." A sentinel is a running process with state, perception, and agency:
| Aspect | Tool Call | Sentinel Arm |
|---|---|---|
| Stateful | No | Yes — tracks scene, remembers path, accumulates context |
| Autonomous | No — waits for invocation | Yes — acts on general instructions |
| Adaptive | No — same behavior every time | Yes — LoRA-trained for specific domains |
| Parallel | Sequential tool calls | Multiple arms run simultaneously |
| Reactive | Only when called | Monitors continuously, alerts brain on events |
A vision sentinel doesn't just take one screenshot — it continuously monitors the visual field, detects changes, classifies objects, and alerts the brain only when something relevant happens. It's running its own perception loop.
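The difference between a stateless tool call and a stateful arm can be shown with a small sketch. The hypothetical `VisionSentinel` below remembers which watched objects were visible on the previous iteration and produces an alert only when that set changes; the label slices stand in for T0 detections and the type names are assumptions.

```rust
use std::collections::BTreeSet;

/// Alert sent to the brain when the watched scene changes.
#[derive(Debug, PartialEq)]
pub struct Alert {
    pub appeared: Vec<String>,
    pub disappeared: Vec<String>,
}

/// A stateful vision arm: remembers what it saw last and reports only differences.
pub struct VisionSentinel {
    watched: BTreeSet<String>,
    visible: BTreeSet<String>,
}

impl VisionSentinel {
    pub fn new(watched: &[&str]) -> Self {
        Self {
            watched: watched.iter().map(|s| s.to_string()).collect(),
            visible: BTreeSet::new(),
        }
    }

    /// One iteration of the perception loop.
    /// Returns None when nothing relevant changed, so the brain is not interrupted.
    pub fn observe(&mut self, labels: &[&str]) -> Option<Alert> {
        let now: BTreeSet<String> = labels
            .iter()
            .map(|s| s.to_string())
            .filter(|s| self.watched.contains(s))
            .collect();
        let appeared: Vec<String> = now.difference(&self.visible).cloned().collect();
        let disappeared: Vec<String> = self.visible.difference(&now).cloned().collect();
        self.visible = now;
        if appeared.is_empty() && disappeared.is_empty() {
            None
        } else {
            Some(Alert { appeared, disappeared })
        }
    }
}
```

An arm watching for "door" and "key" stays silent while trees and rocks come and go, and speaks up only when a door appears or vanishes.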
Each sentinel arm can be LoRA-trained for its specific function:
| Arm Type | Base Model | LoRA Training Data | Specialty |
|---|---|---|---|
| Vision arm | Qwen3.5-0.8B | YOLO annotations + scene descriptions | Fast visual classification |
| Navigation arm | Qwen3.5-0.8B | Movement sequences + pathfinding traces | Spatial reasoning, obstacle avoidance |
| UI arm | Qwen3.5-0.8B | Click/type sequences + UI element labels | Interface manipulation |
| Audio arm | Qwen3.5-0.8B | Dialogue transcripts + noise filtering | Speech monitoring and summarization |
| Code arm | Qwen3.5-2B | Code edits + test results | Surgical code changes |
Train a navigation arm on Minecraft → it learns that world's physics. Swap the same arm's LoRA to a Terraria adapter → it navigates 2D instead. The sentinel architecture is the same; only the training data changes.
This architecture maps directly to physical systems:
Robotics (same pattern, different adapters):
Central Brain (persona — planning, reasoning)
"Pick up the red cup and place it on the shelf"
│
├── Vision Sentinel (camera feed + YOLO + depth)
│ Object detection, pose estimation, distance calc
│
├── Manipulation Sentinel (motor control LoRA)
│ Inverse kinematics, grip force, trajectory planning
│
├── Navigation Sentinel (LIDAR + pathfinding LoRA)
│ Obstacle avoidance, SLAM, path execution
│
└── Safety Sentinel (force/torque monitoring)
    Emergency stop, collision detection, human proximity
The software sentinels we build today for games and UI testing are the same architecture that drives physical robots tomorrow. The visual source adapters become camera feeds. The action adapters become motor controllers. The LoRA training pipeline is identical — train on simulation, deploy to hardware.
A kid training their game-playing AI today is learning the same architecture that runs surgical robots. The complexity scales; the pattern doesn't change.
The octopus arms aren't only top-down (smart brain → dumb arms). Delegation flows in both directions. Dumb agents escalate up when they need more intelligence. Smart agents delegate down when they need more parallelism.
Intelligence Hierarchy (any node delegates to any other):
Tier 3: Cloud API (Claude, GPT) ← escalate when stakes are high
↕ delegate / escalate
Tier 2: Local 4B-9B persona ← the "brain" for most tasks
↕ delegate / escalate
Tier 1: Local 0.8B-2B sentinels ← arms, eyes, ears, motor control
↕ delegate / escalate
Tier 0: Rule-based / scripted ← 1000 NPCs on patrol routes, sensor loops
Example: Game world with 50 NPCs
Each NPC runs on a 0.8B model or scripted behavior — cheap, parallel, good enough for walking routes and generic barks. Player walks up to one → it escalates to the 4B persona who takes over that NPC's dialogue with full conversational context. Player leaves → drops back to cheap patrol mode. Cost: near zero for idle NPCs, full intelligence only when needed.
The smart persona can also delegate down — possessing any NPC: "Go tell the blacksmith to close shop." The persona puppets the NPC sentinel for 30 seconds with specific instructions, then releases it back to autonomous behavior.
Example: Security monitoring
100 camera sentinels (T0 YOLO, scripted alerts) running on Tier 0. One detects anomaly → escalates to Tier 1 sentinel (0.8B, classifies threat level). Confirmed threat → escalates to Tier 2 persona (4B, decides response). Critical incident → escalates to Tier 3 cloud API (full reasoning, coordinates with humans).
Example: Customer service
50 chat widgets running Tier 0 (FAQ matching, scripted responses). Customer asks something complex → escalates to Tier 1 (0.8B, handles common edge cases). Still stuck → Tier 2 persona (full context, account access, creative problem-solving). Billing dispute → Tier 3 (cloud API with full audit trail reasoning).
The pattern is universal: any intelligence level can delegate to any other, in either direction, based on the complexity of the moment. Cheap when idle, smart when it matters. The sentinel engine handles the lifecycle — spawn, monitor, escalate, release — identically whether the sentinel is an NPC, a camera, a chat widget, or a robotic arm.
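The escalation chain in the examples above can be sketched as a ladder that a request climbs until some tier can handle it. `IntelligenceTier`, `route`, and especially the numeric capacity thresholds are illustrative assumptions; the document does not specify how complexity is scored.

```rust
/// The four levels of the intelligence hierarchy.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum IntelligenceTier {
    Scripted, // Tier 0: rule-based
    Sentinel, // Tier 1: local 0.8B-2B
    Persona,  // Tier 2: local 4B-9B
    CloudApi, // Tier 3: cloud model
}

impl IntelligenceTier {
    /// One step up; the top tier stays where it is.
    pub fn escalate(self) -> Self {
        match self {
            Self::Scripted => Self::Sentinel,
            Self::Sentinel => Self::Persona,
            Self::Persona | Self::CloudApi => Self::CloudApi,
        }
    }

    /// One step down; the bottom tier stays where it is.
    pub fn delegate(self) -> Self {
        match self {
            Self::CloudApi => Self::Persona,
            Self::Persona => Self::Sentinel,
            Self::Sentinel | Self::Scripted => Self::Scripted,
        }
    }

    /// Highest complexity (0.0..=1.0) this tier handles. Values are assumptions.
    fn capacity(self) -> f32 {
        match self {
            Self::Scripted => 0.2,
            Self::Sentinel => 0.5,
            Self::Persona => 0.85,
            Self::CloudApi => 1.0,
        }
    }
}

/// Escalate from `start` until a tier can handle `complexity`.
/// This sketch never moves below `start`; delegation down is a separate decision.
pub fn route(start: IntelligenceTier, complexity: f32) -> IntelligenceTier {
    let mut tier = start;
    while complexity > tier.capacity() && tier != IntelligenceTier::CloudApi {
        tier = tier.escalate();
    }
    tier
}
```

A routine FAQ match stays on the scripted tier, while a request scored at 0.95 climbs through every level to the cloud API.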
Games have used level-of-detail for geometry since the 90s — render nearby objects in high polygon detail, distant objects as flat billboards. The same principle applies to intelligence. It's LOD, but for cognition instead of polygons.
Player's POV / Attention Cone:
┌─────────────────────────────────────────────────────────┐
│ FOCUS (T3): Full conversation, deep reasoning │
│ The NPC you're talking to — 4B+ model, full context, │
│ memory of prior encounters, emotional state │
│ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ NEAR (T2): Reactive, aware, listening │ │
│ │ NPCs within earshot — 2B model, can react to │ │
│ │ overheard dialogue, notice events, interject │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ STREET (T1): Simple autonomous behavior │ │ │
│ │ │ NPCs on the block — 0.8B model, walking │ │ │
│ │ │ routes, shopping, reacting to weather │ │ │
│ │ │ │ │ │
│ │ │ ┌───────────────────────────────────────┐ │ │ │
│ │ │ │ CITY (T0): Crowd simulation │ │ │ │
│ │ │ │ 10,000 people — scripted/rule-based, │ │ │ │
│ │ │ │ Boids flocking, patrol routes, │ │ │ │
│ │ │ │ statistical behavior, zero LLM cost │ │ │ │
│ │ │ └───────────────────────────────────────┘ │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
The player sees a living city of 10,000 individuals with diverse backgrounds, personalities, and daily routines. In reality:
- 9,990 are running pure crowd simulation (T0) — zero LLM cost
- 8 nearby are on 0.8B sentinels doing simple autonomous behavior (T1)
- 1 within earshot is on a 2B model, reactively aware (T2)
- 1 is running on a 4B+ model having a real conversation (T3)
Total LLM cost: ~10 cheap inferences. Not 10,000.
When the player turns around, the NPC behind them de-escalates from T3 to T1. The NPC they're now facing escalates from T1 to T3. The transition is seamless — the NPC "was always going about their business" from the player's perspective.
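The 10-out-of-10,000 arithmetic can be reproduced with a small sketch that assigns each NPC a level of detail from its distance to the player. The `Lod` enum, the function names, and the distance thresholds (5 m for "earshot", 30 m for "the block") are hypothetical values chosen for illustration.

```rust
/// Cognitive level of detail, from cheapest to most expensive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lod {
    City,   // T0: scripted crowd simulation, no LLM
    Street, // T1: 0.8B sentinel
    Near,   // T2: 2B model
    Focus,  // T3: 4B+ persona
}

/// Pick a tier from distance (metres) and whether the player is talking to this NPC.
/// Thresholds are illustrative assumptions.
pub fn lod_for(distance_m: f32, is_conversation_target: bool) -> Lod {
    if is_conversation_target {
        Lod::Focus
    } else if distance_m <= 5.0 {
        Lod::Near
    } else if distance_m <= 30.0 {
        Lod::Street
    } else {
        Lod::City
    }
}

/// Count NPCs that need any LLM at all (everything above T0).
/// Each entry is (distance in metres, is conversation target).
pub fn count_llm_backed(npcs: &[(f32, bool)]) -> usize {
    npcs.iter()
        .filter(|npc| lod_for(npc.0, npc.1) != Lod::City)
        .count()
}
```

For a city with one conversation partner, one NPC at 4 m, eight at 20 m, and 9,990 farther away, `count_llm_backed` returns 10.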
The sentinel lifecycle manager handles transitions between tiers. Detection is everything — the right triggers make the illusion seamless:
Escalation triggers (ramp UP intelligence):
- Player proximity (distance threshold)
- Direct address ("Hey you!" or player clicks NPC)
- Unusual event (explosion, combat, crime witnessed)
- Referenced by another NPC ("Go ask the blacksmith")
- Player gaze/attention direction (if tracked)
De-escalation triggers (ramp DOWN intelligence):
- Player walks away (distance threshold)
- Conversation ends (farewell, timeout)
- Attention shifts to another NPC
- Scene transition (player enters building)
- Idle timeout (no interaction for N seconds)
Hysteresis prevents thrashing — don't de-escalate immediately when the player glances away for 2 seconds. Use the same adaptive cadence pattern as PersonaUser's energy system: escalation is instant, de-escalation is gradual.
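"Escalation is instant, de-escalation is gradual" can be sketched as a small state machine. `TierHysteresis` and its hold time are hypothetical; this is not PersonaUser's energy system, only an illustration of the same asymmetry: a higher target applies at once, while a lower target must persist for `hold_ms` before the tier drops, and then only one step at a time.

```rust
/// Tracks an entity's current tier (0..=3) with asymmetric transitions.
pub struct TierHysteresis {
    current: u8,
    hold_ms: u64,
    /// When the target first fell below the current tier, if it is below now.
    below_since_ms: Option<u64>,
}

impl TierHysteresis {
    pub fn new(hold_ms: u64) -> Self {
        Self { current: 0, hold_ms, below_since_ms: None }
    }

    /// Feed the tier the triggers currently ask for; returns the tier to actually run.
    pub fn update(&mut self, target: u8, now_ms: u64) -> u8 {
        if target >= self.current {
            // Escalation (or no change) is instant and cancels any pending step-down.
            self.current = target;
            self.below_since_ms = None;
        } else {
            let since = *self.below_since_ms.get_or_insert(now_ms);
            if now_ms.saturating_sub(since) >= self.hold_ms {
                // Step down one tier, then restart the timer for the next step.
                self.current -= 1;
                self.below_since_ms = Some(now_ms);
            }
        }
        self.current
    }
}
```

With a 3-second hold, a 2-second glance away never reaches the step-down branch, so the NPC keeps its full-intelligence tier.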
This is the same pattern everywhere:
- Game NPCs: proximity + interaction → escalate
- Security cameras: anomaly detection → escalate
- Customer service: complexity detection → escalate
- IoT sensors: threshold breach → escalate
- Robotic swarm: obstacle/human proximity → escalate
The sentinel engine already has the primitives: Watch (monitor triggers), Condition (evaluate thresholds), Emit (signal tier change), Loop (continuous monitoring). Cognitive LOD is a recipe on top of existing infrastructure.
Recipes, like LoRA adapters and personas, are portable and tradeable:
- Someone builds a "Unity game tester" recipe with YOLO tuned for game UIs
- Someone else builds a "React component reviewer" recipe with screen capture + code indexing
- A "music producer" recipe with audio analysis adapters
- A "live translator" recipe with STT in one language + TTS in another
Same infrastructure, infinite applications. The recipe is the skill manifest.
Goal: Any visual source plugs in identically. First two: LiveKit (done) + Screen Capture.
- Define `VisualSourceAdapter` trait in Rust (`modules/perception.rs`)
  - `fn name(&self) -> &str`
  - `fn capture(&self) -> Option<JpegFrame>`
  - `fn start(&self)`, `fn stop(&self)`
  - `fn supports_continuous(&self) -> bool`
- Refactor `VideoFrameCapture` to implement `VisualSourceAdapter`
- Add `store_external_frame(identity, display_name, jpeg_bytes)` to accept frames from any source
- Add IPC command `perception/publish-frame`: TS activities push frames into the capture store
- Build `ScreenCaptureAdapter` (macOS: CGWindowListCreateImage, Linux: X11/PipeWire)
- Wire into recipe: `visualSources: ["livekit", "screen-capture"]`
- Test: persona sees both LiveKit participants AND desktop content in snapshot grid
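The trait listed in this phase might look like the following sketch. The method signatures are taken from the list; everything else is an assumption: the `JpegFrame` fields, the `StaticImageAdapter` test double, and the `snapshot_all` helper are hypothetical. Because `start` and `stop` take `&self`, the sketch uses an `AtomicBool` for the running flag.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical frame type; the real fields are not given in this document.
#[derive(Debug, Clone, PartialEq)]
pub struct JpegFrame {
    pub bytes: Vec<u8>,
}

pub trait VisualSourceAdapter {
    fn name(&self) -> &str;
    fn capture(&self) -> Option<JpegFrame>;
    fn start(&self);
    fn stop(&self);
    fn supports_continuous(&self) -> bool;
}

/// Test double that serves one fixed frame while running.
pub struct StaticImageAdapter {
    frame: JpegFrame,
    running: AtomicBool,
}

impl StaticImageAdapter {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { frame: JpegFrame { bytes }, running: AtomicBool::new(false) }
    }
}

impl VisualSourceAdapter for StaticImageAdapter {
    fn name(&self) -> &str {
        "static-image"
    }
    fn capture(&self) -> Option<JpegFrame> {
        if self.running.load(Ordering::SeqCst) {
            Some(self.frame.clone())
        } else {
            None
        }
    }
    fn start(&self) {
        self.running.store(true, Ordering::SeqCst);
    }
    fn stop(&self) {
        self.running.store(false, Ordering::SeqCst);
    }
    fn supports_continuous(&self) -> bool {
        false
    }
}

/// Collect one frame from every source that currently has one, keyed by adapter name.
pub fn snapshot_all(sources: &[Box<dyn VisualSourceAdapter>]) -> Vec<(String, JpegFrame)> {
    sources
        .iter()
        .filter_map(|s| s.capture().map(|f| (s.name().to_string(), f)))
        .collect()
}
```

Holding adapters as `Box<dyn VisualSourceAdapter>` is what lets a recipe mix LiveKit and screen-capture sources in one list without the pipeline knowing which is which.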
Goal: Cheap ML classification on every frame, structured metadata for all models.
- Add ONNX Runtime YOLO model to Rust (`live/perception/yolo.rs`)
  - Already have `ort` crate in deps (used by fastembed)
  - YOLOv8-nano (~6MB, runs in <20ms on CPU)
- `SceneChangeDetector` runs after each frame capture
  - Perceptual hash diff (skip unchanged frames)
  - YOLO inference on changed frames
  - Emit `SceneMetadata` events via MessageBus
- Add semantic classifier head (scene type: "game", "ide", "browser", "video_call")
  - Small MLP on top of YOLO backbone, or separate tiny model
- `LivePerceptionSource` (RAGSource) subscribes to SceneMetadata events
  - Injects structured scene data into persona system prompt
  - Budget-aware: only latest N frames
- Test: text-only model receives "3 people detected, scene_type: video_call" without any vision API call
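The "perceptual hash diff" step could be implemented in several ways; the sketch below uses an average hash (aHash) over an 8x8 grayscale thumbnail and the Hamming distance between hashes. This is one common technique, not necessarily what `SceneChangeDetector` will use. The function names are hypothetical, and it is an assumption that the result is on the same 0..1 scale as `change_magnitude` and `sceneChangeThreshold`.

```rust
/// 64-bit average hash over an 8x8 grayscale thumbnail (row-major).
/// Bit i is set when pixel i is brighter than the mean.
pub fn average_hash(gray_8x8: &[u8; 64]) -> u64 {
    let sum: u32 = gray_8x8.iter().map(|&p| u32::from(p)).sum();
    let mean = sum / 64;
    let mut hash = 0u64;
    for (i, &p) in gray_8x8.iter().enumerate() {
        if u32::from(p) > mean {
            hash |= 1u64 << i;
        }
    }
    hash
}

/// Fraction of hash bits that differ, in 0.0..=1.0.
pub fn change_magnitude(a: u64, b: u64) -> f32 {
    (a ^ b).count_ones() as f32 / 64.0
}

/// True when enough of the thumbnail changed to justify running YOLO.
pub fn scene_changed(prev: u64, next: u64, threshold: f32) -> bool {
    change_magnitude(prev, next) >= threshold
}
```

Downscaling each frame to 8x8 grayscale before hashing is assumed to happen upstream. Unlike the byte-exact content hash used for caching, this hash tolerates small pixel-level noise, which is the point of using it to skip "unchanged" frames.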
Goal: Cheap LoRA-trained sentinels as continuous perception/action subsystems.
- Define `SensoryArmConfig` in sentinel pipeline schema
  - `model`: base model (0.8B default)
  - `adapter`: LoRA adapter name (e.g., "yolo-game-ui")
  - `perceptionSource`: which visual/audio adapter to monitor
  - `reportingInterval`: how often to report to brain (ms)
  - `escalationThreshold`: when to alert the brain
- New sentinel step type: `Arm` (or extend `Sentinel` step with `mode: "sensory"`)
  - Runs continuously, not one-shot
  - Has its own perception loop (captures, classifies, decides)
  - Reports to parent via Emit/Watch
- Brain persona spawns arms as child sentinels
  - `sentinel/spawn-arm --type=vision --adapter=game-ui-yolo --source=screen-capture`
- Arms escalate via events: `arm:vision:alert {objects: [...], urgency: 0.8}`
- Train first arm: vision classifier for a specific game (training data from gameplay capture)
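A minimal sketch of the arm configuration and its escalation event, using the field names from the list above in Rust naming style. The `ArmAlert` type and the `maybe_alert` method are hypothetical, and comparing `urgency` directly against `escalationThreshold` is an assumption about how the threshold is meant to be used.

```rust
/// Configuration for one sensory arm, mirroring the schema fields listed above.
#[derive(Debug, Clone, PartialEq)]
pub struct SensoryArmConfig {
    pub model: String,
    pub adapter: String,
    pub perception_source: String,
    pub reporting_interval_ms: u64,
    pub escalation_threshold: f32,
}

/// Event payload shaped like `arm:vision:alert {objects: [...], urgency: 0.8}`.
#[derive(Debug, Clone, PartialEq)]
pub struct ArmAlert {
    pub event: String,
    pub objects: Vec<String>,
    pub urgency: f32,
}

impl SensoryArmConfig {
    /// Build an alert for the brain only when urgency reaches the threshold.
    pub fn maybe_alert(
        &self,
        arm_type: &str,
        objects: Vec<String>,
        urgency: f32,
    ) -> Option<ArmAlert> {
        if urgency >= self.escalation_threshold {
            Some(ArmAlert {
                event: format!("arm:{arm_type}:alert"),
                objects,
                urgency,
            })
        } else {
            None
        }
    }
}
```

Keeping the threshold in the arm's config rather than in the brain means the brain receives only events that already passed the filter.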
Goal: 10,000 NPCs for the cost of 10 inferences. Video game target.
- `CognitiveLODManager` service (TS or Rust)
  - Tracks all entities and their current tier (T0-T3)
  - Player position/attention → recalculates tiers each tick
  - Escalation/de-escalation triggers with hysteresis
- T0 layer: pure scripted behavior (Boids, patrol routes, state machines)
  - No LLM, no sentinel: just a behavior tree or FSM
  - Managed by game engine (Bevy ECS or TS game loop)
- T1→T3: sentinel spawn on escalation, despawn on de-escalation
  - T1: 0.8B sentinel, simple autonomous behavior
  - T2: 2B sentinel, reactive awareness, can interject
  - T3: 4B+ persona takeover, full conversation + memory
- Context handoff on escalation: T0 state → injected into T1 sentinel context
  - "You are a blacksmith named Gunther. You were hammering a sword. A customer just walked in."
- Memory persistence: T3 conversation memories survive de-escalation
  - Next time player talks to same NPC, T3 loads prior context
- First test: Bevy scene with 20 NPCs, player-controlled camera, proximity-based escalation
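The context handoff can be sketched as a function that turns scripted state into the opening prompt for the newly spawned sentinel, appending any memories that survived an earlier de-escalation. `ScriptedNpcState`, `handoff_prompt`, and the "You remember:" phrasing are hypothetical; only the example sentence comes from the list above.

```rust
/// Hypothetical snapshot of what the scripted (T0) layer knows about an NPC.
pub struct ScriptedNpcState {
    pub name: String,
    pub role: String,
    /// Phrase that completes "You were ...", e.g. "hammering a sword".
    pub current_activity: String,
}

/// Build the context injected into a sentinel when an NPC escalates.
/// `trigger` describes what caused the escalation; `memories` are prior
/// conversation notes that persisted across de-escalation.
pub fn handoff_prompt(state: &ScriptedNpcState, trigger: &str, memories: &[&str]) -> String {
    let mut prompt = format!(
        "You are a {} named {}. You were {}. {}",
        state.role, state.name, state.current_activity, trigger
    );
    if !memories.is_empty() {
        prompt.push_str(" You remember: ");
        prompt.push_str(&memories.join("; "));
        prompt.push('.');
    }
    prompt
}
```

For a blacksmith named Gunther who was hammering a sword, with the trigger "A customer just walked in." and no memories, this reproduces the example sentence exactly.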
Goal: AI personas play and inhabit a game world using the full sensory stack.
- Screen capture adapter → captures game window at 1fps
- T0 YOLO → classifies game objects (enemies, items, UI elements)
- Vision sentinel arm → continuous scene monitoring, alerts brain
- Navigation sentinel arm → pathfinding, movement execution
- Interaction sentinel arm → clicks, key presses, menu navigation
- Brain persona → high-level strategy, dialogue, decision-making
- Recipe: "game-player" with all visual + action adapters wired
- Training pipeline: capture gameplay → LoRA train navigation/vision arms
- Benchmark: measure task completion rate, compare untrained vs trained arms
| Component | Status | Location |
|---|---|---|
| VideoFrameCapture (LiveKit) | Done | live/video/capture.rs |
| IPC snapshot commands | Done | modules/live.rs |
| TS command layer | Done | commands/voice/snapshot-* |
| VisionDescriptionService | Done | system/vision/ |
| Two-tier vision cache (TS+Rust) | Done | modules/vision.rs |
| STT pipeline (Whisper) | Done | live/audio/ |
| TTS pipeline | Done | live/audio/tts/ |
| MediaArtifactSource (RAG) | Done | system/rag/sources/ |
| YOLO/classifier in Rust | Planned | — |
| SceneChangeDetector | Planned | — |
| LivePerceptionSource (RAG) | Planned | — |
| Screen capture adapter | Planned | — |
| Game capture adapter | Planned | — |
| ADB device adapter | Planned | — |
| CBarFrame (unified media repr) | Planned | — |
- Media Pipeline Plan — detailed RAG integration, tiered perception, caching
- Vision & Media Architecture — image processing, format conversion, budget
- Live Call Architecture — LiveKit, STT, TTS
- Genome Architecture — LoRA adapters, traded commodities pattern