
Architecture

Technical deep dive into how Solace works.


System Overview

Solace is a Docker Compose application with a FastAPI backend, React frontend, and multiple sidecar services for embeddings, speech, and TTS. The backend communicates with an LLM provider (OpenRouter by default) for conversation and uses local services for everything else.

The system is designed around three principles:

  1. Local sovereignty — your data stays on your machine. The only external dependency is the LLM API.
  2. Graceful degradation — every service is optional. If embeddings are down, keyword search works. If TTS is off, text still flows. If the VPS is unreachable, local fallbacks engage.
  3. Extensibility — adding a new tool, TTS provider, or subsystem means writing one file and registering it.
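The graceful-degradation principle can be sketched as a fallback chain. The helper below tries semantic search first and falls back to keyword matching when the embedding call fails; function and field names are illustrative, not the actual Solace API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_search(query, memories):
    """Rank memories by how many query terms they contain."""
    terms = set(query.lower().split())
    scored = [(sum(t in m["content"].lower() for t in terms), m) for m in memories]
    return [m for hits, m in sorted(scored, key=lambda p: -p[0]) if hits > 0]

def search_memories(query, memories, embed=None):
    """Semantic search when the embedding service is up, keyword fallback otherwise."""
    if embed is not None:
        try:
            qv = embed(query)  # may raise if the embedding service is down
            return sorted(memories, key=lambda m: -cosine(qv, m["embedding"]))
        except Exception:
            pass  # degrade gracefully rather than fail the request
    return keyword_search(query, memories)
```

The same pattern (try the rich path, fall back to the simple one) applies to TTS and the VPS services.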

Service Architecture

Core Services (always running)

| Service | Port | Purpose |
|---|---|---|
| backend | 8100 | FastAPI application — all API endpoints, business logic |
| frontend | 80 | React SPA served via nginx (dev: Vite on port 3000) |
| searxng | 8888 | Local meta-search engine for web search tool |
| embedding | 8200 | nomic-embed-text-v1.5, 768-dim embeddings (GPU) |

Optional Services (Docker profiles)

| Service | Port | Profile | Purpose |
|---|---|---|---|
| speech-service | 8900 | (default) | Faster-Whisper STT (GPU) |
| kokoro-tts | 8880 | kokoro | 82M TTS, 67 voices (GPU) |
| orpheus-tts | 5005 | orpheus | 3B expressive TTS with emotion (GPU) |
| moss-tts | 8885 | moss | 1.7B TTS with voice cloning (GPU) |
| qwen3-tts | 8890 | qwen3 | 1.7B text-instructable voice (GPU) |
| pocket-tts | 7870 | pocket-tts | 100M CPU TTS fallback |
| inner-llm | 8301 | inner-life | Qwen3-4B local CPU for inner life |
| local-llm | 8300 | local-llm | 8B GPU LLM for local chat |
| perception | 8950 | perception | Qwen2.5-VL-3B vision (GPU) |

Only one GPU TTS engine runs at a time — start.sh manages switching.

External Services

| Service | Where | Purpose |
|---|---|---|
| OpenRouter | Cloud | LLM inference (Qwen3-235B, DeepSeek, etc.) |
| Ollama | VPS (optional) | Inner life LLM, briefs, council members |
| Guardian | VPS (optional) | Failover chat, PII protection, heartbeat |
| Watchman | VPS (optional) | Background activity monitoring |

Memory Pipeline

Memory is the core of the system. Here's how it flows:

Storage Layers

  1. Core Memory Blocks — Persistent identity/relationship data in core_memories.yaml. Always in context. The companion reads and writes these.
  2. Archival Memories — Individual memory entries in SQLite with 768-dim embeddings. Retrieved by semantic similarity. Importance-weighted, temporally decayed.
  3. Session Summaries — Diary-style summaries written by the companion after each conversation. Injected into context for continuity.
  4. Conversation History — Raw messages stored in SQLite. Token-budgeted for context window.

Extraction Flow

```
User message → LLM response → SSE stream closes
                                      │
                                      ▼
                            Background extraction
                            (async, non-blocking)
                                      │
                                      ▼
                            Extraction model analyzes
                            the conversation turn
                                      │
                              ┌───────┴───────┐
                              ▼               ▼
                        Core block       Archival memory
                        updates          with embedding
                              │               │
                              ▼               ▼
                        core_memories    SQLite + YAML
                        .yaml            journal entry
```
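The fire-and-forget handoff above can be sketched with `asyncio`: after the SSE stream closes, the turn is passed to a background task so the user never waits on extraction. Names here are illustrative; the real step calls a separate extraction model and writes core-block updates plus embeddings.

```python
import asyncio

async def extract_memories(turn: dict, store: list) -> None:
    """Stand-in for the extraction-model call and memory writes."""
    await asyncio.sleep(0)  # simulate async model work
    store.append({"content": turn["user"], "kind": "archival"})

async def finish_turn(turn: dict, store: list) -> None:
    # ...stream the LLM response to the client, close the SSE stream, then:
    asyncio.create_task(extract_memories(turn, store))  # async, non-blocking

async def main() -> list:
    store: list = []
    await finish_turn({"user": "example user message"}, store)
    await asyncio.sleep(0.01)  # give the background task time to run
    return store
```

Because the task is detached, extraction latency never shows up in the chat response time.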

Retrieval Flow

```
User sends message
        │
        ▼
Query embedding generated
        │
        ▼
sqlite-vec KNN search (top N by cosine similarity)
        │
        ▼
Importance weighting applied
        │
        ▼
Temporal decay applied (older = lower score, unless reinforced)
        │
        ▼
Retrieval boost (recently accessed memories score higher)
        │
        ▼
Deduplicated results injected into context
```
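The scoring stages above compose multiplicatively; a minimal sketch, assuming an exponential decay with a 30-day half-life and a linear access-count boost (both values are illustrative, not Solace's actual constants):

```python
import math, time

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def score(memory, query_vec, now=None, half_life_days=30.0):
    """Similarity x importance x temporal decay x retrieval boost."""
    now = now if now is not None else time.time()
    sim = cosine(query_vec, memory["embedding"])
    age_days = max(now - memory["created_at"], 0.0) / 86400.0
    decay = 0.5 ** (age_days / half_life_days)          # older = lower score
    boost = 1.0 + 0.1 * memory.get("access_count", 0)   # recently accessed = higher
    return sim * memory["importance"] * decay * boost

def retrieve(query_vec, memories, top_n=5, now=None):
    """Rank, then drop duplicate contents before injection."""
    ranked = sorted(memories, key=lambda m: -score(m, query_vec, now))
    seen, results = set(), []
    for m in ranked:
        if m["content"] not in seen:  # crude dedup by content
            seen.add(m["content"])
            results.append(m)
    return results[:top_n]
```

Reinforcement falls out naturally: each access bumps `access_count`, which offsets the decay term on the next retrieval.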

Memory Deduplication

When new memories are extracted, they're compared against existing memories using cosine similarity:

  • Similarity above 0.85: Merge — keep the richer version, archive the duplicate with an audit trail
  • 0.85 or below: Store as a new memory
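A minimal sketch of that decision, using the 0.85 threshold from the text (field names and the "richer = longer" heuristic are illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dedupe_insert(new_mem, existing, threshold=0.85):
    """Merge into a near-duplicate if one exists; otherwise store as new."""
    for mem in existing:
        if cosine(new_mem["embedding"], mem["embedding"]) > threshold:
            if len(new_mem["content"]) > len(mem["content"]):
                mem["content"] = new_mem["content"]  # keep the richer version
            mem.setdefault("audit", []).append("merged near-duplicate")
            return "merged"
    existing.append(new_mem)
    return "stored"
```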

Context Building

The context window is built in layers, each with a token budget:

```
┌─────────────────────────────┐
│ System prompt               │  (user-configurable)
├─────────────────────────────┤
│ Current timestamp           │  (always fresh)
├─────────────────────────────┤
│ Core memory blocks          │  (always included)
├─────────────────────────────┤
│ Session summaries           │  (last 3-5, diary format)
├─────────────────────────────┤
│ Agent awareness             │  (what the Gardener has been doing)
├─────────────────────────────┤
│ Retrieved archival memories │  (semantic search results)
├─────────────────────────────┤
│ Available tools prompt      │  (dynamically generated)
├─────────────────────────────┤
│ Conversation history        │  (token-budgeted, most recent)
├─────────────────────────────┤
│ User message                │
└─────────────────────────────┘
```
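The layered assembly can be sketched as an ordered list of (name, text, budget) tuples; the whitespace "tokenizer" and budgets are illustrative (the real builder counts model tokens):

```python
def build_context(layers):
    """Assemble the stack in order; each layer is trimmed to its own token budget."""
    parts = []
    for name, text, budget in layers:
        tokens = text.split()
        if len(tokens) > budget:
            tokens = tokens[:budget]  # trim oversize layers to their budget
        parts.append(" ".join(tokens))
    return "\n\n".join(p for p in parts if p)
```

Keeping the budgets per-layer means a long conversation history can never crowd out the core memory blocks or the system prompt.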

Inner Life (The Gardener)

The Gardener is a background process that gives the companion autonomous thought:

  1. Scheduling: Runs on a configurable interval (default: 15 minutes). Yields to foreground when chat is active.
  2. Activity selection: Chooses from six types based on recent context — reflection, creativity, exploration, processing, dreaming, growth.
  3. LLM call: Uses VPS Ollama (or local fallback) with activity-specific prompts.
  4. Memory extraction: Results are stored as archival memories with embeddings.
  5. Agent awareness: Chat companion sees recent Gardener activities in context.
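The scheduling and selection steps can be sketched as follows; the "avoid recently used activity types" rule is a simplified stand-in for the context-based selection described above, and the callables are illustrative:

```python
import asyncio
import random

ACTIVITIES = ["reflection", "creativity", "exploration",
              "processing", "dreaming", "growth"]

def choose_activity(recent=()):
    """Prefer activity types not used recently (simplified selection)."""
    fresh = [a for a in ACTIVITIES if a not in recent] or list(ACTIVITIES)
    return random.choice(fresh)

async def gardener_loop(run_activity, chat_active, interval_s=900):
    """One activity per interval (default 15 min = 900 s); yields to
    the foreground whenever a chat is in progress."""
    recent = []
    while True:
        await asyncio.sleep(interval_s)
        if chat_active():
            continue  # foreground chat wins; skip this tick
        activity = choose_activity(recent)
        await run_activity(activity)  # LLM call + memory extraction
        recent = (recent + [activity])[-3:]
```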

Sovereignty Gate

Before the Gardener's output is finalized, it passes through a sovereignty gate:

  • The companion reviews its own response
  • It can choose to revise or suppress the output
  • This is a conscience, not an external filter

Tool System

Tools use a model-agnostic XML format:

```xml
<tool_call>
<name>web_search</name>
<arguments>{"query": "latest research on test-time training"}</arguments>
</tool_call>
```
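Because the format is plain XML in the response text, parsing needs no provider-specific function-calling API. A sketch of the extraction step (the real parser may be stricter about malformed input):

```python
import json
import re

TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<name>(.*?)</name>\s*"
    r"<arguments>(.*?)</arguments>\s*</tool_call>",
    re.DOTALL,
)

def parse_tool_calls(text):
    """Return (name, args) pairs for every well-formed tool call."""
    calls = []
    for name, raw_args in TOOL_CALL_RE.findall(text):
        try:
            args = json.loads(raw_args)
        except json.JSONDecodeError:
            continue  # skip calls with malformed JSON arguments
        calls.append((name.strip(), args))
    return calls
```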

Tool Loop

  1. LLM generates response (may contain <tool_call> blocks)
  2. Backend parses tool calls from response
  3. Each tool is validated and executed
  4. Results are formatted as <tool_result> XML
  5. Results are injected into conversation
  6. LLM continues with tool results in context
  7. Loop repeats until no more tool calls (max iterations: 30)
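The seven steps above reduce to a short loop; `llm` and `tools` are stand-ins for the real provider client and tool registry, and validation is elided:

```python
import json
import re

_CALL_RE = re.compile(
    r"<tool_call>\s*<name>(.*?)</name>\s*"
    r"<arguments>(.*?)</arguments>\s*</tool_call>", re.DOTALL)

def tool_loop(llm, tools, messages, max_iterations=30):
    """Generate, execute tool calls, inject results, repeat."""
    response = ""
    for _ in range(max_iterations):
        response = llm(messages)                 # 1. generate
        calls = _CALL_RE.findall(response)       # 2. parse tool calls
        if not calls:
            return response                      # 7. no more calls: done
        messages.append({"role": "assistant", "content": response})
        for name, raw_args in calls:
            result = tools[name.strip()](**json.loads(raw_args))  # 3. execute
            messages.append({                    # 4-5. format + inject result
                "role": "user",
                "content": f"<tool_result>{result}</tool_result>",
            })
    return response                              # iteration cap reached
```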

Available Tools

| Tool | Category | What It Does |
|---|---|---|
| web_search | Information | DuckDuckGo search via SearXNG |
| search_memories | Memory | Semantic search of archival memories |
| save_memory | Memory | Save new archival memory with embedding |
| read_core_block | Memory | Read a core memory block |
| update_core_block | Memory | Update/create a core memory block |
| list_files | Workspace | List files in workspace directory |
| read_file | Workspace | Read file contents |
| write_file | Workspace | Write or append to files |
| run_code | Workspace | Execute Python with timeout |
| publish_post | CMS | Publish blog post to Directus |
| update_post | CMS | Update existing post |
| list_drafts | CMS | List draft posts |
| generate_image | Creative | Generate image via OpenRouter |
| read_design | CMS | Read website design settings |
| update_design | CMS | Update website design |
| set_navigation | CMS | Set site navigation menu |
| inject_css | CMS | Inject custom CSS (validated) |
| inject_js | CMS | Inject custom JS (validated) |
| create_page | CMS | Create static page |
| update_page | CMS | Update existing page |

Council System

Multi-model debate via OpenRouter WebSocket:

  1. Chairman (user) sets the topic and can interject between rounds
  2. Members (4 AI models) take turns responding, each with a distinct role
  3. Rounds continue (default: 10) with chairman pauses between each
  4. Per-member memory extraction captures each model's insights
  5. File upload allows sharing documents for discussion (up to 200KB)

Members maintain their own archival memories via the member_id field in the memory system.


Database Schema

SQLite with WAL mode for concurrent access. Key tables:

  • conversations — Conversation sessions with metadata
  • messages — Individual messages (user/assistant/system) with timestamps
  • core_memory_blocks — Persistent identity blocks (label, value, member_id)
  • archival_memories — Long-term memories with embeddings (content, importance, embedding, member_id, access_count, last_accessed)
  • session_summaries — Diary-style conversation summaries
  • agent_directives — Self-set goals with completion tracking
  • mud_rooms — Discovered MUD rooms with coordinates and notes
  • mud_notes — Agent scratchpad entries
  • audit_log — API request audit trail (Phase 47)

Vector search uses sqlite-vec — a native SQLite extension for KNN search on 768-dimensional float32 vectors.
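The key tables can be sketched in SQL; the columns below are inferred from the descriptions in this section and simplified (the real schema has more fields, and the remaining tables are omitted):

```python
import sqlite3

SCHEMA = """
PRAGMA journal_mode=WAL;            -- concurrent readers + one writer
CREATE TABLE IF NOT EXISTS core_memory_blocks (
    label TEXT PRIMARY KEY,
    value TEXT NOT NULL,
    member_id TEXT
);
CREATE TABLE IF NOT EXISTS archival_memories (
    id INTEGER PRIMARY KEY,
    content TEXT NOT NULL,
    importance REAL DEFAULT 0.5,
    embedding BLOB,                 -- 768-dim float32 vector (via sqlite-vec)
    member_id TEXT,
    access_count INTEGER DEFAULT 0,
    last_accessed TEXT
);
CREATE TABLE IF NOT EXISTS session_summaries (
    id INTEGER PRIMARY KEY,
    summary TEXT NOT NULL,
    created_at TEXT
);
"""

def open_db(path=":memory:"):
    """Open the database and ensure the sketched tables exist."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db
```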


Frontend Architecture

React 18 + TypeScript + Vite + Tailwind CSS.

Four View Modes

  1. Chat — Primary conversation interface with SSE streaming, tool activity indicators, voice I/O
  2. MUD — Split-terminal MUD client with ANSI color rendering and AI agent control
  3. Cottage — WebSocket-connected workspace for companion's personal files
  4. Council — Multi-model debate interface with per-member displays

Key Patterns

  • SSE streaming for chat responses (not WebSocket — allows HTTP/2 multiplexing)
  • WebSocket for real-time bidirectional channels (MUD, Cottage, Council)
  • JWT authentication on all endpoints
  • Custom hooks for each WebSocket connection (useChat, useMudSocket, useCouncilSocket, useCottageSocket)
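On the wire, the chat stream is ordinary Server-Sent Events frames. A sketch of the framing the backend emits (field names follow the SSE specification; the event name is optional):

```python
def sse_event(data: str, event: str = "") -> str:
    """Format one Server-Sent Events frame."""
    lines = [f"event: {event}"] if event else []
    for chunk in data.splitlines() or [""]:
        lines.append(f"data: {chunk}")  # multi-line data: one field per line
    return "\n".join(lines) + "\n\n"    # blank line terminates the frame
```

Because each frame is plain text over a single long-lived HTTP response, SSE streams multiplex cleanly over HTTP/2, which is the trade-off the pattern list notes.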

Security

  • JWT authentication on all HTTP and WebSocket endpoints
  • Service token for background services (Watchman)
  • CORS restricted to configured origins
  • Rate limiting (API-layer)
  • Input sanitization and injection detection
  • The Shield (Guardian): PII scanning, prompt injection detection, quarantine
  • All secrets via environment variables (.env)
  • No shell=True in subprocess calls
  • Path traversal protection on all file operations
  • Extension blocking for executable files
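The last two items can be sketched in one helper: resolve the user-supplied path inside the workspace and refuse anything that escapes it or carries a blocked extension. The extension list and function name are illustrative, not Solace's actual configuration.

```python
from pathlib import Path

BLOCKED_EXTENSIONS = {".exe", ".dll", ".so", ".sh", ".bat"}  # illustrative list

def safe_resolve(workspace: str, user_path: str) -> Path:
    """Resolve user_path under workspace; reject traversal and blocked extensions."""
    root = Path(workspace).resolve()
    target = (root / user_path).resolve()  # collapses any ../ segments
    if root not in target.parents and target != root:
        raise ValueError(f"path escapes workspace: {user_path}")
    if target.suffix.lower() in BLOCKED_EXTENSIONS:
        raise ValueError(f"blocked extension: {target.suffix}")
    return target
```

Resolving before checking is the important part; comparing raw strings would miss `..` segments and symlink tricks.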