Real-time live transcription with instant fact-checking against your company documents — while the meeting is still happening.
Someone in a meeting says "Invoice 10332 totals $4,500" — but the actual invoice says $3,850. Nobody catches it. The decision moves forward on bad data.
Enterprise teams make critical decisions referencing documents they can't search fast enough — invoices, purchase orders, shipping records, inventory reports. Tables break across PDF pages, exact IDs get buried in 800+ near-identical documents, and nobody fact-checks claims in real time.
AIlways changes that. It listens to your meeting, identifies factual claims as they're spoken, and verifies them against your documents — live, with cited evidence.
| What happens | How |
|---|---|
| You speak into your mic (or capture an entire meeting) | Browser AudioWorklet captures 16kHz PCM, streamed over WebSocket |
| Words appear in real time with speaker diarization | Deepgram Nova 3 with per-channel speaker identification |
| Factual claims are extracted as you speak | Groq Llama 3.3 70B detects verifiable assertions in ~200ms |
| Each claim is verified via an Adaptive Corrective-RAG pipeline | LangGraph state machine with smart query routing |
| You see ✅ Supported, ❌ Contradicted, or Unverifiable | Verdict + confidence + exact quotes from source documents |
Two audio modes:
- Mic Only — captures your microphone for dictation, interviews, or solo note-taking.
- Meeting Mode — captures mic + system audio via screen share, with multichannel processing so your voice and remote participants are separated and diarized independently.
A multi-turn chat interface powered by a LangGraph ReAct agent with 5 specialized tools. The agent decides how to answer each query — a question like "how many invoices from July 2016?" routes to an SQL fast-path (~83ms, zero LLM), while "what's the total on invoice 10332?" triggers an entity-boosted hybrid search. Streaming answers with inline citations, confidence scoring, and evidence sufficiency panel.
Organize documents into vaults with role-based access (owner / editor / viewer). Upload PDFs, TXT, or Markdown. Documents are parsed, chunked, embedded, and indexed automatically through an event-driven pipeline — 300 documents ingested in under 15 seconds.
Every transcription session is persisted with full transcript, speaker attribution, and all detected claims with their verdicts. Review past sessions, rename them, search across history.
| Metric | Result | How |
|---|---|---|
| PDF parsing (pdfplumber) | 830 docs in 6s (~7ms/doc) | Process pool, zero LLM |
| End-to-end ingestion | 300 docs in ~15s | Kafka batch workers + single embedding API call |
| Aggregate queries | ~83ms | SQL fast-path — regex classify → ORM query → deterministic verdict, no LLM |
| Claim detection | ~200ms per batch | Groq Llama 3.3 70B |
| Point queries (RAG) | 1–3s | Hybrid search + LLM grading + LLM synthesis |
| Transcription | Real-time | Deepgram Nova 3 streaming WebSocket |
| Copilot first token | <1s | Streaming SSE |
flowchart TB
subgraph Client["Browser"]
MIC[🎙️ Microphone] --> AW[AudioWorklet<br/>16kHz PCM]
SYS[🖥️ System Audio] -.->|meeting mode| AW
AW --> WS_OUT[WebSocket]
WS_IN[WebSocket] --> UI[Live Transcript<br/>+ Claim Cards]
CHAT[Copilot Chat] --> SSE[SSE Stream]
end
subgraph Frontend["Next.js 16 — BFF"]
PAGES[App Router] --> API_ROUTES[API Routes<br/>Proxy + Auth]
end
subgraph Backend["FastAPI"]
direction TB
AUTH[Auth<br/>Argon2 + CSRF + Sessions]
WS_HANDLER[WS Handler<br/>Live Transcription]
AGENT[Copilot Agent<br/>LangGraph ReAct]
UPLOAD[Upload Router<br/>Document Ingestion]
end
subgraph Transcription["Live Pipeline"]
direction TB
DG[Deepgram Nova 3<br/>Real-time STT]
BUFFER[Transcript Buffer<br/>Claim Batching]
DETECT[Groq Llama 3.3 70B<br/>Claim Extraction]
CRAG[LangGraph CRAG<br/>Adaptive Verification]
end
subgraph RAG["Hybrid RAG"]
direction TB
PARSE[PDF Parser<br/>pdfplumber · tables · process pool]
CHUNK[Recursive Chunker<br/>512 tok · Markdown-aware]
EMBED[OpenAI Embeddings<br/>text-embedding-3-large]
SEARCH[4-Way Retrieval<br/>Dense · BM25 · Entity · SQL]
GEN[GPT-4o-mini<br/>Cited Generation]
end
subgraph Infra["Infrastructure"]
PG[(ParadeDB<br/>pgvector + BM25)]
REDIS[(Redis<br/>Sessions · Cache)]
KAFKA[Redpanda<br/>Event Streaming]
WORKERS[Workers<br/>Ingestion · Deletion · Audit]
end
WS_OUT --> WS_HANDLER
WS_HANDLER --> DG
DG --> BUFFER --> DETECT --> CRAG
CRAG --> SEARCH
WS_HANDLER --> WS_IN
CHAT --> API_ROUTES --> AGENT --> SEARCH --> GEN
UPLOAD --> KAFKA --> WORKERS
WORKERS --> PARSE --> CHUNK --> EMBED --> PG
SEARCH --> PG
AUTH --> REDIS
The system uses rule-based regex classification (zero LLM cost, sub-millisecond) to route every query to the optimal path:
User query
│
├─ rewrite_query() ── coreference resolution via gpt-4o-mini
│
├─ classify_query_type() ── regex: 13 aggregate patterns + 2-tier point override
│
▼
┌────────────────┐ ┌────────────────┐
│ AGGREGATE │ │ POINT │
│ "how many │ │ "what's the │
│ invoices?" │ │ total on │
│ │ │ invoice 10332"│
├────────────────┤ ├────────────────┤
│ parse filters │ │ extract entity │
│ (regex: type, │ │ IDs (regex) │
│ date, cust.) │ │ │
│ ↓ │ │ ↓ │
│ SQL ORM query │ │ entity_search │
│ COUNT/SUM/LIST │ │ + hybrid_search│
│ ↓ │ │ (dense + BM25) │
│ deterministic │ │ ↓ │
│ verdict │ │ RRF + MMR │
│ confidence:1.0 │ │ ↓ │
│ ↓ │ │ LLM grading │
│ ~83 ms │ │ ↓ │
│ zero LLM calls │ │ LLM synthesis │
│ │ │ ↓ │
│ │ │ 1-3 seconds │
└────────────────┘ └────────────────┘
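The routing step above can be sketched in a few lines. This is a minimal illustration only: `classify_query_type` is the name used in the diagram, but the patterns below are a hypothetical subset (the real classifier is described as having 13 aggregate patterns and a 2-tier point override), and the sketch returns only two classes.

```python
import re

# Hypothetical subset of aggregate patterns (assumption, not the project's list).
AGGREGATE_PATTERNS = [
    re.compile(r"\bhow many\b", re.I),
    re.compile(r"\b(count|number) of\b", re.I),
    re.compile(r"\b(all|list|every)\b.*\b(invoices|orders|documents)\b", re.I),
]

# Simplified point override: a specific numeric ID forces the point path.
ENTITY_ID = re.compile(r"\b(invoice|order|po)\s*#?\s*\d{3,}\b", re.I)


def classify_query_type(query: str) -> str:
    """Return "point" or "aggregate" using regex only (no LLM call)."""
    if ENTITY_ID.search(query):
        return "point"
    if any(p.search(query) for p in AGGREGATE_PATTERNS):
        return "aggregate"
    return "point"  # default to retrieval when nothing matches
```

Checking the entity override first means a query that names a single document is never sent to the SQL fast-path, even if it also contains a word like "total".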
🎙️ Real-Time Transcription Engine
The frontend captures audio through the Web Audio API's AudioWorkletNode, running on the audio rendering thread for zero-skip processing:
- Mic-only mode: Mono PCM at 16kHz with explicit `channelCount: 1, channelCountMode: "explicit"` to prevent the browser from upmixing mono to stereo (which would send garbled interleaved data to Deepgram).
- Meeting mode: System audio captured via `getDisplayMedia`, merged with mic audio through a `ChannelMerger` node into stereo. Channel 0 = mic (local user), Channel 1 = system (remote participants). The video track is immediately discarded.
- Buffering: The worklet accumulates 4,096 frames (~256ms) before posting to the main thread, reducing WebSocket message rate from ~125/s to ~4/s.
- Encoding: Float32 → Int16 PCM conversion (`s < 0 ? s * 0x8000 : s * 0x7FFF`) in the main thread before sending over WebSocket.
- AudioContext guard: Explicit `resume()` call handles Chrome's autoplay policy when the user gesture context has expired after async operations.
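The encoding and buffering arithmetic can be checked with a short Python mirror of the browser-side logic. This is a sketch for illustration, not the frontend code: the function name and the clamping step are assumptions, while the scale factors and buffer size come from the list above.

```python
import struct
from typing import Iterable

SAMPLE_RATE_HZ = 16_000
WORKLET_BUFFER_FRAMES = 4_096


def float32_to_pcm16(samples: Iterable[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM."""
    out = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clamp (assumption) to avoid int16 overflow
        # Asymmetric scaling: negatives reach -32768, positives reach 32767.
        out.append(int(s * 0x8000) if s < 0 else int(s * 0x7FFF))
    return struct.pack(f"<{len(out)}h", *out)


# 4,096 frames at 16 kHz is roughly a quarter of a second per message.
buffer_ms = WORKLET_BUFFER_FRAMES / SAMPLE_RATE_HZ * 1000
messages_per_second = SAMPLE_RATE_HZ / WORKLET_BUFFER_FRAMES
```

With these constants `buffer_ms` comes out at about 256 and `messages_per_second` at about 3.9, matching the figures quoted in the buffering bullet.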
- Model: Nova 3 with real-time streaming + speaker diarization.
- Multichannel processing: In meeting mode, `multichannel=true` tells Deepgram to process each channel independently, so the local speaker is always identified correctly on channel 0.
- Speaker remapping: Channel 0 → always Speaker 0 (local user). Channel 1 → Speaker N+1 (avoids ID collision).
- Endpointing: 500ms silence threshold for utterance boundaries.
- Interim results: Sent to the client for immediate visual feedback; replaced when final results arrive.
Five concurrent async tasks run per WebSocket session, coordinated via asyncio.Event:
| Task | Purpose | Interval |
|---|---|---|
| Audio sender | Forwards browser PCM → Deepgram | Continuous |
| Receiver loop | Deepgram → buffer → client | ~500ms |
| Flush timer | Triggers claim extraction | 1s |
| DB flush | Batch-writes segments to Postgres | 2s |
| Heartbeat | WebSocket keep-alive ping | 30s |
One-time ticket system: client requests a ticket (POST /auth/ws-ticket), which is stored in Redis with a 60s TTL. On WebSocket connect, the ticket is atomically consumed (get + delete via Redis pipeline) — prevents replay attacks. No long-lived tokens in query params.
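The one-time ticket flow can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions: `TicketStore` and its methods are hypothetical names, and a plain dictionary replaces Redis, so it shows the consume-once semantics rather than the real storage layer.

```python
import secrets
import time
from typing import Optional


class TicketStore:
    """In-memory stand-in for the Redis ticket store (illustration only)."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self._ttl = ttl_seconds
        self._tickets: dict[str, tuple[str, float]] = {}

    def issue(self, user_id: str) -> str:
        """Create a random ticket bound to a user, valid for the TTL."""
        ticket = secrets.token_urlsafe(32)
        self._tickets[ticket] = (user_id, time.monotonic() + self._ttl)
        return ticket

    def consume(self, ticket: str) -> Optional[str]:
        """Return the user ID once; any later call for the same ticket fails."""
        # pop() reads and deletes in one step, so a replayed ticket finds nothing.
        entry = self._tickets.pop(ticket, None)
        if entry is None:
            return None
        user_id, expires_at = entry
        return user_id if time.monotonic() < expires_at else None
```

Because the read and the delete happen together, a second connection attempt with the same ticket finds nothing. That is the property the Redis pipeline provides in the description above.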
🧠 Claim Detection & Adaptive Corrective-RAG Verification
Claims are extracted from speech segments using a prompt that:
- Identifies 5 statement types: point lookups, aggregate queries, factual assertions, category/inventory queries, and comparisons.
- Produces self-contained claims with full entity references (no dangling pronouns).
- Normalizes numbers (removes thousand separators that Deepgram's `smart_format` inserts).
- Skips opinions, greetings, hypotheticals, and questions.
- Uses a sliding context window (last 10 checked segments) for co-reference resolution.
- Preserves short segments containing entity anchors (numeric IDs, currency amounts) even below the min-word threshold.
Performance: ~200ms per batch via Groq's optimized inference. Exponential backoff with 3 retries.
Standard Jaccard would flag "invoices from July 2016" and "invoices from August 2016" as duplicates (~80% overlap). The system applies discriminator boosting — semantically important tokens (months, numbers ≥3 digits, entity IDs, uncommon words) are replicated 3× in the word bag. This drops the overlap to ~50%, well below the 0.8 dedup threshold.
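A minimal sketch of discriminator boosting follows, assuming a simplified rule set: only month names and numbers with three or more digits are boosted, whereas the description also mentions entity IDs and uncommon words. The function names are hypothetical. Because the sketch uses only the short phrases from the example, the absolute scores differ from the figures quoted above, but the effect is the same.

```python
import re
from collections import Counter

MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}
BOOST = 3  # discriminating tokens are counted three times


def boosted_bag(text: str) -> Counter:
    """Build a word bag in which discriminating tokens carry extra weight."""
    bag: Counter = Counter()
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        is_discriminator = tok in MONTHS or (tok.isdigit() and len(tok) >= 3)
        bag[tok] += BOOST if is_discriminator else 1
    return bag


def bag_jaccard(a: Counter, b: Counter) -> float:
    """Multiset Jaccard: sum of min counts over sum of max counts."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0


a, b = "invoices from July 2016", "invoices from August 2016"
plain = bag_jaccard(Counter(a.lower().split()), Counter(b.lower().split()))
boosted = bag_jaccard(boosted_bag(a), boosted_bag(b))
```

Here `plain` is 3/5 and `boosted` is 5/11, so the differing month pulls the two statements further apart once it is weighted.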
Each claim is verified through a LangGraph state machine that adapts its strategy based on query type:
START → classify (regex, 0ms)
│
├─ AGGREGATE ──→ SQL fast-path
│ ├─ success (1.0 confidence) → END
│ └─ metadata gaps → fallback retrieval → synthesise → END
│
└─ POINT ──→ entity-boosted hybrid search
→ LLM relevance grading
├─ relevant → synthesise verdict → END
└─ not relevant → transform query → retry (up to N) → synthesise → END
Point path (Corrective-RAG loop):
- Retrieve: Entity-ID boosted hybrid search (exact SQL `ILIKE` + dense + BM25 with RRF + MMR at λ=1.0 for pure relevance).
- Grade: LLM checks if retrieved evidence is actually relevant (JSON: `{"relevant": true/false}`).
- Transform & retry: If evidence is irrelevant, an LLM rewrites the query and retrieval runs again.
- Synthesise: LLM produces verdict: `supported`/`contradicted`/`unverifiable` with confidence, explanation, and exact quotes.
Aggregate fast-path (zero LLM):
- Parse: Regex extracts intent (count/sum/average/list), document type, date range, customer ID.
- Validate: Checks for metadata gaps (e.g., missing `total_price` prevents sum).
- Query: ORM `COUNT`/`SUM` on the `Document` table (indexed columns).
- Return: Deterministic verdict with confidence 1.0. ~83ms end-to-end.
Caching: Redis-backed verdict cache keyed on vault_id + vault.updated_at + SHA256(statement). Cache is automatically invalidated when any document in the vault changes.
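The invalidation behaviour follows directly from the key layout, which a short sketch makes concrete. The function name, the `verdict:` prefix, and the whitespace/case normalization are assumptions; only the three key components come from the description above.

```python
import hashlib
from datetime import datetime, timezone


def verdict_cache_key(vault_id: str, vault_updated_at: datetime, statement: str) -> str:
    """Build a cache key that changes whenever the vault changes."""
    # Normalizing the statement before hashing is an assumption of this sketch.
    normalized = statement.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"verdict:{vault_id}:{vault_updated_at.isoformat()}:{digest}"


before = datetime(2024, 1, 1, tzinfo=timezone.utc)
after = datetime(2024, 1, 2, tzinfo=timezone.utc)
key_old = verdict_cache_key("vault-1", before, "Invoice 10332 totals $4,500")
key_new = verdict_cache_key("vault-1", after, "Invoice 10332 totals $4,500")
```

Since `vault.updated_at` is part of the key, a document change produces a different key for the same statement. Old entries are never read again and no explicit delete is needed.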
🤖 Agentic RAG Copilot — LangGraph ReAct Agent
The copilot is a LangGraph ReAct agent — an LLM that reasons about which tools to call, observes the results, and decides whether to call more tools or produce a final answer. The agent has access to 5 tools:
| Tool | Used for | Retrieval method |
|---|---|---|
| `search_documents` | Semantic questions | Hybrid search (dense + BM25 + RRF + MMR) |
| `lookup_entity` | Entity-specific queries | Direct SQL `ILIKE` on chunk content |
| `filter_documents` | Aggregate / enumeration queries | ORM-based structured filtering (type, date, customer) |
| `get_full_document` | Detailed document analysis | All chunks concatenated (up to 30K chars) |
| `compute` | Math over tool results | Safe Python eval (restricted builtins) |
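For the `compute` tool, one way to evaluate arithmetic safely is to walk the parsed expression and allow only a small set of node types. Note that this is a stricter AST-based variant shown for illustration, not the restricted-builtins `eval` the table describes, and `safe_compute` is a hypothetical name.

```python
import ast
import operator

_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_FUNCS = {"sum": sum, "min": min, "max": max, "round": round, "len": len}


def safe_compute(expression: str):
    """Evaluate arithmetic over numbers and lists; reject everything else."""

    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        if isinstance(node, (ast.List, ast.Tuple)):
            return [ev(e) for e in node.elts]
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in _FUNCS and not node.keywords):
            return _FUNCS[node.func.id](*[ev(a) for a in node.args])
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return ev(ast.parse(expression, mode="eval"))
```

Anything outside the whitelist, such as attribute access or a call to an unlisted function, raises `ValueError` instead of being executed.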
Before the agent runs, the query is classified via zero-cost regex and a planning hint is injected into the system prompt:
- Aggregate → "Use `filter_documents`, NOT `search_documents`"
- Compute → "Use `filter_documents` then `compute`"
- Point → "Use `lookup_entity` or `search_documents`"
This prevents the common failure mode where an LLM picks the wrong retrieval strategy for aggregate queries.
| Guard | Mechanism |
|---|---|
| Iteration limit | Max 6 agent loops — if exceeded, forces a "compile final answer" with what it has |
| Full-doc budget | Max 3 get_full_document calls per query — prevents context window overflow |
| Per-call timeout | 45s per LLM call — returns graceful fallback on timeout |
| Overall timeout | 120s for entire agent run |
| Error isolation | SafeToolNode catches tool exceptions and returns error messages instead of crashing |
This is how "how many invoices from July 2016?" gets answered in ~83ms:
- Parse NL → structured filters (regex, no LLM):

  ```
  parse_document_type("invoices from July 2016") → "invoice"
  parse_date_range("invoices from July 2016") → (2016-07-01, 2016-07-31)
  parse_customer_id(...) → None
  ```

- Build ORM query (SQLAlchemy, not raw SQL):

  ```python
  select(Document).where(
      Document.vault_id == vault_id,
      Document.document_type == "invoice",
      Document.order_date >= date(2016, 7, 1),
      Document.order_date <= date(2016, 7, 31),
  )
  ```

- Execute against indexed columns (`ix_documents_vault_doctype`, `ix_documents_vault_orderdate`).
- Return formatted list with metadata (title, entity_id, date, customer, price) + count + grand total.
No embeddings. No LLM. Pure indexed SQL.
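The parsing step can be sketched with the standard library alone. The function names `parse_document_type` and `parse_date_range` come from the walkthrough above; their bodies, the month-plus-year pattern, and the document-type values other than `"invoice"` are assumptions made for illustration.

```python
import calendar
import re
from datetime import date
from typing import Optional

MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]
MONTH_YEAR = re.compile(r"\b(" + "|".join(MONTH_NAMES) + r")\s+(\d{4})\b", re.I)

# Hypothetical phrase-to-type mapping; only "invoice" appears in the source.
DOC_TYPES = {
    "invoice": "invoice",
    "purchase order": "purchase_order",
    "shipping order": "shipping_order",
}


def parse_document_type(query: str) -> Optional[str]:
    """Return the first known document type mentioned in the query."""
    q = query.lower()
    for phrase, doc_type in DOC_TYPES.items():
        if phrase in q:
            return doc_type
    return None


def parse_date_range(query: str) -> Optional[tuple[date, date]]:
    """Turn "<Month> <Year>" into the first and last day of that month."""
    m = MONTH_YEAR.search(query)
    if not m:
        return None
    month = MONTH_NAMES.index(m.group(1).lower()) + 1
    year = int(m.group(2))
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)
```

Using `calendar.monthrange` for the end date avoids hard-coding month lengths and handles leap years.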
After the agent completes, citations are extracted from tool responses by parsing [Source: filename] headers, === FULL DOCUMENT: ... === markers, and numbered list formatting from filter_documents output.
📄 RAG Pipeline — Ingestion to Retrieval
Ingestion Pipeline
Event-driven architecture with Kafka (Redpanda) for throughput and resilience:
Upload → SHA-256 dedup → Kafka → Worker batch (up to 20 docs)
├─ Parse (pdfplumber, process pool) ~7ms/doc
├─ Chunk (512 tok, Markdown-aware) ~1ms/doc
├─ Metadata (regex + concurrent LLM enrichment) ~7s total for batch
├─ Embed ALL chunks in ONE API call single request
└─ Bulk insert with per-doc savepoints atomic per document
Key design choices:
- Cross-document embedding batching: Chunks from all documents in a batch (up to 2,048 texts) are concatenated into a single OpenAI API call. This is the single biggest throughput optimization, 5–10× faster than per-document embedding.
- Concurrent metadata enrichment: LLM metadata extraction for all documents runs concurrently behind an `asyncio.Semaphore(10)`. Wall-clock time is ~7s regardless of batch size.
- Savepoint isolation: Each document gets a SQL savepoint (`BEGIN NESTED`). One corrupt PDF doesn't roll back the entire batch.
- Graceful degradation: If Kafka is unavailable, uploads fall back to synchronous ingestion. If LLM metadata fails, regex-only extraction continues.
- Dead letter queue: Failed events route to `ingestion.dlq` with retry metadata. DLQ failures never crash the consumer.
- Stuck document recovery: The worker runner periodically re-queues documents stuck in `pending`/`ingesting` for more than 5 minutes.
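The concurrent-enrichment and graceful-degradation choices above can be sketched together. `enrich_all` and the `enrich_one` callback are hypothetical names; the only detail taken from the text is the semaphore limit of 10 and the rule that a failed LLM call must not abort the batch.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, Optional


async def enrich_all(
    doc_ids: Iterable[Any],
    enrich_one: Callable[[Any], Awaitable[dict]],
    limit: int = 10,
) -> list[Optional[dict]]:
    """Run enrichment for every document with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(doc_id: Any) -> Optional[dict]:
        async with semaphore:
            try:
                return await enrich_one(doc_id)
            except Exception:
                # Degrade gracefully: None signals "keep regex-only metadata".
                return None

    # gather() preserves input order, so results line up with doc_ids.
    return await asyncio.gather(*(guarded(d) for d in doc_ids))
```

A caller would run this once per batch, for example `asyncio.run(enrich_all(ids, call_llm))`, where `call_llm` stands in for whatever coroutine performs the actual request.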
PDF Parsing — pdfplumber with Table Heuristics
Custom pipeline producing clean Markdown (~7ms per document, zero LLM):
| Feature | Implementation |
|---|---|
| Dual table strategy | ≥4 edges → bordered (line-based detection); fewer → borderless (text-alignment) |
| Multi-page tables | Column alignment heuristics merge tables across page breaks (≤15px tolerance) |
| Header deduplication | Removes repeated headers when a table continues on a new page |
| Right-edge repair | Appends overflow characters beyond the table's right boundary |
| Alpha bleed fix | Separates characters bleeding from text columns into adjacent numeric columns |
| Spillover cleanup | Strips title text that leaked across table cells |
| Process pool | 2 workers bypass the GIL for concurrent PDF processing |
Text extraction classifies each line by font size and position into headings, key-value pairs, footnotes, and body text — all sorted by y-position to maintain reading order.
Chunking
RecursiveCharacterTextSplitter with tiktoken (cl100k_base):
- 512 tokens per chunk, 50 token overlap.
- Markdown-aware separator hierarchy: `##` → `###` → `\n\n` → `\n` → `. ` → ` ` → `""`.
- Source header prepended: `[Source: invoice 10332]`, which boosts BM25 relevance for ID-specific queries.
- Content-addressable dedup via SHA-256 hash per chunk.
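The last two points can be illustrated without the splitter itself. This sketch assumes the header is placed on its own first line and that the hash is taken over the chunk text; both function names are hypothetical.

```python
import hashlib


def with_source_header(chunk_text: str, source: str) -> str:
    """Prepend the source label so keyword search can match the document ID."""
    return f"[Source: {source}]\n{chunk_text}"


def dedup_chunks(chunks: list[str]) -> list[tuple[str, str]]:
    """Keep the first occurrence of each chunk, keyed by its SHA-256 digest."""
    seen: set[str] = set()
    unique: list[tuple[str, str]] = []
    for text in chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((digest, text))
    return unique
```

Whether the hash should cover the header as well as the body is a design decision this sketch leaves open; hashing the body alone would treat identical text from two different documents as one chunk.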
Metadata Extraction — Two-Phase
Phase 1: Regex (always, <1ms) — extracts document_type, entity_id, order_date, customer_id, total_price from filename conventions and content patterns.
Phase 2: LLM Enrichment (optional, ~7s concurrent for batch) — extracts summary, keywords, hypothetical_questions (HyDE), and structured entities.
Merge strategy: Regex wins for core identifiers; LLM wins for enrichment fields.
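The merge rule can be written as a small function. The field lists follow the two phases described above; the key name `entities`, the function name, and the fallback to the other source when a value is missing are assumptions of this sketch.

```python
from typing import Optional

CORE_FIELDS = ("document_type", "entity_id", "order_date", "customer_id", "total_price")
ENRICHMENT_FIELDS = ("summary", "keywords", "hypothetical_questions", "entities")


def merge_metadata(regex_meta: dict, llm_meta: Optional[dict]) -> dict:
    """Combine both phases: regex wins for core IDs, LLM wins for enrichment."""
    llm_meta = llm_meta or {}  # LLM phase is optional and may have failed
    merged = {}
    for field in CORE_FIELDS:
        value = regex_meta.get(field)
        merged[field] = value if value is not None else llm_meta.get(field)
    for field in ENRICHMENT_FIELDS:
        value = llm_meta.get(field)
        merged[field] = value if value is not None else regex_meta.get(field)
    return merged
```

Passing `None` for the LLM result yields regex-only metadata, which matches the degraded mode described in the ingestion design choices.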
HyDE Metadata Chunks
A synthetic chunk is generated per document containing its structured metadata + summary + keywords + hypothetical questions. This chunk participates in all search paths (dense, BM25, hybrid) with zero retrieval-layer changes. It bridges the vocabulary gap — a user asking "What was the total on invoice 10248?" matches the hypothetical question "What is the total price of invoice 10248?" via dense search, AND matches keywords via BM25.
4-Way Retrieval Strategy
| Strategy | Method | When | Speed |
|---|---|---|---|
| Dense | pgvector `<=>` (cosine distance) | Semantic queries | ~50ms |
| Sparse | ParadeDB BM25 `@@@` operator | Exact keyword matching | ~30ms |
| Entity | SQL `ILIKE` on chunk content | Entity ID detected in query | ~10ms |
| SQL Metadata | ORM query on `Document` table | Aggregate/filter queries | ~5ms |
Fusion pipeline:
- Parallel fetch: Dense and sparse run concurrently with `fetch_k = max(top_k × 6, 40)`.
- Reciprocal Rank Fusion: Merges ranked lists: $\text{score}(d) = \sum \frac{1}{k + \text{rank} + 1}$ with $k = 60$.
- Maximal Marginal Relevance: Selects final `top_k` balancing relevance and diversity (λ=0.7 for copilot, λ=1.0 for claim verification).
- Entity merge: SQL entity results (score=1.0) merged with hybrid results, duplicates removed.
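The first two fusion steps can be sketched as follows. The formula and constants are the ones stated above, with rank treated as zero-based (which is what the `+ 1` in the denominator suggests); the function names are hypothetical, and MMR and the entity merge are left out.

```python
from collections import defaultdict


def fetch_k(top_k: int) -> int:
    """Over-fetch candidates so fusion and MMR have room to re-rank."""
    return max(top_k * 6, 40)


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge ranked ID lists using score(d) = sum of 1 / (k + rank + 1)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):  # rank is zero-based
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


dense = ["a", "b", "c"]
sparse = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense, sparse])
```

In this toy input `b` ends up first because it ranks highly in both lists. Only positions are used, which is why no score normalization between dense and BM25 results is needed.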
Generation
- Model: GPT-4o-mini (temperature 0.1, JSON mode).
- Strict grounding: Answers only from provided context. Reports when evidence is insufficient.
- Output: Answer text, citations (doc title, section, page, exact quote), confidence score, `has_sufficient_evidence` flag.
- Streaming: Server-Sent Events with token deltas → final structured `done` event with parsed JSON.
- Embedding cache: Redis-backed query embedding cache (5-minute TTL).
🔐 Auth & Security
| Layer | Implementation |
|---|---|
| Password hashing | Argon2id via pwdlib (memory-hard, GPU-resistant) |
| Sessions | Redis-backed, 7-day TTL. session_id cookie (httpOnly, Secure, no JS access) |
| CSRF | Double-submit cookie pattern. csrf_token cookie (JS-readable) + X-CSRF-Token header on all mutating requests |
| WebSocket auth | One-time ticket: 60s TTL, atomically consumed via Redis pipeline on first use |
| Role-based access | Three-tier vault permissions (owner > editor > viewer) enforced at router level |
| Rate limiting | Registration and login: 5 requests/minute |
| Frontend | Next.js middleware with server-side getMe() validation + 401 → redirect deduplication |
| Redirect guard | Module-level boolean prevents multiple SWR fetchers from triggering parallel /signin redirects |
⚙️ Event-Driven Worker System
Three Kafka consumer workers, all extending a BaseWorker (Template Method pattern):
| Worker | Topic | Mode | Purpose |
|---|---|---|---|
| Ingestion | `file.events` (`file.uploaded`) | Batch (up to 20) | Parse → chunk → embed → store |
| Deletion | `file.events` (`file.deleted`) | Single event | Clean up chunks and embeddings |
| Audit | `audit.events` | Single event | Persist audit trail |
Event schemas use discriminated unions — a single file.events topic carries both file.uploaded and file.deleted events, discriminated by a type field.
Worker runner (python -m app.workers.runner):
- Auto-scales Kafka partition count to match `INGESTION_CONCURRENCY`.
- Recovers stuck documents (re-queues after configurable timeout).
- Graceful shutdown on SIGTERM/SIGINT.
- Supports starting individual workers or all at once.
Resilience patterns:
- At-least-once delivery with manual offset commit.
- Idempotent processing (re-delivered active documents are skipped).
- Failed events routed to DLQ; DLQ failures are logged but never re-raised.
- Synchronous fallback when Kafka is unavailable.
Why Hybrid Search over Dense-Only?
Dense embeddings fail catastrophically on structurally identical documents. When 830 invoices share the same PDF template and similar line items, their embeddings converge. Asking "What's the total for invoice 10332?" returns chunks from the wrong invoice. BM25 solves this — exact keyword matching finds "invoice 10332" reliably. RRF merges both without needing score normalization.
Why Regex Classification instead of LLM Routing?
Query classification runs on every user message and every detected claim. Using an LLM would add 200ms–1s per classification. The regex classifier (13 aggregate patterns + 2-tier point override) achieves the same accuracy for the supported query types at zero cost and zero latency. It correctly distinguishes "all invoices from July 2016" (aggregate) from "the invoice for VINET" (point) and "total price of all July 2016 invoices" (compute).
Why an SQL Fast-Path for Aggregates?
When a user asks "how many purchase orders from Q3 2016?", the answer is a COUNT query on an indexed column. Running it through embeddings + LLM would take 2–3 seconds and risk hallucination. The SQL fast-path (regex parse → ORM query → deterministic result) returns in ~83ms with 1.0 confidence. The system validates metadata completeness before using this path — if data gaps exist, it falls back to the full RAG pipeline.
Why Kafka (Redpanda) over Celery?
Native partitioning by vault ID guarantees ordering. Consumer groups enable horizontal scaling. Batch accumulation enables the cross-document embedding optimization (single API call for N documents). Dead letter queues are a first-class primitive. Synchronous fallback preserved for single-node deployments.
Why ParadeDB over Plain Postgres?
ParadeDB adds native BM25 full-text search (pg_search) alongside pgvector — both in-process, no external Elasticsearch needed. Single database for vectors, full-text, and relational data. Graceful degradation: if ParadeDB isn't installed, BM25 falls back silently and only dense search is used.
Why Groq for Claim Detection?
Claim detection runs on every speech pause (~every 2–6 seconds). At ~200ms per batch, Groq's Llama 3.3 70B is fast enough to feel real-time. GPT-4o would add 1–3 seconds per claim batch — noticeable during a live conversation.
Why Protocol-Based Architecture?
Parsers, chunkers, embedders, transcribers, and generators all use Python Protocol classes (@runtime_checkable). Swap Deepgram for Whisper, or OpenAI embeddings for Cohere — no inheritance chains, just implement the interface. Factory functions (get_parser, get_chunker, get_embedder) handle instantiation.
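A minimal sketch of the pattern, using only the standard library. The `embed` method signature, the `FakeEmbedder` class, and the registry inside `get_embedder` are assumptions for illustration; only the use of `Protocol`, `@runtime_checkable`, and a factory named `get_embedder` comes from the description above.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """Structural interface: any class with a matching embed() qualifies."""

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class FakeEmbedder:
    """Stand-in implementation; note that it does not inherit from Embedder."""

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]


def get_embedder(name: str = "fake") -> Embedder:
    """Factory: look up an implementation by name and instantiate it."""
    registry = {"fake": FakeEmbedder}
    return registry[name]()


embedder = get_embedder()
```

Note that `isinstance(embedder, Embedder)` only checks that the method exists, not its signature, so static type checking is still what catches a mismatched implementation.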
Why pdfplumber over LLM-Based Parsing?
LLM-based PDF parsing (e.g., GPT-4o vision) costs ~$0.01–0.05 per page and takes 2–5 seconds. For 2,676 enterprise PDFs, that's $50–250 and hours of processing. pdfplumber parses 830 documents in 6 seconds — zero API cost, deterministic output, and custom table heuristics handle the edge cases (multi-page tables, borderless layouts, alpha bleed) that generic parsers miss. LLM parsing is the right choice for unstructured documents with complex layouts; pdfplumber is the right choice for high-volume structured business documents.
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Next.js 16, React 19 | App Router, SSR, BFF proxy |
| | Tailwind CSS v4 | Styling + responsive design |
| | SWR + SSE | Data fetching + real-time streaming |
| | Web Audio API | AudioWorklet for zero-skip PCM capture |
| Backend | FastAPI + Uvicorn | Async API, WebSocket server |
| | LangGraph | ReAct agent + Corrective-RAG state machines |
| | SQLModel + Alembic | Async ORM + schema migrations |
| | Deepgram SDK v6 | Nova 3 real-time STT + diarization |
| | OpenAI | Embeddings (text-embedding-3-large), generation (GPT-4o-mini), verification (GPT-4o) |
| | Groq | Fast claim detection (Llama 3.3 70B, ~200ms) |
| | pdfplumber | Table-aware PDF → Markdown parsing |
| Infrastructure | ParadeDB | Postgres + pgvector (1536d) + BM25, single DB |
| | Redis 7 | Sessions, CSRF tokens, WS tickets, embedding cache, verdict cache |
| | Redpanda | Kafka-compatible event streaming + worker orchestration |
.
├── frontend/ # Next.js 16 — UI + BFF (6,200 lines)
│ ├── app/
│ │ ├── (auth)/ # Sign in / Sign up
│ │ ├── (app)/ # Protected: dashboard, copilot, transcription,
│ │ │ # vaults, sessions, history, settings
│ │ └── api/ # BFF proxy routes → backend
│ ├── components/ # Feature-scoped UI components
│ │ ├── copilot/ # Chat interface + citations
│ │ ├── transcription/ # Live recording + claim cards
│ │ ├── vaults/ # Vault management + document upload
│ │ ├── sessions/ # Session history browser
│ │ └── ui/ # Shared primitives
│ ├── hooks/ # useTranscription, useConversations
│ └── lib/ # API client, types, auth, utilities
│
├── backend/ # FastAPI — API + Workers (14,700 lines)
│ ├── app/
│ │ ├── api/routers/ # auth, vault, documents, query, transcription, sessions
│ │ ├── core/
│ │ │ ├── copilot/ # LangGraph agent, classification, tools, verification
│ │ │ ├── rag/ # parsing, chunking, embedding, retrieval, generation
│ │ │ ├── transcription/ # Deepgram, buffer, persistence
│ │ │ ├── kafka/ # producer, consumer, DLQ, topics
│ │ │ ├── auth/ # sessions, CSRF, WS tickets, RBAC
│ │ │ └── storage/ # file storage abstraction
│ │ ├── db/models/ # 10 SQLModel tables
│ │ └── workers/ # ingestion, deletion, audit + runner
│ └── tests/
│
├── data/ # 2,676 enterprise PDFs
│ └── CompanyDocuments/
│ ├── invoices/ # 830 invoices
│ ├── PurchaseOrders/ # 830 purchase orders
│ ├── Shipping orders/ # 809 shipping orders
│ └── Inventory Report/ # 2 stock reports
│
└── learnings/parsing/ # Standalone PDF parsing experiments
10 tables managed through SQLModel + Alembic:
users ─┬─► vaults ───► documents ───► chunks
│ │ │ ├── content + embedding (pgvector 1536d)
│ │ │ ├── content_with_header (BM25 indexed)
│ │ │ └── chunk_type: parent | child | metadata
│ │ │
│ │ ├── document_type, entity_id ← regex-extracted
│ │ ├── order_date, total_price ← regex-extracted
│ │ ├── summary, keywords ← LLM-enriched
│ │ └── hypothetical_questions ← HyDE
│ │
│ └─► vault_members (owner / editor / viewer)
│
└─► transcription_sessions ─┬─► transcription_segments
└─► transcription_claims
├── verdict: supported | contradicted | unverifiable
├── confidence, explanation
└── evidence_json (cited quotes)
audit_log (event sourcing — all auditable actions)
feedback (user quality ratings)
1. User signs in → session cookie + CSRF token set
2. Opens Transcription → selects vault → clicks "Start Recording"
3. Frontend requests one-time WS ticket (POST /auth/ws-ticket)
4. AudioContext created at 16kHz → AudioWorklet loaded
5. Mic access granted → PCM frames buffered at ~256ms chunks
6. WebSocket opened with ticket → backend authenticates (atomic Redis consume)
7. Backend opens Deepgram Nova 3 live WebSocket
8. Audio flows: Browser → Backend WS → Deepgram WS
┌─ Every ~0.5s: Deepgram returns transcript segments
│ → Buffered in TranscriptBuffer
│ → Persisted to DB (batched every 2s)
│ → Sent to client: { type: "transcript", speaker, text, ... }
│
├─ Every 1s: Flush timer checks should_trigger_claims()
│ → If speaker idle ≥ 2s OR batch interval ≥ 6s:
│ → Groq extracts claims (~200ms)
│ → Deduplicated via boosted Jaccard fingerprinting
│ → Client receives: { type: "claim_detected", ... }
│ → LangGraph CRAG verifies each claim:
│ ├─ Aggregate → SQL fast-path (~83ms)
│ └─ Point → hybrid search + LLM verdict (1-3s)
│ → Client receives: { type: "claim_verified", verdict, evidence }
│
└─ Every 30s: Heartbeat ping
9. User clicks Stop → finalize Deepgram → drain remaining segments + claims
10. Session persisted with full transcript, speaker attribution, and all verdicts
11. Reviewable anytime from Sessions page
Prerequisites: Python 3.12+, Node.js 20+, Docker
```bash
cd backend
cp .env.example .env        # Add API keys: DEEPGRAM_API_KEY, OPENAI_API_KEY, GROQ_API_KEY
docker compose up -d        # ParadeDB, Redis, Redpanda
```

```bash
cd backend
uv sync                       # Install dependencies
uv run alembic upgrade head   # Run migrations
uv run python -m app          # Start API server → http://localhost:8080
```

```bash
cd backend
uv run python -m app.workers.runner --worker all   # Start all workers
# Or individually:
# uv run python -m app.workers.runner --worker ingestion
# uv run python -m app.workers.runner --worker deletion
```

```bash
cd frontend
npm install
npm run dev   # Start dev server → http://localhost:3000
```

- Sign up at `localhost:3000` → create a vault → upload PDFs.
- Copilot → select your vault → ask questions → watch citations appear.
- Transcription → select vault → Start Recording → speak a factual claim → watch real-time verification.
Single Pydantic Settings class with nested sub-configs, all .env-backed with sensible defaults:
Settings # 50+ settings
├── ClaimConfig # 16 — detection batching, dedup threshold, concurrency
├── CopilotConfig # 20+ — agent iterations, timeouts, cache TTL, models
│ ├── Extraction # model, temperature, retries
│ ├── Verification # top_k, cache, MMR lambda
│ ├── Grading # relevance grading model
│ └── Agent # ReAct config: max iterations, timeouts, full-doc budget
├── TranscriptionConfig # 8 — session limits, DB flush interval, WS ticket TTL
├── MetadataConfig # 6 — LLM enrichment: summary words, keywords, HyDE questions
└── WorkerConfig # 3 — batch size, timeout, concurrency

- Async is hard, but essential for real-time. The ingestion pipeline performance (300 docs in 15s) was achieved solely through async workers and event-driven architecture with Redpanda.
- RAG is not just retrieval. The pipeline has multiple filtering stages (entity ID lookup, BM25 keyword matching, dense semantic search, SQL metadata queries, and re-ranking), each solving different failure modes.
- Metadata is the secret weapon. With proper metadata extraction and HyDE chunks at ingestion time, retrieval quality improves dramatically without changing the retrieval code.
- The right tool for each query type matters more than a better model. An indexed SQL COUNT returns in 83ms with perfect accuracy. No amount of prompt engineering makes an LLM faster or more reliable for that task.
- Building the RAG pipeline was harder than transcription. I initially expected real-time audio processing to be the biggest challenge. In practice, making retrieval work correctly across 2,676 structurally identical documents, and handling all the edge cases (multi-page tables, entity disambiguation, aggregate vs. point queries), was the most challenging and rewarding part.
MIT — see LICENSE.
