Stroma

Stroma is a neutral corpus and indexing substrate.

It owns the lowest-level operations needed to ingest text artifacts, chunk them, embed them, persist them in SQLite plus sqlite-vec, retrieve semantically close sections, and call OpenAI-compatible embedding and chat completion endpoints over a shared HTTP substrate. Callers consume Stroma through its APIs and treat the SQLite snapshot as an opaque local artifact. It does not own governance, specifications, compliance, drift analysis, prompt templates, product-specific output semantics, MCP, or CLI workflows.

Scope

Stroma is for products that need a reusable text corpus layer with:

  • canonical records with deterministic content fingerprints
  • pluggable chunking strategies via the chunk.Policy interface (MarkdownPolicy by default, KindRouterPolicy for per-record-kind dispatch, LateChunkPolicy for parent/leaf hierarchy)
  • pluggable embedders (Embedder / ContextualEmbedder) with a deterministic fixture and an OpenAI-compatible HTTP embedder
  • OpenAI-compatible chat completion client (chat.OpenAI) sharing the same substrate as embed.OpenAI: retry with Retry-After (capped), classified failures (auth / rate_limit / timeout / server / transport / schema_mismatch / dependency_unavailable), preserved lower-level causes on provider errors, APIToken redaction, custom HTTP client injection, and a product-neutral structured JSON helper
  • hybrid retrieval: dense vector + FTS5, fused via a pluggable FusionStrategy (RRFFusion by default) with per-arm provenance surfaced to downstream rerankers, plus explicit FTS-only SearchLexical for embedder-free fallback paths
  • record-level aggregation over existing chunk search hits (SearchRecords / AggregateSearchHitsByRecord) for selecting records before outline or context expansion
  • optional source spans on stored sections/hits for durable, product-neutral evidence handles over caller-defined units such as pages, lines, bytes, or characters
  • quantization knobs: float32 (default), int8 (4× smaller), binary (1-bit sign-packed vec0 prefilter that is 32× smaller for the prefilter representation; full-precision vectors are retained in a companion table for cosine rescoring, so total snapshot size is not 32× smaller). Binary snapshots stamp a binary_companion_validated_at marker at build/update commit so OpenSnapshot skips the per-row companion-table scan; Snapshot.VerifyBinaryCompanion is the strict-path opt-in for re-validating explicitly.
  • optional Matryoshka prefilter at a truncated dimension with full-dim cosine rescore (SearchParams.SearchDimension)
  • atomic rebuilds and incremental Update / UpdateFromSource / SyncFromSource with embedding reuse at the section level, chaining schema migrations v2 → v3 → v4 → v5 → v6 → v7 → v8 in one transaction. Post-commit validation defaults to IntegrityModeFast (skips whole-database SQLite PRAGMA scans, keeps Stroma-specific completeness checks); set BuildOptions.IntegrityMode / UpdateOptions.IntegrityMode to IntegrityModeFull for the deep integrity_check + foreign_key_check passes when diagnosing corruption.
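The size arithmetic behind the quantization knobs above can be sketched self-contained; bytesPerVector is an illustrative helper, not a Stroma API:

```go
package main

import "fmt"

// bytesPerVector sketches the per-vector storage footprint for each
// quantization knob (dim = embedding dimension).
func bytesPerVector(dim int) (float32Bytes, int8Bytes, binaryBytes int) {
	float32Bytes = 4 * dim      // default: 4 bytes per component
	int8Bytes = dim             // int8: 1 byte per component, 4x smaller
	binaryBytes = (dim + 7) / 8 // binary: 1 bit per component, 32x smaller prefilter
	return
}

func main() {
	f, i, b := bytesPerVector(1024)
	fmt.Println(f, i, b) // 4096 1024 128
	// Binary is only the prefilter representation; full-precision
	// vectors remain in a companion table for cosine rescoring, so
	// the total snapshot is not 32x smaller.
}
```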

Stroma is not for:

  • spec governance
  • source discovery or repository scanning
  • code compliance or doc drift analysis
  • prompt templates, system prompts, or semantic interpretation of structured chat responses
  • product-specific adapters and transports

Install

go get github.com/dusk-network/stroma/v3

Packages

  • corpus — canonical record model, NewRecord helper, Normalize, deterministic Fingerprint
  • chunk — Policy interface with MarkdownPolicy, KindRouterPolicy, LateChunkPolicy; MarkdownWithOptions returns ErrTooManySections when a body exceeds the DoS cap
  • embed — Embedder and ContextualEmbedder interfaces; deterministic Fixture; OpenAI-compatible HTTP embedder with MaxBatchSize batching, deadline scaling across batches, custom HTTP client injection, and APIToken redaction in String/GoString/LogValue
  • chat — OpenAI-compatible chat completion client (chat.OpenAI, chat.Message, ChatCompletionText, ChatCompletionJSON); tolerates string and multi-part array content; structured JSON responses decode into caller-owned targets and malformed JSON returns schema_mismatch; custom HTTP client injection; APIToken redaction parity with embed.OpenAIConfig
  • provider — shared HTTP substrate used by embed and chat: retry with capped Retry-After, response-size bounding, negative MaxRetries normalization to zero, and a stable FailureClass taxonomy surfaced via *provider.Error. Callers branch on FailureClass to retry / degrade / propagate, and can unwrap lower-level transport/decode causes where available
  • store — SQLite readiness probes, sqlite-vec readiness, conservative SQLite handle defaults (foreign_keys plus a busy timeout), opt-in SQLite tuning options for library embedders, quantization blob helpers (QuantizationFloat32 / QuantizationInt8 / QuantizationBinary)
  • index — atomic Rebuild plus streaming RebuildFromSource with embedding reuse and explicit reuse diagnostics, incremental Update / UpdateFromSource and full-corpus SyncFromSource with MaxPlannedRecords batching guard, long-lived Snapshot readers, Stats, hybrid Search with provenance and explicit MaxSearchLimit, SearchLexical for FTS-only fallback, SearchRecords for record-level aggregation, Outline for structure reads, ExpandContext for parent/neighbor walks
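A caller-side sketch of the retry / degrade / propagate branching the provider FailureClass taxonomy enables; handleProviderErr and the plain class strings are illustrative stand-ins, and real code would unwrap a *provider.Error with errors.As and switch on its exported class:

```go
package main

import "fmt"

// handleProviderErr sketches branching on the documented FailureClass
// taxonomy. The strings mirror the documented class names; real code
// uses provider's exported constants on an unwrapped *provider.Error.
func handleProviderErr(class string) string {
	switch class {
	case "rate_limit", "timeout", "server", "transport":
		return "retry" // transient: back off (Retry-After is capped) and retry
	case "dependency_unavailable":
		return "degrade" // e.g. fall back to embedder-free SearchLexical
	default: // auth, schema_mismatch, unknown
		return "propagate"
	}
}

func main() {
	fmt.Println(handleProviderErr("rate_limit"))             // retry
	fmt.Println(handleProviderErr("dependency_unavailable")) // degrade
	fmt.Println(handleProviderErr("auth"))                   // propagate
}
```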

Retrieval Evidence And Batch Use

Use OpenSnapshot when issuing many searches or reads against one built index. A Snapshot is safe for concurrent reads; callers own the concurrency limit, so use a bounded worker pool or semaphore sized for the host and workload, then close the snapshot after all searches and context expansions finish.

Snapshot.Records and Snapshot.Sections are all-at-once convenience reads for small snapshots. For large exports, inspections, or embedding-heavy section reads, use Snapshot.WalkRecords and Snapshot.WalkSections to process rows through a single-pass callback without materializing the full result set. The walk methods preserve the same filter shape and stable ordering as the slice-returning methods. They are not resumable page APIs: stopping and calling again starts at the first matching row. Return index.ErrStopWalk from the callback to end a walk successfully; keep callbacks quick because the SQLite read cursor stays open while the callback runs and can delay WAL checkpoints.
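The walk contract can be sketched self-contained; walk and errStopWalk here are illustrative stand-ins for the WalkRecords/WalkSections callback shape and index.ErrStopWalk:

```go
package main

import (
	"errors"
	"fmt"
)

// errStopWalk mirrors index.ErrStopWalk: returning it from the
// callback ends the walk successfully instead of surfacing an error.
var errStopWalk = errors.New("stop walk")

// walk sketches the single-pass callback contract: rows stream
// through fn without materializing the full result set.
func walk(rows []string, fn func(row string) error) error {
	for _, row := range rows {
		if err := fn(row); err != nil {
			if errors.Is(err, errStopWalk) {
				return nil // early stop is success
			}
			return err
		}
	}
	return nil
}

func main() {
	seen := 0
	err := walk([]string{"a", "b", "c"}, func(row string) error {
		seen++
		if row == "b" {
			return errStopWalk // stop after the second row
		}
		return nil
	})
	fmt.Println(seen, err) // 2 <nil>
}
```

Keeping the callback quick matters for the same reason stated above: the read cursor stays open while it runs.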

For durable evidence handles, persist at least:

  • Stats.ContentFingerprint from the opened snapshot, identifying the indexed content generation
  • SearchHit.ChunkID, identifying a chunk only within that snapshot generation
  • SearchHit.Ref plus any caller-needed record metadata or SourceRef
  • SearchHit.SourceSpan / Section.SourceSpan, when present, identifying a caller-defined non-empty half-open source range [Start, End) in a stable unit such as page, line, byte, or char

ChunkID is not a cross-rebuild identity. Before expanding a previously saved hit, reopen the snapshot, compare Stats.ContentFingerprint with the saved value, and rerun search if it differs. SearchHit.Score and HitProvenance are ranking evidence for the query that produced the hit; keep them for audit/debugging, but do not use them as identity fields.
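A minimal sketch of persisting and checking those handles; savedEvidence and stale are hypothetical caller-side helpers, not Stroma types:

```go
package main

import "fmt"

// savedEvidence is a hypothetical persistence shape for the handle
// fields listed above; Stroma does not define this struct.
type savedEvidence struct {
	ContentFingerprint string // Stats.ContentFingerprint at save time
	ChunkID            int64  // valid only within that snapshot generation
	Ref                string // record identity that survives rebuilds
}

// stale reports whether a saved handle must be re-resolved by
// rerunning search: ChunkID is not a cross-rebuild identity, so any
// fingerprint change invalidates it.
func stale(saved savedEvidence, currentFingerprint string) bool {
	return saved.ContentFingerprint != currentFingerprint
}

func main() {
	ev := savedEvidence{ContentFingerprint: "fp-1", ChunkID: 42, Ref: "widget-overview"}
	fmt.Println(stale(ev, "fp-1")) // false: safe to expand context
	fmt.Println(stale(ev, "fp-2")) // true: rerun search first
}
```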

ExpandContext(ctx, hit.ChunkID, opts) returns the hit chunk plus requested parent/neighbor sections in document order. On flat snapshots, parent expansion is a no-op and neighbors are same-record chunks. On hierarchical snapshots, parent expansion follows parent_chunk_id one level and neighbors stay in the same sibling group. A missing chunk returns an empty slice and nil error, which lets wrappers treat stale handles as "not found" after they have already checked the content fingerprint.

Use SearchLexical when callers need an embedder-free lexical fallback, for example while an embedding provider is unavailable. Use SearchRecords when the first retrieval step needs a ranked record list rather than individual chunks. It runs the same SearchParams through Snapshot.Search, so kind/ref/metadata filters and fusion behavior are applied before aggregation. SearchRecords fetches a bounded over-sampled chunk shortlist so multi-chunk records do not consume the whole record result budget. The default aggregation groups by Ref, sums contributing chunk scores, preserves each contributing ChunkID and HitProvenance, and breaks ties deterministically by best chunk score, contribution count, then ref. Use AggregateSearchHitsByRecord when the caller already has chunk hits; that standalone helper is linear in the supplied hit count.
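The default aggregation described above can be sketched in-memory; hit, recordHit, and aggregateByRecord are illustrative stand-ins, while Stroma's SearchRecords applies the same logic over its over-sampled chunk shortlist:

```go
package main

import (
	"fmt"
	"sort"
)

// hit mirrors the SearchHit fields the default aggregation consumes.
type hit struct {
	Ref   string
	Score float64
}

type recordHit struct {
	Ref   string
	Score float64 // sum of contributing chunk scores
	Best  float64 // best single chunk score (first tie-break)
	Count int     // contribution count (second tie-break)
}

// aggregateByRecord sketches the documented default: group by Ref,
// sum scores, tie-break by best chunk score, contribution count,
// then ref.
func aggregateByRecord(hits []hit) []recordHit {
	byRef := map[string]*recordHit{}
	for _, h := range hits {
		r, ok := byRef[h.Ref]
		if !ok {
			r = &recordHit{Ref: h.Ref}
			byRef[h.Ref] = r
		}
		r.Score += h.Score
		r.Count++
		if h.Score > r.Best {
			r.Best = h.Score
		}
	}
	out := make([]recordHit, 0, len(byRef))
	for _, r := range byRef {
		out = append(out, *r)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		if out[i].Best != out[j].Best {
			return out[i].Best > out[j].Best
		}
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Ref < out[j].Ref
	})
	return out
}

func main() {
	out := aggregateByRecord([]hit{{"a", 0.5}, {"b", 0.5}, {"a", 0.25}})
	fmt.Println(out[0].Ref, out[0].Score) // a 0.75
}
```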

Structure-First Retrieval Adapter Pattern

A PageIndex-style retriever should be an adapter built from Stroma primitives, not a second index path inside Stroma. The adapter owns the product choices: which outline nodes to show an agent, which prompt asks it to select nodes, which domain labels appear in tool schemas, and how broad final evidence should be. Stroma owns only the neutral substrate: hierarchy-aware chunking, compact structure reads, hybrid search, record aggregation, source spans, and bounded context expansion.

Build the snapshot with a hierarchy-aware chunk policy when callers need broad section context around small retrievable leaves:

_, err := index.Rebuild(ctx, records, index.BuildOptions{
	Path:     "stroma.db",
	Embedder: embedder,
	// Example token sizes; tune for the corpus and embedder.
	ChunkPolicy: chunk.LateChunkPolicy{
		ParentMaxTokens:    1200,
		ChildMaxTokens:     240,
		ChildOverlapTokens: 40,
	},
})

At query time, keep one Snapshot open and compose the public read APIs:

snap, err := index.OpenSnapshot(ctx, "stroma.db")
if err != nil {
	return err
}
defer func() { _ = snap.Close() }()

records, err := snap.SearchRecords(ctx, index.SnapshotRecordSearchQuery{
	SearchParams: index.SearchParams{
		Text:     query,
		Limit:    40, // chunk shortlist cap before record aggregation
		Embedder: embedder,
	},
	Aggregation: index.RecordAggregationOptions{Limit: 8},
})
if err != nil {
	return err
}
if len(records) == 0 {
	return nil
}

refs := make([]string, 0, len(records))
for _, record := range records {
	refs = append(refs, record.Ref)
}

outline, err := snap.Outline(ctx, index.OutlineQuery{Refs: refs})
if err != nil {
	return err
}
// An adapter may still continue when outline is empty; it owns that policy.

hits, err := snap.Search(ctx, index.SnapshotSearchQuery{
	SearchParams: index.SearchParams{
		Text:     query,
		Limit:    20,
		Refs:     refs,
		Embedder: embedder,
	},
})
if err != nil {
	return err
}
if len(hits) == 0 {
	return nil
}

evidence, err := snap.ExpandContext(ctx, hits[0].ChunkID, index.ContextOptions{
	IncludeParent:  true,
	NeighborWindow: 1,
})
if err != nil {
	return err
}

// outline: compact structure for the adapter/agent to inspect
// evidence: final parent/neighbor context for citations or answers

Use Outline when the adapter needs a compact table of contents before choosing where to spend search or context budget. Use Search when lexical/vector similarity should drive recall directly. Use both when broad semantic recall should select candidate records, then structure should guide which parent or neighbor spans become final evidence. If Outline is unavailable in the caller's supported Stroma range, WalkSections can provide a bounded-memory fallback, but it returns full section content and is therefore a heavier structure view.

Keep prompts, JSON schemas, tool definitions, reranking heuristics, governance semantics, and product-specific source adapters above Stroma. Persist Stats.ContentFingerprint with any saved ChunkID or SourceSpan handles so later expansions can detect stale evidence and rerun retrieval against the current snapshot generation.

Search Filtering

SearchParams.Kinds, SearchParams.Refs, and SearchParams.Metadata restrict candidate records before each retrieval arm ranks and truncates its own shortlist. The vector arm applies those filters inside the vector prefilter stage; when any record filter is present, Stroma scans only chunks that satisfy the record predicates instead of taking a vec0 MATCH k shortlist and filtering it afterward. Ref and kind filters use normal indexed table predicates before vector blobs are scored. The FTS arm applies the same filters in the fts_chunks query before ordering by FTS rank and LIMIT. This avoids under-filled results when a small collection would otherwise be filtered out after unrelated high-ranking chunks consume each arm's candidate window.

Metadata filters are exact string matches against stored record metadata: values within one key are ORed (item_id IN (...)), and multiple keys are ANDed together. Empty kind/ref filter values, empty metadata keys, duplicate metadata keys after trimming, empty metadata value lists, and whitespace-only metadata values reject instead of becoming accidental unfiltered searches. Empty metadata values are valid exact matches. Kinds remains the kind allow-list, and Refs expresses ref IN (...). Current snapshots maintain a generic indexed record_metadata side table for these predicates while preserving records.metadata_json as the canonical payload field exposed through read APIs; older read-only snapshots fall back to JSON-backed evaluation.

SearchParams.OmitMetadata, VectorSearchQuery.OmitMetadata, RecordQuery.OmitMetadata, and SectionQuery.OmitMetadata are an opt-out for callers that do not need the per-row metadata payload. The SQL projection substitutes NULL for records.metadata_json and the per-row json.Unmarshal is skipped, so Metadata arrives as nil instead of an allocated map. Metadata filters still apply because they run inside the SQL plan, not against the returned payload.
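The metadata filter semantics can be sketched as an in-memory analog; matchesMetadata is illustrative, since Stroma evaluates these predicates inside the SQL plan against the indexed record_metadata side table rather than in Go:

```go
package main

import "fmt"

// matchesMetadata mirrors the documented semantics: values within one
// key are ORed (item_id IN (...)) and multiple keys are ANDed.
// Matches are exact string comparisons; "" is a valid value.
func matchesMetadata(record map[string]string, filter map[string][]string) bool {
	for key, allowed := range filter {
		got, ok := record[key]
		if !ok {
			return false // missing key fails the AND
		}
		match := false
		for _, want := range allowed {
			if got == want {
				match = true
				break
			}
		}
		if !match {
			return false
		}
	}
	return true
}

func main() {
	rec := map[string]string{"item_id": "7", "lang": "en"}
	fmt.Println(matchesMetadata(rec, map[string][]string{"item_id": {"7", "8"}}))            // true: OR within key
	fmt.Println(matchesMetadata(rec, map[string][]string{"item_id": {"7"}, "lang": {"de"}})) // false: AND across keys
}
```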

Rebuild Streaming, Update Memory And Batching

Use RebuildFromSource when the corpus lives behind a filesystem, database, or other lazy loader and callers should not build one full []corpus.Record before indexing. A RecordSource returns one corpus.Record at a time; Stroma normalizes each record, chunks and embeds records in bounded internal batches, writes them to the staging snapshot, rejects duplicate refs through the snapshot's primary key, and computes the final content fingerprint from persisted (ref, content_hash) pairs. This keeps record bodies bounded to the current source record plus the current planned batch; Stroma retains that batch's chunks/vectors until it is flushed. Rebuild remains the slice convenience API and still sorts the provided records before delegating to the same write path. RebuildFromSource preserves source order, so source order determines snapshot-local ChunkID assignment; callers that need repeatable chunk IDs across streaming rebuilds should emit records in a stable order.

RebuildFromSource keeps the same atomic staging-file contract as Rebuild: a source, chunking, embedding, or duplicate-ref error discards the staged file and leaves the destination snapshot unchanged. It is not a resumable checkpointing API. The staging SQLite transaction stays open while the source is consumed and embeddings are produced, so callers should keep RecordSource.Next responsive to context cancellation and avoid doing unrelated slow work inside it.

Update chunks, contextualizes, reuse-plans, and embeds added/replaced records before opening its SQLite write transaction. That keeps external embedder latency out of the transaction and preserves stale-plan rollback semantics, but the pre-transaction plan retains each added record's chunks, reuse decisions, and new vectors until the write phase.

Use UpdateFromSource when a caller has a source stream and wants Stroma to diff it against the existing snapshot without permitting implicit deletion. Stored refs missing from the stream reject with index.ErrSourceRemovalsDisabled by default, so a partial changed-record feed cannot accidentally delete the rest of the corpus. New refs are added, refs whose normalized Stroma-owned content hash changed are replaced, and unchanged refs are counted as reused without loading their full stored bodies. Use SyncFromSource (or set UpdateOptions.AllowSourceRemovals) when the stream is the complete desired record set and stored refs missing from the stream should be removed. Duplicate source refs and over-cap changed-record plans fail before embedding or opening the write transaction. Added/replaced records preserve source order for their new snapshot-local ChunkID assignment; callers that need repeatable chunk IDs for changed records should emit changed records in a stable order.

UpdateFromSource consumes record bodies one at a time, but it is not a constant-memory full-corpus diff. It keeps the snapshot's (ref, content_hash) pairs and the full source ref set in memory to detect removals and no-ops. It retains only added/replaced record bodies plus their planned chunks/vectors until commit. MaxPlannedRecords caps that changed-record plan and is checked while the source is consumed, so an over-cap source update fails before embedding and before the write handle is opened. The returned UnchangedRecordCount / UnchangedChunkCount identify fully unchanged source records separately from ReusedRecordCount / ReusedChunkCount, which also include section-level embedding reuse inside changed records.

For large ingests, split added records into caller-sized batches and set UpdateOptions.MaxPlannedRecords to that batch size. A batch above the cap fails before embedding and before the write transaction starts with an error wrapping index.ErrUpdatePlanTooLarge, so callers can retry smaller batches without changing the on-disk snapshot. MaxChunkSections still bounds per-record section expansion.
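The batching pattern above can be sketched with a plain helper; splitBatches is illustrative, and each batch would then go through Update or UpdateFromSource with UpdateOptions.MaxPlannedRecords set to the batch size so an over-cap plan fails before embedding:

```go
package main

import "fmt"

// splitBatches splits refs into caller-sized batches for incremental
// ingest; a batch above MaxPlannedRecords would fail fast with an
// error wrapping index.ErrUpdatePlanTooLarge, leaving the on-disk
// snapshot unchanged for a smaller retry.
func splitBatches(refs []string, size int) [][]string {
	var batches [][]string
	for len(refs) > 0 {
		n := size
		if n > len(refs) {
			n = len(refs)
		}
		batches = append(batches, refs[:n])
		refs = refs[n:]
	}
	return batches
}

func main() {
	b := splitBatches([]string{"a", "b", "c", "d", "e"}, 2)
	fmt.Println(len(b), b[2]) // 3 [e]
}
```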

Example

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/dusk-network/stroma/v3/corpus"
    "github.com/dusk-network/stroma/v3/embed"
    "github.com/dusk-network/stroma/v3/index"
)

func main() {
    ctx := context.Background()

    records := []corpus.Record{
        corpus.NewRecord(
            "widget-overview",
            "Widget Overview",
            "# Overview\n\nWidgets are synchronized in batches.",
        ),
    }

    fixture, err := embed.NewFixture("fixture-demo", 16)
    if err != nil {
        log.Fatal(err)
    }

    if _, err := index.Rebuild(ctx, records, index.BuildOptions{
        Path:     "stroma.db",
        Embedder: fixture,
    }); err != nil {
        log.Fatal(err)
    }

    hits, err := index.Search(ctx, index.SearchQuery{
        Path: "stroma.db",
        SearchParams: index.SearchParams{
            Text:     "synchronized batches",
            Limit:    5,
            Embedder: fixture,
            // Fusion / Reranker / SearchDimension are optional; zero values
            // give hybrid RRF over vector+FTS with the full stored dimension.
        },
    })
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(hits[0].Ref)
}

See the v3.0.0 release notes for the full API surface.

Status

v3.0.0 (current) ships the stable substrate with safer source synchronization, lexical-only fallback search, custom HTTP client injection for OpenAI-backed providers, fenced-code-aware Markdown chunking, hybrid retrieval, pluggable fusion, quantization, matryoshka, contextual retrieval, adaptive chunking, and incremental update. Higher-order products should consume the library rather than re-embedding their own indexing substrate.
