feat(documents): foundations for document format adapters #926
Adds the read-side, write-side, and streaming protocols plus a process-wide routing registry, with a typed parse result that carries both a format-native representation and a plain-text fallback. No existing code is touched: the shim that delegates DocumentParser.parseAll to the registry, and the first real adapters, land in follow-up PRs so each slice stays independently reviewable.
- Protocols are split by responsibility so read-only formats (XLS, OFX 1.x) don't carry stub writers and small formats (QIF, MT940) don't carry stub streamers.
- The registry uses an NSLock rather than @MainActor isolation so agent tool calls (which will run off the main actor) can look up an adapter without an await hop.
- Per-format size ceilings are collected in DocumentLimits; the adversarial-input argument for these caps mirrors the file-size gate landed in osaurus-ai#925 for the legacy DocumentParser.
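To make the size-ceiling idea concrete, here is a minimal sketch of a per-format limit check that rejects oversized input before any parsing starts. It is an illustration under assumptions, not the PR's code: `SketchLimits`, `SketchAdapterError`, the `check(byteCount:formatId:)` helper, and every byte value are hypothetical; only the notion of per-format ceilings with a default and the `sizeLimitExceeded` case name come from this PR.

```swift
import Foundation

// Hypothetical error type; only the case name `sizeLimitExceeded` appears in the PR.
enum SketchAdapterError: Error, Equatable {
    case sizeLimitExceeded(formatId: String, limit: Int, actual: Int)
}

// Hypothetical limits table. The byte values are placeholders, not the PR's numbers.
struct SketchLimits {
    var defaultCeiling: Int = 10 * 1024 * 1024
    var perFormat: [String: Int] = ["csv": 50 * 1024 * 1024]

    // Formats without an explicit entry fall back to the conservative default.
    func ceiling(for formatId: String) -> Int {
        perFormat[formatId] ?? defaultCeiling
    }

    // Throws before any parse work happens if the input exceeds its ceiling.
    func check(byteCount: Int, formatId: String) throws {
        let limit = ceiling(for: formatId)
        guard byteCount <= limit else {
            throw SketchAdapterError.sizeLimitExceeded(
                formatId: formatId, limit: limit, actual: byteCount)
        }
    }
}
```

An adapter would presumably call a check like this on the file's byte count before reading it into memory, so an adversarial file fails fast with a typed error instead of exhausting memory mid-parse.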
b4fe121 to 7dceeb5
@tpae when you have time for a larger review — this is the bottom of a 9-PR stage-4 stack that builds out high-fidelity business-file ingestion / emission (XLSX, CSV, PDF tables, agent tools, plugin-adapter surface, chat attachment). #926 adds only the protocols + registry + size-limits + error types; every slice above stacks on it cleanly and each has its own review-sized scope. Stack map — all test-core green, all draft until #926 lands so reviewers aren't asked to look at stacked code before the bottom is settled:
Each PR body has a full business + coding rationale with explicit non-scope and residual-risk sections. Happy to walk through any slice, answer design questions, or split a PR further if review size is a concern. Foundations-only is small, self-contained, and touches no existing code.
Hello @mimeding, thank you for your contributions. Sorry about the delays in reviewing this PR. I believe this will be a great addition to Osaurus. I just have a few concerns, and want to hear your thoughts on the future design. We have these features as separate plugins:
I'm wondering what the best way is for us to handle this. My concern is how many features to add to Osaurus Core and what would be best exposed as a plugin (or added to the plugins). Here are my thoughts:
Happy to hear your overall design thoughts on how to best separate these and also implement them as part of core functionality.
@tpae thanks, I agree with that boundary. My proposed split is:
That means I think #926 is still the right foundation to land in Core: it adds only the neutral contracts and registry, no real adapter behavior and no new user-facing editing capability. It actually makes the plugin boundary cleaner because plugins can register parsers/emitters without hard-coding every format into Core. For the stack after #926, I’m happy to adjust the split:
So my preference is: merge the neutral foundation first, then decide adapter-by-adapter whether each follow-up belongs in Core or a plugin before taking it out of draft.
@tpae I reread your note and tightened the PR body to match the stricter interpretation: Core should carry only the minimal ingestion substrate for quick answers, while richer XLSX/PPTX/vision/domain parsing, editing, emission, and generated-output workflows should be plugin-first. So the intended Core scope for #926 is now only the neutral contracts/registry/limits/error surface. The follow-up stack should be reshaped before coming out of draft: keep only the registry-backed read shim and plugin adapter bridge in Core, and move high-fidelity adapters or write/edit tools into plugins.
@tpae I have now realigned the follow-up plan around your Core/plugin split. The revised direction is:
So the public stack will be reshaped before more PRs come out of draft: Core stays minimal, and high-fidelity format/domain behavior moves into plugins. |
aec516a to 2000b8c
Business rationale
Osaurus today handles attached files through a flat ingestion path that turns rich content into plain text. That is enough for some quick answers, but it does not give Osaurus a clean boundary between minimal Core parsing and richer document workflows implemented by plugins.
This PR adds only the Core substrate for that boundary: document adapter contracts, a typed parse result with text fallback, common size/error limits, and a process-wide registry. It does not add high-fidelity format implementations, editing capability, generated output, or domain-specific behavior.
The review direction is that Core should stay small: document parsing for LLM quick answers belongs in Core; editing, generation, and high-fidelity format or domain workflows belong in plugins such as XLSX, PPTX, vision, finance, engineering, and similar specialty packages. This PR supports that split by giving plugins stable registration points instead of forcing every format into Osaurus Core.
Core / plugin boundary
Core owns only the minimal functionality needed for safe quick-answer ingestion:
Plugins own everything beyond that minimal ingestion layer:
The `DocumentFormatEmitter` protocol is included only as a neutral boundary contract so plugin-provided emitters can have a typed interface later. It is not a Core commitment to ship editing or emission implementations. If maintainers prefer, emitter support can remain unused until the plugin bridge PRs land.
Coding rationale
Three things had to be decided up front:
- `DocumentFormatAdapter` reads, `DocumentFormatEmitter` writes, `DocumentFormatStreamer` streams. Splitting them keeps Core ingestion small while allowing plugins to opt into richer behavior without forcing stub writers or streamers onto every format.
- The registry is lock-guarded rather than `@MainActor`-isolated, which avoids an `await` hop per file lookup. An `NSLock` on an `@unchecked Sendable` final class keeps lookup cheap and lets tests run concurrent registrations without plumbing Task isolation into the fixtures.
- `StructuredDocument` carries a text fallback. Existing chat attachment code consumes `Attachment.Kind.document(content: String, ...)`. Keeping a text view lets Core preserve current quick-answer behavior while plugins attach richer structured representations when available.

This PR is scoped to protocols, registry, representation, limits, and contract tests. The follow-up Core work should be limited to routing the existing read path through this registry and preserving current parsing behavior. High-fidelity adapters and editing workflows should move to plugin-first PRs.
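As a rough illustration of the decisions in this section, the sketch below pairs a read-side protocol whose `canHandle` is separate from `parse` with a lock-guarded registry that resolves lookups synchronously and lets later registrations win. Every name in it (`SketchDocument`, `SketchAdapter`, `SketchRegistry`, `PlainTextAdapter`, the `fileExtension` parameter) is a stand-in invented for this example; the real signatures in the PR may differ.

```swift
import Foundation

// All names below are illustrative stand-ins, not the PR's actual declarations.
struct SketchDocument {
    let formatId: String
    let textFallback: String   // plain-text view for existing chat attachment code
    let representation: Any?   // richer per-format data, when an adapter provides it
}

protocol SketchAdapter {
    var formatId: String { get }
    func canHandle(fileExtension: String) -> Bool   // cheap check, no parsing
    func parse(_ data: Data) throws -> SketchDocument
}

// Lock-guarded routing table: synchronous lookup from any thread, no actor hop.
final class SketchRegistry: @unchecked Sendable {
    private let lock = NSLock()
    private var adapters: [any SketchAdapter] = []

    func register(_ adapter: any SketchAdapter) {
        lock.lock(); defer { lock.unlock() }
        adapters.append(adapter)
    }

    // Later registrations win, so a plugin can override a Core default.
    func adapter(forExtension ext: String) -> (any SketchAdapter)? {
        lock.lock(); defer { lock.unlock() }
        return adapters.last(where: { $0.canHandle(fileExtension: ext) })
    }

    // Drops every adapter registered under the given format id (e.g. on plugin unload).
    func unregisterAll(formatId: String) {
        lock.lock(); defer { lock.unlock() }
        adapters.removeAll { $0.formatId == formatId }
    }
}

// A toy adapter that only fills in the text fallback.
struct PlainTextAdapter: SketchAdapter {
    let formatId: String
    let label: String

    func canHandle(fileExtension: String) -> Bool {
        fileExtension.lowercased() == "txt"
    }

    func parse(_ data: Data) throws -> SketchDocument {
        let text = String(decoding: data, as: UTF8.self)
        return SketchDocument(
            formatId: formatId, textFallback: "\(label): \(text)", representation: nil)
    }
}
```

In this sketch a caller registers a Core adapter, then a plugin adapter for the same extension, and the lookup returns the plugin one; unregistering the plugin's format id makes the Core adapter visible again. The lock keeps each operation short, which is the trade the PR describes: cheap synchronous lookups in exchange for manual responsibility for thread safety.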
What changed and why
- `Packages/OsaurusCore/Models/Documents/DocumentFormatAdapter.swift`: read-side protocol with `canHandle` kept separate from `parse` so the registry can enumerate candidates without paying parse cost on mismatches.
- `Packages/OsaurusCore/Models/Documents/DocumentFormatEmitter.swift`: write-side protocol split from the adapter so emit/write behavior can remain plugin-owned.
- `Packages/OsaurusCore/Models/Documents/DocumentFormatStreamer.swift`: optional streaming surface for formats where whole-file-into-memory is not viable, primarily for plugin adapters.
- `Packages/OsaurusCore/Models/Documents/StructuredDocument.swift`: typed parse result with `textFallback`, plus `StructuredRepresentation` marker and `AnyStructuredRepresentation` type eraser so richer per-format data can stay behind adapter boundaries.
- `Packages/OsaurusCore/Models/Documents/DocumentAdapterError.swift`: shared error surface (`unsupportedFormat`, `sizeLimitExceeded`, `readFailed`, `writeFailed`, `emptyContent`, `cancelled`) so Core and plugins return consistent failures.
- `Packages/OsaurusCore/Models/Documents/DocumentLimits.swift`: per-format byte ceilings with a conservative default; the adversarial-input rationale mirrors the file-size gate landed in #925 (fix(security): harden import/export trust boundaries for attachments and shared artifacts).
- `Packages/OsaurusCore/Managers/Documents/DocumentFormatRegistry.swift`: process-wide routing table with `shared` singleton, lock-serialised registration, later-wins lookup so plugins can override Core defaults, and `unregisterAll(formatId:)` for plugin unload.
- `Packages/OsaurusCore/Tests/Documents/DocumentFormatRegistryTests.swift`: contract tests for routing, later-wins behavior, unregister, emitter selection, streamer lookup, and concurrent registration.
Validation
- `swift test --filter 'DocumentFormatRegistry'`: 7 of 7 passing.
- `swift test` (full `OsaurusCore` package): 890 of 890 passing.
- `xcrun swift-format lint --strict` on every new file: clean.
- `swiftlint` on every new file: clean.
- `swift build --package-path Packages/OsaurusCore`: succeeds.
Follow-up plan
Follow-up work should be reshaped around the stricter boundary:
Non-scope
- `DocumentParser.parseAll` still owns every file-ingress path today.
- No new `Attachment.Kind` case in this PR.
Residual risks
- `DocumentFormatStreamer` has an `associatedtype Element`; the registry stores streamers keyed on `formatId` and returns them via `any DocumentFormatStreamer`. Callers that want the typed element have to know the concrete streamer type and cast. This is acceptable because specialty plugin tools should already know which format they invoke.
- `AnyStructuredRepresentation` is `@unchecked Sendable` because Swift cannot prove the erased `any StructuredRepresentation` conforms. Every concrete representation must keep its `Sendable` promise.
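The first risk is easier to see in code. The sketch below stores a streamer behind an existential keyed on format id and shows the cast a caller needs to recover typed elements. `SketchStreamer`, `LineStreamer`, `elements(from:)`, and `lineCount(in:using:)` are hypothetical names chosen for this example; only the idea of an `Element` associated type and lookup by `formatId` comes from the PR.

```swift
// Hypothetical shapes; only the `Element` associated type idea comes from the PR.
protocol SketchStreamer {
    associatedtype Element
    var formatId: String { get }
    func elements(from text: String) -> [Element]
}

// A toy streamer whose element type is a line of text.
struct LineStreamer: SketchStreamer {
    let formatId = "lines"

    func elements(from text: String) -> [Substring] {
        text.split(separator: "\n")
    }
}

// Storing by format id erases `Element`, so the table holds `any SketchStreamer`.
// A caller that knows the format casts back to the concrete type to get typed elements.
func lineCount(in text: String, using table: [String: any SketchStreamer]) -> Int? {
    guard let streamer = table["lines"] as? LineStreamer else { return nil }
    return streamer.elements(from: text).count
}
```

The cast fails cleanly (returning `nil` here) when a different streamer, or none, is registered under that id, which is why the approach is tolerable for specialty tools that already know which format they target.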