feat(documents): foundations for document format adapters #926

Merged
tpae merged 6 commits into osaurus-ai:main from mimeding:feat/document-format-foundations on Apr 28, 2026

@mimeding
Contributor

@mimeding mimeding commented Apr 23, 2026

Business rationale

Osaurus today handles attached files through a flat ingestion path that turns rich content into plain text. That is enough for some quick answers, but it does not give Osaurus a clean boundary between minimal Core parsing and richer document workflows implemented by plugins.

This PR adds only the Core substrate for that boundary: document adapter contracts, a typed parse result with text fallback, common size/error limits, and a process-wide registry. It does not add high-fidelity format implementations, editing capability, generated output, or domain-specific behavior.

The review direction is that Core should stay small: document parsing for LLM quick answers belongs in Core; editing, generation, and high-fidelity format or domain workflows belong in plugins such as XLSX, PPTX, vision, finance, engineering, and similar specialty packages. This PR supports that split by giving plugins stable registration points instead of forcing every format into Osaurus Core.

Core / plugin boundary

Core owns only the minimal functionality needed for safe quick-answer ingestion:

  • Stable parser/adapter contracts.
  • Typed parse result and text fallback.
  • Shared size limits and error types.
  • Registry/routing for Core and plugin-provided parsers.
  • Compatibility with the existing attachment read path.

Plugins own everything beyond that minimal ingestion layer:

  • Editing and generated output.
  • XLSX/PPTX workflows.
  • OCR, vision, and layout-heavy extraction.
  • High-fidelity spreadsheet, PDF, CAD, GIS, geoscience, finance, marketing, statistics, and other domain parsing.
  • Domain-specific agent tools, dashboards, reports, and writeback.
  • Large optional dependencies and format-specific UX.

The DocumentFormatEmitter protocol is included only as a neutral boundary contract so plugin-provided emitters can have a typed interface later. It is not a Core commitment to ship editing or emission implementations. If maintainers prefer, emitter support can remain unused until the plugin bridge PRs land.

Coding rationale

Three things had to be decided up front:

  • Protocols are split by responsibility. DocumentFormatAdapter reads, DocumentFormatEmitter writes, DocumentFormatStreamer streams. Splitting them keeps Core ingestion small while allowing plugins to opt into richer behavior without forcing stub writers or streamers onto every format.
  • The registry is lock-based, not actor-isolated. The agent tool surface can run off the main actor and should not pay an await hop per file lookup. An NSLock on an @unchecked Sendable final class keeps lookup cheap and lets tests run concurrent registrations without plumbing Task isolation into the fixtures.
  • StructuredDocument carries a text fallback. Existing chat attachment code consumes Attachment.Kind.document(content: String, ...). Keeping a text view lets Core preserve current quick-answer behavior while plugins attach richer structured representations when available.
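
The protocol split and the text fallback described above can be sketched as follows. This is a minimal illustration, not the PR's actual source: the type names come from the PR description, but every method signature, the `PlainTextAdapter` type, and the `legacyDocumentContent(from:)` helper are assumptions made for the example.

```swift
import Foundation

// Marker for format-native data. The name comes from the PR; the empty body is assumed.
protocol StructuredRepresentation {}

// Typed parse result. `textFallback` is always present so string-based
// consumers keep working; `representation` is optional richer data.
struct StructuredDocument {
    let formatId: String
    let textFallback: String
    let representation: (any StructuredRepresentation)?
}

// Read side (hypothetical requirements).
protocol DocumentFormatAdapter {
    var formatId: String { get }
    func canHandle(url: URL) -> Bool
    func parse(url: URL) throws -> StructuredDocument
}

// Write side (hypothetical requirements), adopted only by formats that can emit.
protocol DocumentFormatEmitter {
    var formatId: String { get }
    func emit(_ document: StructuredDocument, to url: URL) throws
}

// A read-only adapter: it conforms to the read protocol alone,
// so it needs no stub writer or streamer.
struct PlainTextAdapter: DocumentFormatAdapter {
    let formatId = "plaintext"

    func canHandle(url: URL) -> Bool {
        url.pathExtension.lowercased() == "txt"
    }

    func parse(url: URL) throws -> StructuredDocument {
        let text = try String(contentsOf: url, encoding: .utf8)
        return StructuredDocument(formatId: formatId, textFallback: text, representation: nil)
    }
}

// Stand-in for the existing attachment path, which only needs a String.
func legacyDocumentContent(from document: StructuredDocument) -> String {
    document.textFallback
}
```

A consumer that only understands plain text calls `legacyDocumentContent(from:)` and never inspects `representation`, which is how the current quick-answer behavior could stay unchanged while plugins attach richer data.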

This PR is scoped to protocols, registry, representation, limits, and contract tests. The follow-up Core work should be limited to routing the existing read path through this registry and preserving current parsing behavior. High-fidelity adapters and editing workflows should move to plugin-first PRs.

What changed and why

  • Packages/OsaurusCore/Models/Documents/DocumentFormatAdapter.swift: read-side protocol with canHandle kept separate from parse so the registry can enumerate candidates without paying parse cost on mismatches.
  • Packages/OsaurusCore/Models/Documents/DocumentFormatEmitter.swift: write-side protocol split from the adapter so emit/write behavior can remain plugin-owned.
  • Packages/OsaurusCore/Models/Documents/DocumentFormatStreamer.swift: optional streaming surface for formats where whole-file-into-memory is not viable, primarily for plugin adapters.
  • Packages/OsaurusCore/Models/Documents/StructuredDocument.swift: typed parse result with textFallback, plus StructuredRepresentation marker and AnyStructuredRepresentation type eraser so richer per-format data can stay behind adapter boundaries.
  • Packages/OsaurusCore/Models/Documents/DocumentAdapterError.swift: shared error surface (unsupportedFormat, sizeLimitExceeded, readFailed, writeFailed, emptyContent, cancelled) so Core and plugins return consistent failures.
  • Packages/OsaurusCore/Models/Documents/DocumentLimits.swift: per-format byte ceilings with a conservative default; the adversarial-input rationale mirrors the file-size gate that landed in #925 (fix(security): harden import/export trust boundaries for attachments and shared artifacts).
  • Packages/OsaurusCore/Managers/Documents/DocumentFormatRegistry.swift: process-wide routing table with shared singleton, lock-serialised registration, later-wins lookup so plugins can override Core defaults, and unregisterAll(formatId:) for plugin unload.
  • Packages/OsaurusCore/Tests/Documents/DocumentFormatRegistryTests.swift: contract tests for routing, later-wins behavior, unregister, emitter selection, streamer lookup, and concurrent registration.
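
To make the registry behavior listed above concrete, here is a small sketch of lock-serialised registration, later-wins lookup, and `unregisterAll(formatId:)`. It is written under assumptions: only the type name, the `shared` singleton, the use of `NSLock`, and `unregisterAll(formatId:)` come from the PR description. The `register(adapter:)`, `adapter(for:)`, and `count` members, the trimmed-down adapter protocol, and `ExtensionAdapter` are hypothetical.

```swift
import Foundation

// Trimmed-down read protocol (assumed shape; parse is omitted for brevity).
protocol DocumentFormatAdapter: Sendable {
    var formatId: String { get }
    func canHandle(url: URL) -> Bool
}

final class DocumentFormatRegistry: @unchecked Sendable {
    static let shared = DocumentFormatRegistry()

    private let lock = NSLock()
    private var adapters: [any DocumentFormatAdapter] = []  // guarded by `lock`

    var count: Int {
        lock.lock(); defer { lock.unlock() }
        return adapters.count
    }

    func register(adapter: any DocumentFormatAdapter) {
        lock.lock(); defer { lock.unlock() }
        adapters.append(adapter)
    }

    // Later-wins: the most recently registered matching adapter is returned.
    // Only the cheap canHandle check runs here; nothing is parsed.
    func adapter(for url: URL) -> (any DocumentFormatAdapter)? {
        lock.lock()
        let snapshot = adapters  // copy under the lock, scan outside it
        lock.unlock()
        return snapshot.last(where: { $0.canHandle(url: url) })
    }

    // Removes every adapter registered under `formatId`, e.g. on plugin unload.
    func unregisterAll(formatId: String) {
        lock.lock(); defer { lock.unlock() }
        adapters.removeAll { $0.formatId == formatId }
    }
}

// Hypothetical adapter that matches on file extension only.
struct ExtensionAdapter: DocumentFormatAdapter {
    let formatId: String
    let fileExtension: String
    func canHandle(url: URL) -> Bool {
        url.pathExtension.lowercased() == fileExtension
    }
}

// Registers `count` adapters from multiple threads at once.
func registerConcurrently(into registry: DocumentFormatRegistry, count: Int) {
    DispatchQueue.concurrentPerform(iterations: count) { index in
        registry.register(adapter: ExtensionAdapter(formatId: "fmt.\(index)", fileExtension: "x\(index)"))
    }
}
```

Registering a Core adapter and then a plugin adapter for the same extension makes lookup return the plugin's; calling `unregisterAll(formatId:)` with the plugin's id restores the Core one. Copying the array under the lock and scanning the copy keeps third-party `canHandle` code from running while the lock is held, which is one possible design and not necessarily what the PR does.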

Validation

  • swift test --filter 'DocumentFormatRegistry' — 7 of 7 passing.
  • swift test (full OsaurusCore package) — 890 of 890 passing.
  • xcrun swift-format lint --strict on every new file — clean.
  • swiftlint on every new file — clean.
  • swift build --package-path Packages/OsaurusCore — succeeds.
  • GitHub CI is green on this PR.

Follow-up plan

Follow-up work should be reshaped around the stricter boundary:

  • Core follow-ups should be limited to the registry-backed read shim, compatibility with existing quick-answer parsing, and the plugin adapter registration bridge.
  • Existing draft PRs that add high-fidelity XLSX, PDF table extraction, workbook write tools, emitters, or domain behavior should be moved or redesigned as plugin-first work.
  • Structured attachment surfacing should come after the minimal Core read path and plugin bridge are stable.

Non-scope

  • No behavior change to current attachment parsing. DocumentParser.parseAll still owns every file-ingress path today.
  • No high-fidelity adapters in this PR.
  • No XLSX/PPTX/vision/domain implementation in this PR.
  • No editing, mutation, generated output, writeback, or artifact authoring in Core.
  • No plugin host change in this PR. Plugin-provided adapters ride later registration/API PRs.
  • No new Attachment.Kind case in this PR.

Residual risks

  • DocumentFormatStreamer has an associatedtype Element; the registry stores streamers keyed on formatId and returns them via any DocumentFormatStreamer. Callers that want the typed element have to know the concrete streamer type and cast. This is acceptable because specialty plugin tools should already know which format they invoke.
  • AnyStructuredRepresentation is @unchecked Sendable because Swift cannot prove the erased any StructuredRepresentation conforms. Every concrete representation must keep its Sendable promise.
  • Later-wins registration means a plugin can override a Core default. That is intentional for plugin extensibility, but plugin lifecycle tests should pin cleanup/unload behavior in the bridge PR.
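
The first two risks can be illustrated with a short sketch. It assumes a simplified streamer protocol: the names `DocumentFormatStreamer`, `Element`, `StructuredRepresentation`, and `AnyStructuredRepresentation` come from the PR, while `elements(in:)`, `LineStreamer`, `rows(from:text:)`, `WordCount`, and `unwrap(as:)` are hypothetical. The streamer here works on an in-memory string for brevity; a real one would presumably read incrementally.

```swift
import Foundation

protocol DocumentFormatStreamer {
    associatedtype Element
    var formatId: String { get }
    func elements(in text: String) -> AnyIterator<Element>
}

// Hypothetical streamer that yields one comma-separated row per line.
struct LineStreamer: DocumentFormatStreamer {
    let formatId = "lines"

    func elements(in text: String) -> AnyIterator<[String]> {
        var lines = text.split(separator: "\n").makeIterator()
        return AnyIterator {
            guard let line = lines.next() else { return nil }
            return line.split(separator: ",").map { String($0) }
        }
    }
}

// A registry can only hand back `any DocumentFormatStreamer`. To recover the
// typed Element, the caller must know the concrete type and cast to it.
func rows(from streamer: any DocumentFormatStreamer, text: String) -> [[String]]? {
    guard let typed = streamer as? LineStreamer else { return nil }
    return Array(typed.elements(in: text))
}

protocol StructuredRepresentation {}

// Type eraser. The compiler cannot verify that the boxed value is Sendable,
// so the conformance is unchecked and each concrete type must honor it.
struct AnyStructuredRepresentation: @unchecked Sendable {
    let base: any StructuredRepresentation

    init(_ base: any StructuredRepresentation) {
        self.base = base
    }

    func unwrap<T: StructuredRepresentation>(as type: T.Type) -> T? {
        base as? T
    }
}

// Hypothetical concrete representation made only of value types.
struct WordCount: StructuredRepresentation {
    let words: Int
}
```

The cast in `rows(from:text:)` fails with `nil` when the stored streamer is of a different concrete type, which is the cost the PR accepts in exchange for keeping the registry untyped.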

Adds the read-side, write-side, and streaming protocols plus a
process-wide routing registry, with a typed parse result that carries
both a format-native representation and a plain-text fallback. No
existing code is touched — the shim that delegates DocumentParser.parseAll
to the registry, and the first real adapters, land in follow-up PRs so
each slice stays independently reviewable.

- Protocols split by responsibility so read-only formats (XLS, OFX 1.x)
  don't carry stub writers and small formats (QIF, MT940) don't carry
  stub streamers.
- Registry uses an NSLock rather than @MainActor isolation so agent
  tool calls (which will run off the main actor) can look up an adapter
  without an await hop.
- Per-format size ceilings collected in DocumentLimits; the adversarial-
  input argument for these caps mirrors the file-size gate landed in
  osaurus-ai#925 for the legacy DocumentParser.
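
A size gate of the kind described in the last bullet might look like the sketch below. The error case names are the ones listed in the PR description; the associated values on `sizeLimitExceeded`, the 10 MiB default, and the `maxBytes(for:)` and `check(byteCount:formatId:)` members are assumptions for illustration.

```swift
import Foundation

// Shared error surface. Case names follow the PR; the associated values are assumed.
enum DocumentAdapterError: Error, Equatable {
    case unsupportedFormat
    case sizeLimitExceeded(actual: Int, limit: Int)
    case readFailed
    case writeFailed
    case emptyContent
    case cancelled
}

struct DocumentLimits {
    // Conservative default used when a format has no explicit ceiling (value assumed).
    var defaultMaxBytes: Int = 10 * 1024 * 1024
    // Per-format overrides keyed by format id.
    var perFormat: [String: Int] = [:]

    func maxBytes(for formatId: String) -> Int {
        perFormat[formatId] ?? defaultMaxBytes
    }

    // Rejects oversized input before any parsing work starts.
    func check(byteCount: Int, formatId: String) throws {
        let limit = maxBytes(for: formatId)
        guard byteCount <= limit else {
            throw DocumentAdapterError.sizeLimitExceeded(actual: byteCount, limit: limit)
        }
    }
}
```

Checking the byte count before parsing means an adversarially large file fails fast with a typed error that Core and plugins can report the same way.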
@mimeding
Contributor Author

@tpae when you have time for a larger review — this is the bottom of a 9-PR stage-4 stack that builds out high-fidelity business-file ingestion / emission (XLSX, CSV, PDF tables, agent tools, plugin-adapter surface, chat attachment). #926 adds only the protocols + registry + size-limits + error types; every slice above stacks on it cleanly and each has its own review-sized scope.

Stack map — all test-core green, all draft until #926 lands so reviewers aren't asked to look at stacked code before the bottom is settled:

| PR | Title | LOC (prod + tests) |
| --- | --- | --- |
| #926 | Foundations: protocols + registry + StructuredDocument | ~330 + ~135 |
| #927 | PlainText / PDF / DOCX wrappers + DocumentParser shim | ~490 + ~265 |
| #929 | XLSX read via CoreXLSX → typed Workbook | ~295 + ~155 |
| #936 | XLSX write via libxlsxwriter (round-trip closes here) | ~260 + ~240 |
| #937 | read_workbook / read_workbook_cell / write_workbook agent tools | ~590 + ~170 |
| #939 | CSV / TSV streaming adapter with typed CSVTable | ~470 + ~290 |
| #940 | Layout-aware PDF table extraction | ~240 + ~240 |
| #941 | Plugin register_parser / register_emitter surface | ~270 + ~315 |
| #942 | Attachment.Kind.structuredDocument case + Codable back-compat | ~250 + ~180 |

Each PR body has a full business + coding rationale with explicit non-scope and residual-risk sections. Happy to walk through any slice, answer design questions, or split a PR further if review size is a concern. Foundations-only is small, self-contained, and touches no existing code.

@tpae
Contributor

tpae commented Apr 27, 2026

Hello @mimeding, thank you for your contributions.

Sorry about the delays in reviewing this PR. I believe this will be a great addition to Osaurus. I just have a few concerns, and want to hear your thoughts on the future design.

We have these features as separate plugins:

I'm wondering what the best way is for us to handle this. My concern is how many features to add to Osaurus Core and what would be best exposed as a plugin (or added to the existing plugins).

Here are my thoughts:

  • Document parsing (ingestion for LLM to read) would be an excellent core feature and functionality (for quick answers).
  • Editing capabilities should belong to a plugin.

Happy to hear your overall design thoughts on how to best separate and also implement as part of core functionality.

@mimeding
Contributor Author

@tpae thanks, I agree with that boundary.

My proposed split is:

  • Core should own document ingestion primitives: typed document representation, parser/adapter contracts, size/error limits, registry/routing, and the read path that lets the LLM quickly understand attached files.
  • Core can include small read-only adapters where they replace or wrap existing behavior and are useful for quick answers.
  • Plugins should own editing/emission-heavy workflows: workbook mutation, generated XLSX/PPTX output, complex document authoring, and format-specific tool UX.
  • Plugins should also own heavier or optional capabilities such as advanced spreadsheet operations, PPTX generation, OCR/vision pipelines, and domain-specific extractors.

That means I think #926 is still the right foundation to land in Core: it adds only the neutral contracts and registry, no real adapter behavior and no new user-facing editing capability. It actually makes the plugin boundary cleaner because plugins can register parsers/emitters without hard-coding every format into Core.

For the stack after #926, I’m happy to adjust the split:

So my preference is: merge the neutral foundation first, then decide adapter-by-adapter whether each follow-up belongs in Core or a plugin before taking it out of draft.

@mimeding
Contributor Author

@tpae I reread your note and tightened the PR body to match the stricter interpretation: Core should carry only the minimal ingestion substrate for quick answers, while richer XLSX/PPTX/vision/domain parsing, editing, emission, and generated-output workflows should be plugin-first.

So the intended Core scope for #926 is now only the neutral contracts/registry/limits/error surface. The follow-up stack should be reshaped before coming out of draft: keep only the registry-backed read shim and plugin adapter bridge in Core, and move high-fidelity adapters or write/edit tools into plugins.

@mimeding
Contributor Author

@tpae I have now realigned the follow-up plan around your Core/plugin split.

The revised direction is:

So the public stack will be reshaped before more PRs come out of draft: Core stays minimal, and high-fidelity format/domain behavior moves into plugins.

tpae previously approved these changes Apr 28, 2026
mimeding force-pushed the feat/document-format-foundations branch from aec516a to 2000b8c on April 28, 2026 21:38
tpae merged commit 0125813 into osaurus-ai:main on Apr 28, 2026
5 checks passed