feat(documents): wrap PlainText/PDF/DOCX as adapters and route DocumentParser through the registry #927

Merged
tpae merged 3 commits into osaurus-ai:main from mimeding:feat/document-adapter-wrappers on May 6, 2026

@mimeding
Contributor

@mimeding mimeding commented Apr 23, 2026

Rebased onto origin/main after #1015. The diff is now the adapter wrappers plus the DocumentParser registry shim.

Business rationale

Osaurus today ingests every attached file through a single DocumentParser switch. To expand file-format fidelity (XLSX cells with formulas, DOCX tables, PDF tables, MT940/OFX bank exports), that switch has to move onto a pluggable adapter surface — and it has to do so without a user-visible behaviour change, because chat attachments are a core workflow. This PR is the bridge: it migrates the three file types already handled (plain text, PDF, DOCX-family) onto the adapter surface from #926 and proves the DocumentParser shim routes correctly. It deliberately adds zero fidelity; it's the groundwork every later adapter plugs into.

Coding rationale

Three design decisions worth calling out:

  • DocumentParser.parseAll stays synchronous. The adapter protocol is async (parse(url:sizeLimit:) async throws -> StructuredDocument) because streaming adapters land in the next slice. Both existing call sites (FloatingInputCard, ClipboardService) already wrap parseAll in Task.detached, so the sync-to-async bridge inside the shim (a dispatch-semaphore + detached inner Task) blocks a cooperative thread rather than the main thread. Making parseAll itself async would touch every ingress call site and is out of scope for this PR.
  • The PDF image-fallback stays in the legacy switch. PDFAdapter throws DocumentAdapterError.emptyContent when a PDF has no text layer; the shim catches that specific error and falls through to parsePDFWithFallback, which still renders each page as PNG. Moving the image path onto the adapter surface needs a typed PDFDocument representation that isn't in scope until a later fidelity slice.
  • One RichDocumentAdapter covers DOCX, DOC, RTF, RTFD, HTML. They share the same NSAttributedString(url:options:documentAttributes:) call today, so a single adapter keeps the migration minimal. The high-fidelity DOCX reader can split this along format lines later; at that point this adapter becomes the RTF/HTML-only path.
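
A minimal sketch of the sync-to-async bridge described in the first bullet, assuming a semaphore plus a detached inner Task. The `ResultBox` type and the exact `runBlocking` signature are illustrative assumptions, not the code in this PR:

```swift
import Foundation

/// Hypothetical box that carries the async result back to the blocked caller.
/// Marked @unchecked Sendable because the semaphore orders the write before the read.
final class ResultBox<T>: @unchecked Sendable {
    var result: Result<T, Error>?
}

/// Hypothetical helper: runs `operation` on a detached Task and blocks the
/// calling thread until it finishes. The caller must already be off the main
/// thread, which is the invariant the PR places on parseAll.
func runBlocking<T: Sendable>(
    _ operation: @escaping @Sendable () async throws -> T
) throws -> T {
    let semaphore = DispatchSemaphore(value: 0)
    let box = ResultBox<T>()

    Task.detached {
        do {
            box.result = .success(try await operation())
        } catch {
            box.result = .failure(error)
        }
        // Signal only after the result is stored.
        semaphore.signal()
    }

    // Blocks the current thread until the detached Task signals.
    semaphore.wait()
    return try box.result!.get()
}
```

A call such as `try runBlocking { try await adapter.parse(url: url, sizeLimit: limit) }` would then let a synchronous `parseAll` consume the async adapter protocol. Blocking like this is only acceptable because the call sites already run inside `Task.detached`.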

What changed and why

  • Packages/OsaurusCore/Services/Documents/PlainTextAdapter.swift: wraps the existing UTF-8-first / ISO-Latin-1-retry decoding path from DocumentParser.parsePlainText, including the 500K-character truncation marker, so the legacy behaviour stays byte-identical.
  • Packages/OsaurusCore/Services/Documents/PDFAdapter.swift: wraps PDFKit per-page text extraction (PDFPage.string); throws .emptyContent for PDFs without a text layer so the shim can fall through.
  • Packages/OsaurusCore/Services/Documents/RichDocumentAdapter.swift: wraps NSAttributedString(url:options:documentAttributes:) across the rich-document formats; one adapter today, splits later.
  • Packages/OsaurusCore/Managers/Documents/DocumentAdaptersBootstrap.swift: idempotent registration entry point.
  • Packages/OsaurusCore/Models/Documents/PlainTextRepresentation.swift: neutral representation for adapters that don't yet publish a format-native shape.
  • Packages/OsaurusCore/Utils/DocumentParser.swift: parseAll now tries the registry first, falls through to the existing switch on .emptyContent / .unsupportedFormat, and translates other DocumentAdapterError cases into the legacy ParseError surface. The translation is conservative — sizeLimitExceeded becomes fileTooLarge, readFailed preserves the underlying reason, anything else becomes a generic read error.
  • Packages/OsaurusCore/AppDelegate.swift: calls DocumentAdaptersBootstrap.registerBuiltIns() near the start of applicationDidFinishLaunching so the shim sees adapters on the first file ingress.
  • Packages/OsaurusCore/Tests/Documents/*: unit tests for each adapter (text decoding, size-limit refusal, empty-content handling, PDF text extraction, PDF without text layer, RTF/HTML extraction) plus integration tests for the shim (routing through a registered fixture adapter, fall-through on empty content, legacy path still works with no adapter registered, bootstrap registers all three built-ins on an isolated registry).
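
The fall-through and error-translation rules for the DocumentParser shim can be sketched as a single classification step. The enum shapes below are simplified stand-ins: the `readFailed` payloads, the `malformed` case, the `RegistryOutcome` type, and the generic message string are assumptions for illustration, and the real types may differ:

```swift
import Foundation

// Simplified stand-in for the adapter error type (payloads are assumed).
enum DocumentAdapterError: Error {
    case emptyContent
    case unsupportedFormat
    case sizeLimitExceeded
    case readFailed(reason: String)
    case malformed  // hypothetical example of "anything else"
}

// Simplified stand-in for the legacy error surface.
enum ParseError: Error, Equatable {
    case fileTooLarge
    case readFailed(String)
}

// Hypothetical result of asking the registry first.
enum RegistryOutcome: Equatable {
    case fallThroughToLegacy
    case failure(ParseError)
}

/// Decides whether an adapter error falls through to the legacy switch
/// or is translated into a legacy ParseError.
func classify(_ error: DocumentAdapterError) -> RegistryOutcome {
    switch error {
    case .emptyContent, .unsupportedFormat:
        // Declined by the adapter: let the existing switch handle the file
        // (for example, the PDF image fallback).
        return .fallThroughToLegacy
    case .sizeLimitExceeded:
        return .failure(.fileTooLarge)
    case .readFailed(let reason):
        // Preserve the underlying reason.
        return .failure(.readFailed(reason))
    case .malformed:
        // Any other case becomes a generic read error.
        return .failure(.readFailed("Could not read document"))
    }
}
```

Keeping the decision in one exhaustive switch means a new `DocumentAdapterError` case fails to compile until someone chooses whether it falls through or surfaces as an error.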

Validation

  • swift build --package-path Packages/OsaurusCore — passed.
  • swift build --package-path Packages/OsaurusCore -c release — passed.
  • swift test --package-path Packages/OsaurusCore — passed: 1,454 tests in 196 suites. Sandbox integration skipped by its documented env gate.
  • xcrun swift-format lint --strict on every touched Swift file — passed.
  • swiftlint lint --strict on every touched Swift file — passed.
  • git diff --check — passed.
  • Packages/OsaurusCLI gate skipped; this PR does not touch CLI code.

Non-scope

  • No new format support. CSV is already plain-text today; it stays that way until the streaming CSV/TSV adapter slice. XLSX is the first real fidelity win.
  • No changes to Attachment.Kind. Chat attachments still decode as .document(filename, content, fileSize); the new .structuredDocument case lands after several real adapters prove the model.
  • No plugin host changes. The plugin host API that lets external plugins register their own adapters is a later slice.
  • No size-limit UI. The per-format ceilings in DocumentLimits are internal defaults; a user-facing "Maximum attached document size" preference is deferred.

Residual risks

  • The sync-to-async bridge in routeThroughRegistry relies on parseAll always being called from a task context, not directly from the main actor. Both existing call sites satisfy this. A comment documents the invariant.
  • DocumentAdaptersBootstrap.didRegisterShared uses the same nonisolated(unsafe) + NSLock pattern as OsaurusPaths.overrideRoot; a future migration to an actor or OSAllocatedUnfairLock would be cleaner but isn't in scope.
  • The runBlocking helper allocates a detached Task per parseAll call. That's one task allocation per file ingress, which is negligible compared to the PDFKit / NSAttributedString work it wraps.
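
For reviewers unfamiliar with the `nonisolated(unsafe)` + `NSLock` pattern mentioned above, here is a rough sketch of an idempotent bootstrap. `SketchRegistry`, `BootstrapSketch`, and the string-based registration are hypothetical simplifications; only the flag-plus-lock shape mirrors the description:

```swift
import Foundation

/// Hypothetical minimal registry that only records adapter names.
final class SketchRegistry: @unchecked Sendable {
    private let lock = NSLock()
    private var names: [String] = []

    func register(_ name: String) {
        lock.lock()
        defer { lock.unlock() }
        names.append(name)
    }

    var registeredNames: [String] {
        lock.lock()
        defer { lock.unlock() }
        return names
    }
}

enum BootstrapSketch {
    static let shared = SketchRegistry()
    private static let lock = NSLock()
    // Mutable global state; every access happens while holding `lock`.
    nonisolated(unsafe) private static var didRegisterShared = false

    /// Registers the three built-ins on any registry, e.g. an isolated one in tests.
    static func registerBuiltIns(into registry: SketchRegistry) {
        for name in ["PlainTextAdapter", "PDFAdapter", "RichDocumentAdapter"] {
            registry.register(name)
        }
    }

    /// Idempotent entry point for the shared registry: later calls are no-ops.
    static func registerBuiltIns() {
        lock.lock()
        defer { lock.unlock() }
        guard !didRegisterShared else { return }
        didRegisterShared = true
        registerBuiltIns(into: shared)
    }
}
```

Calling `BootstrapSketch.registerBuiltIns()` twice leaves three entries on the shared registry, while tests can pass a fresh `SketchRegistry` to the `into:` variant without touching the shared flag.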

@mimeding
Contributor Author

Recovery-plan note: this branch is still the base of the document stack. The current test-core failure maps to slowToolReturnsTimeoutEnvelopeBeforeBudgetExpires, and I opened #976 as the narrow fix for the timeout race in ToolRegistry.runToolBody.

Recommended order: land #975 first for the shared CI cache failure, then land #976, then rerun/rebase #927 before continuing the document stack (#929 -> #936 -> #937 -> #939 -> #940 -> #941 -> #942).

@mimeding mimeding force-pushed the feat/document-adapter-wrappers branch from 8e5fd2d to e85bd24 Compare May 1, 2026 02:54
@mimeding mimeding marked this pull request as ready for review May 1, 2026 03:22
@mimeding mimeding force-pushed the feat/document-adapter-wrappers branch 2 times, most recently from c32b6c0 to bf769f9 Compare May 1, 2026 18:28
mimeding and others added 2 commits May 3, 2026 16:53
…ntParser through the registry

Migrates the three ingress paths already handled by DocumentParser onto
the adapter surface introduced in the foundations PR, without changing
any user-observable behaviour. parseAll now consults the registry first
and falls back to its existing switch for anything an adapter hasn't
claimed or has declined — specifically image-only PDFs, which continue
to render via the legacy fallback until the layout-aware PDF rework
lands.

- PlainTextAdapter wraps the existing UTF-8 / ISO-Latin-1 retry path
  and the 500K-character truncation marker so the legacy behaviour
  stays byte-identical.
- PDFAdapter wraps PDFKit text extraction; it throws emptyContent when
  there is no text layer so the shim falls through to the legacy image-
  render path rather than claiming a result it cannot produce.
- RichDocumentAdapter wraps NSAttributedString across docx/doc/rtf/html;
  a single adapter for all four because they share the framework call
  today, splitting when high-fidelity DOCX lands.
- DocumentAdaptersBootstrap registers the three on the shared registry
  from AppDelegate.applicationDidFinishLaunching exactly once so the
  shim sees adapters on the first file ingress.
- PlainTextRepresentation is the neutral text shape for adapters that
  cannot yet publish a format-native representation; replaced per-format
  by Workbook / WordDocument / etc. in later PRs.

Business rationale: Keeping the document-adapter stack green after the main test-gate stabilization protects the file-fidelity harness work from stale CI failures and gives reviewers a clean signal on the actual adapter behavior.

Coding rationale: This commit only reshapes lint-triggering lines introduced by the rebase path, preserving the adapter behavior while aligning the touched files with swift-format and swiftlint. The alternative was to suppress rules locally, but small expression rewrites keep the code easier to maintain.

Co-authored-by: Michael Meding <mmeding@Michaels-Mac-Studio.local>
@mimeding mimeding force-pushed the feat/document-adapter-wrappers branch from bf769f9 to 0d80196 Compare May 3, 2026 20:27
@tpae tpae merged commit 59ca610 into osaurus-ai:main May 6, 2026
5 checks passed
