feat(documents): wrap PlainText/PDF/DOCX as adapters and route DocumentParser through the registry#927
Merged
Conversation
This was referenced Apr 23, 2026
709b659 to
c80161a
Compare
This was referenced Apr 24, 2026
27aa9d1 to
8e48270
Compare
Contributor
Author
|
Recovery-plan note: this branch is still the base of the document stack. The current Recommended order: land #975 first for the shared CI cache failure, then land #976, then rerun/rebase #927 before continuing the document stack (#929 -> #936 -> #937 -> #939 -> #940 -> #941 -> #942). |
8e5fd2d to
e85bd24
Compare
c32b6c0 to
bf769f9
Compare
…ntParser through the registry Migrates the three ingress paths already handled by DocumentParser onto the adapter surface introduced in the foundations PR, without changing any user-observable behaviour. parseAll now consults the registry first and falls back to its existing switch for anything an adapter hasn't claimed or has declined — specifically image-only PDFs, which continue to render via the legacy fallback until the layout-aware PDF rework lands. - PlainTextAdapter wraps the existing UTF-8 / ISO-Latin-1 retry path and the 500K-character truncation marker so the legacy behaviour stays byte-identical. - PDFAdapter wraps PDFKit text extraction; it throws emptyContent when there is no text layer so the shim falls through to the legacy image- render path rather than claiming a result it cannot produce. - RichDocumentAdapter wraps NSAttributedString across docx/doc/rtf/html; a single adapter for all four because they share the framework call today, splitting when high-fidelity DOCX lands. - DocumentAdaptersBootstrap registers the three on the shared registry from AppDelegate.applicationDidFinishLaunching exactly once so the shim sees adapters on the first file ingress. - PlainTextRepresentation is the neutral text shape for adapters that cannot yet publish a format-native representation; replaced per-format by Workbook / WordDocument / etc. in later PRs.
Business rationale: Keeping the document-adapter stack green after the main test-gate stabilization protects the file-fidelity harness work from stale CI failures and gives reviewers a clean signal on the actual adapter behavior. Coding rationale: This commit only reshapes lint-triggering lines introduced by the rebase path, preserving the adapter behavior while aligning the touched files with swift-format and swiftlint. The alternative was to suppress rules locally, but small expression rewrites keep the code easier to maintain. Co-authored-by: Michael Meding <mmeding@Michaels-Mac-Studio.local>
bf769f9 to
0d80196
Compare
tpae
approved these changes
May 6, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rebased onto
origin/mainafter #1015. The diff is now the adapter wrappers plus theDocumentParserregistry shim.Business rationale
Osaurus today ingests every attached file through a single
DocumentParserswitch. To expand file-format fidelity (XLSX cells with formulas, DOCX tables, PDF tables, MT940/OFX bank exports), that switch has to move onto a pluggable adapter surface — and it has to do so without a user-visible behaviour change, because chat attachments are a core workflow. This PR is the bridge: it migrates the three file types already handled (plain text, PDF, DOCX-family) onto the adapter surface from #926 and proves theDocumentParsershim routes correctly. It deliberately adds zero fidelity; it's the groundwork every later adapter plugs into.Coding rationale
Three design decisions worth calling out:
DocumentParser.parseAllstays synchronous. The adapter protocol is async (parse(url:sizeLimit:) async throws -> StructuredDocument) because streaming adapters land in the next slice. Both existing call sites (FloatingInputCard,ClipboardService) already wrapparseAllinTask.detached, so the sync-to-async bridge inside the shim (a dispatch-semaphore + detached inner Task) blocks a cooperative thread rather than the main thread. MakingparseAllitself async would touch every ingress call site and is out of scope for this PR.PDFAdapterthrowsDocumentAdapterError.emptyContentwhen a PDF has no text layer; the shim catches that specific error and falls through toparsePDFWithFallback, which still renders each page as PNG. Moving the image path onto the adapter surface needs a typedPDFDocumentrepresentation that isn't in scope until a later fidelity slice.RichDocumentAdaptercovers DOCX, DOC, RTF, RTFD, HTML. They share the sameNSAttributedString(url:options:documentAttributes:)call today, so a single adapter keeps the migration minimal. The high-fidelity DOCX reader can split this along format lines later; at that point this adapter becomes the RTF/HTML-only path.What changed and why
Packages/OsaurusCore/Services/Documents/PlainTextAdapter.swift: wraps the existing UTF-8-first / ISO-Latin-1-retry decoding path fromDocumentParser.parsePlainText, including the 500K-character truncation marker, so the legacy behaviour stays byte-identical.Packages/OsaurusCore/Services/Documents/PDFAdapter.swift: wrapsPDFDocument.page.stringextraction; throws.emptyContentfor text-layerless PDFs so the shim can fall through.Packages/OsaurusCore/Services/Documents/RichDocumentAdapter.swift: wrapsNSAttributedString(url:options:documentAttributes:)across the rich-document formats; one adapter today, splits later.Packages/OsaurusCore/Managers/Documents/DocumentAdaptersBootstrap.swift: idempotent registration entry point.Packages/OsaurusCore/Models/Documents/PlainTextRepresentation.swift: neutral representation for adapters that don't yet publish a format-native shape.Packages/OsaurusCore/Utils/DocumentParser.swift:parseAllnow tries the registry first, falls through to the existing switch on.emptyContent/.unsupportedFormat, and translates otherDocumentAdapterErrorcases into the legacyParseErrorsurface. The translation is conservative —sizeLimitExceededbecomesfileTooLarge,readFailedpreserves the underlying reason, anything else becomes a generic read error.Packages/OsaurusCore/AppDelegate.swift: callsDocumentAdaptersBootstrap.registerBuiltIns()near the start ofapplicationDidFinishLaunchingso the shim sees adapters on the first file ingress.Packages/OsaurusCore/Tests/Documents/*: unit tests for each adapter (text decoding, size-limit refusal, empty-content handling, PDF text extraction, PDF without text layer, RTF/HTML extraction) plus integration tests for the shim (routing through a registered fixture adapter, fall-through on empty content, legacy path still works with no adapter registered, bootstrap registers all three built-ins on an isolated registry).Validation
swift build --package-path Packages/OsaurusCore— passed.swift build --package-path Packages/OsaurusCore -c release— passed.swift test --package-path Packages/OsaurusCore— passed: 1,454 tests in 196 suites. Sandbox integration skipped by its documented env gate.xcrun swift-format lint --stricton every touched Swift file — passed.swiftlint lint --stricton every touched Swift file — passed.git diff --check— passed.Packages/OsaurusCLIgate skipped; this PR does not touch CLI code.Non-scope
Attachment.Kind. Chat attachments still decode as.document(filename, content, fileSize); the new.structuredDocumentcase lands after several real adapters prove the model.DocumentLimitsare internal defaults; a user-facing "Maximum attached document size" preference is deferred.Residual risks
routeThroughRegistryrelies onparseAllalways being called from a task context, not directly from the main actor. Both existing call sites satisfy this. A comment documents the invariant.DocumentAdaptersBootstrap.didRegisterShareduses the samenonisolated(unsafe) + NSLockpattern asOsaurusPaths.overrideRoot; a future migration to an actor orOSAllocatedUnfairLockwould be cleaner but isn't in scope.runBlockinghelper allocates a detachedTaskperparseAllcall. That's one task allocation per file ingress, which is negligible compared to the PDFKit /NSAttributedStringwork it wraps.