feat(chat): Attachment.Kind.structuredDocument surfaces typed documents #942

Draft
mimeding wants to merge 12 commits into osaurus-ai:main from mimeding:feat/structured-document-attachment

Conversation

mimeding (Contributor) commented Apr 24, 2026

Stacked on #941, which carries #940/#939/#937/#936/#929/#927 until the lower document stack merges. This branch has been rebased onto current origin/main (fd2ffece, #1015). Reviewers should focus on structured chat attachments; stale unrelated CI/tool-timeout commits were intentionally dropped during rebase.

Business rationale

Structured workbook, table, and document representations only matter if they can move through chat without collapsing back to plain text. Surfacing Attachment.Kind.structuredDocument lets the agent preserve document semantics across the harness, so file-fidelity work becomes usable in ordinary conversations instead of staying trapped inside parser internals.

Coding rationale

The attachment enum grows with a typed structured-document case rather than overloading existing file/document cases, keeping previews and raw files separate from parsed semantic payloads. Spillover follows the existing attachment persistence path so large structured payloads do not bloat chat JSON. The structured attachment path composes with the document registry stack already present in this branch; it does not add format-specific branches to chat code. Rebase cleanup is limited to touched-file lint fixes that were already proven in lower document PRs, plus the one UI switch site that now routes structured documents through the existing document chip.

What changed

Validation

  • git fetch origin && git rebase origin/main - completed after resolving Package.resolved and the standard FolderTools.swift stack conflict; final pre-push rebase reported the branch up to date.
  • swift build --package-path Packages/OsaurusCore - passed.
  • swift build --package-path Packages/OsaurusCore -c release - passed. A first release-build attempt failed with No space left on device; after removing only disposable .build artifacts from older completed worktrees, the same command passed.
  • swift test --package-path Packages/OsaurusCore - passed: 1,523 tests in 204 suites, with sandbox integration tests skipped by their normal environment gate.
  • xcrun swift-format lint --strict on every touched Swift file - passed.
  • swiftlint lint --strict on every touched Swift file - passed file-by-file.
  • git diff --check origin/main...HEAD - passed.
  • CLI gate skipped because this slice does not touch Packages/OsaurusCLI.
  • Workspace/Xcode build skipped because this slice does not touch Xcode targets or project settings.

Non-scope

Residual risks

Reviewers will see lower document-stack commits in the GitHub diff until #927/#929/#936/#937/#939/#940/#941 merge. The structured attachment case depends on those typed document representations retaining their current JSON shape.

mimeding force-pushed the feat/structured-document-attachment branch 3 times, most recently from 59412bd to 0202b7a, on April 28, 2026 22:07
mimeding force-pushed the feat/structured-document-attachment branch from 0202b7a to f748be9 on May 1, 2026 02:54
mimeding marked this pull request as ready for review May 1, 2026 03:22
mimeding force-pushed the feat/structured-document-attachment branch from f748be9 to 3cf598d on May 1, 2026 04:40
mimeding and others added 12 commits May 3, 2026 21:00
…ntParser through the registry

Migrates the three ingress paths already handled by DocumentParser onto
the adapter surface introduced in the foundations PR, without changing
any user-observable behaviour. parseAll now consults the registry first
and falls back to its existing switch for anything an adapter hasn't
claimed or has declined — specifically image-only PDFs, which continue
to render via the legacy fallback until the layout-aware PDF rework
lands.

- PlainTextAdapter wraps the existing UTF-8 / ISO-Latin-1 retry path
  and the 500K-character truncation marker so the legacy behaviour
  stays byte-identical.
- PDFAdapter wraps PDFKit text extraction; it throws emptyContent when
  there is no text layer so the shim falls through to the legacy image-
  render path rather than claiming a result it cannot produce.
- RichDocumentAdapter wraps NSAttributedString across docx/doc/rtf/html.
  One adapter covers all four because they share the framework call
  today; it will be split when high-fidelity DOCX lands.
- DocumentAdaptersBootstrap registers the three on the shared registry
  from AppDelegate.applicationDidFinishLaunching exactly once so the
  shim sees adapters on the first file ingress.
- PlainTextRepresentation is the neutral text shape for adapters that
  cannot yet publish a format-native representation; replaced per-format
  by Workbook / WordDocument / etc. in later PRs.
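The registry-first routing with a legacy fallback described in this commit can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `SketchAdapter`, `SketchRegistry`, and `parseWithFallback` are hypothetical names, and the later-wins lookup is an assumption borrowed from the CSV commit further down the stack.

```swift
// Hypothetical sketch: none of these types are the PR's real API.
enum SketchParseError: Error {
    case emptyContent   // adapter declines, e.g. a PDF without a text layer
}

protocol SketchAdapter {
    var id: String { get }
    func canHandle(fileExtension: String) -> Bool
    func parse(_ bytes: [UInt8]) throws -> String
}

struct SketchPlainTextAdapter: SketchAdapter {
    let id = "plaintext"
    func canHandle(fileExtension: String) -> Bool {
        ["txt", "md"].contains(fileExtension)
    }
    func parse(_ bytes: [UInt8]) throws -> String {
        String(decoding: bytes, as: UTF8.self)
    }
}

struct SketchPDFAdapter: SketchAdapter {
    let id = "pdf"
    func canHandle(fileExtension: String) -> Bool { fileExtension == "pdf" }
    func parse(_ bytes: [UInt8]) throws -> String {
        // Stand-in for "no text layer": an empty payload makes the adapter decline.
        if bytes.isEmpty { throw SketchParseError.emptyContent }
        return String(decoding: bytes, as: UTF8.self)
    }
}

final class SketchRegistry {
    private var adapters: [SketchAdapter] = []

    func register(_ adapter: SketchAdapter) { adapters.append(adapter) }

    // Assumed later-wins routing: the most recently registered match is used.
    func adapter(for fileExtension: String) -> SketchAdapter? {
        adapters.last(where: { $0.canHandle(fileExtension: fileExtension) })
    }
}

/// Consults the registry first; falls back to the legacy path when no adapter
/// claims the extension or the claiming adapter declines with emptyContent.
func parseWithFallback(
    _ bytes: [UInt8],
    fileExtension: String,
    registry: SketchRegistry,
    legacy: ([UInt8], String) throws -> String
) throws -> String {
    if let adapter = registry.adapter(for: fileExtension) {
        do {
            return try adapter.parse(bytes)
        } catch SketchParseError.emptyContent {
            // Declined: fall through to the legacy path below.
        }
    }
    return try legacy(bytes, fileExtension)
}
```

Only the "declined" error falls through in this sketch; any other adapter error propagates to the caller, so a genuine parse failure is not silently replaced by a legacy result.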

First real-fidelity document adapter. Reads .xlsx into a typed Workbook
representation carrying sheet names, cells with formula source strings,
merged-range references, shared strings, and cell types (number, shared
string, inline string, boolean). The text fallback renders each sheet
as a tab-separated table so callers still on the legacy
Attachment.Kind.document path see something readable.

The adapter deliberately does NOT call CoreXLSX's parseStyles() — that
entry point crashes on openpyxl-generated workbooks because the
library's PatternFill.patternType is non-optional while Excel's default
empty pattern omits the attribute. Everything we surface today is
style-independent; lifting that limitation (number formats, column
widths, dates stored as styled numbers) lives in a follow-up slice
behind a hand-rolled styles fallback.

- Package.swift: CoreXLSX 0.14.2 dependency for the core target,
  testTarget resource declaration for the xlsxwriter-produced fixture.
- Workbook / Sheet / Row / Cell / CellValue / CellRange: the typed
  intermediate that both the XLSX read path and the eventual XLSX write
  emitter round-trip through.
- XLSXAdapter: the actual CoreXLSX → Workbook translator + markdown-
  style text fallback.
- DocumentAdaptersBootstrap: registers XLSXAdapter alongside PlainText /
  PDF / RichDocument, so DocumentParser.parseAll now routes .xlsx
  through the registry instead of throwing unsupportedFormat.
- Tests/Documents/Fixtures/xlsx/sample.xlsx: 5.9 KB fixture with two
  sheets, a SUM formula, a merged range (A5:B5), shared strings, and
  explicit booleans. Exercises the parse paths for each fidelity feature.
- XLSXAdapterTests: 7 tests pinning format routing, sheet/cell
  structure, formulas, merged ranges, shared strings, booleans, text
  fallback formatting, and size-limit refusal.
- DocumentParserShimTests: expands the bootstrap assertion to include
  "xlsx" alongside the three existing adapter ids.
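To make the typed intermediate concrete, here is a minimal sketch of a workbook shape with a tab-separated text fallback. The type and property names (`SketchWorkbook`, `SketchCell`, and so on) are hypothetical; the PR's actual Workbook / Sheet / Row / Cell types may be shaped differently, and the sheet-name header line is an assumption.

```swift
// Hypothetical shapes; the PR's Workbook / Sheet / Cell types may differ.
enum SketchCellValue: Equatable {
    case number(Double)
    case sharedString(String)
    case inlineString(String)
    case boolean(Bool)
}

struct SketchCell {
    let ref: String             // A1-style reference, e.g. "B2"
    let value: SketchCellValue
    let formula: String?        // formula source string, if the cell has one
}

struct SketchSheet {
    let name: String
    let rows: [[SketchCell]]
    let mergedRanges: [String]  // e.g. "A5:B5"
}

struct SketchWorkbook {
    let sheets: [SketchSheet]

    /// Renders each sheet as a name line followed by tab-separated rows.
    var textFallback: String {
        sheets.map { sheet -> String in
            let lines = sheet.rows.map { row in
                row.map { SketchWorkbook.display($0.value) }.joined(separator: "\t")
            }
            return ([sheet.name] + lines).joined(separator: "\n")
        }.joined(separator: "\n\n")
    }

    static func display(_ value: SketchCellValue) -> String {
        switch value {
        case .number(let n):
            // Print whole numbers without a trailing ".0"; guard against
            // values that cannot be converted to Int safely.
            let isWhole = n.isFinite && abs(n) < 1e15 && n == n.rounded()
            return isWhole ? String(Int(n)) : String(n)
        case .sharedString(let s), .inlineString(let s):
            return s
        case .boolean(let b):
            return b ? "TRUE" : "FALSE"
        }
    }
}
```

Keeping the formula source next to the cached value is what would let a later write path re-emit the formula rather than a frozen number.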

Pairs with XLSXAdapter so agents can ingest a workbook, modify the
typed Workbook in-process, and emit it back as a fresh .xlsx attachment.
libxlsxwriter ships a first-party Swift Package as a pure C SwiftPM
target, so no XCFramework / vendored C source is needed in osaurus
itself — it's just a dependency add.

- Package.swift: libxlsxwriter 1.2.4 dependency for the core target.
- XLSXEmitter: Workbook -> .xlsx via libxlsxwriter. Parses A1 cell
  references into 0-indexed row/col, dispatches strings / numbers /
  booleans / formulas to the right write_* function, handles merged
  ranges via worksheet_merge_range with a nil string so the top-left
  cell's already-written content is preserved. Cleans up a partial
  .xlsx on any emit error so a failed round trip never masquerades as
  a readable file.
- DocumentAdaptersBootstrap: registers XLSXEmitter alongside XLSXAdapter.
- XLSXEmitterTests: 7 tests pinning the round trip end-to-end. Builds
  a Workbook in memory, writes via XLSXEmitter, reads via XLSXAdapter,
  asserts sheet names / formulas / merged ranges / strings / numbers /
  booleans all survive.
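The A1-reference parsing step mentioned above could look roughly like the sketch below. `parseA1` is a hypothetical helper, not the emitter's actual function; the bounds are the standard Excel worksheet limits (column XFD, row 1,048,576).

```swift
/// Hypothetical helper: converts an A1-style reference such as "B3" into
/// 0-indexed (row, col). Returns nil for malformed or out-of-range input.
func parseA1(_ ref: String) -> (row: Int, col: Int)? {
    var col = 0
    var row = 0
    var sawLetter = false
    var sawDigit = false

    for scalar in ref.unicodeScalars {
        switch scalar {
        case "A"..."Z", "a"..."z":
            if sawDigit { return nil }              // letters must precede digits
            let upper = scalar.value >= 97 ? scalar.value - 32 : scalar.value
            col = col * 26 + Int(upper - 64)        // base-26 with A = 1
            if col > 16_384 { return nil }          // XFD is the last Excel column
            sawLetter = true
        case "0"..."9":
            if !sawLetter { return nil }
            row = row * 10 + Int(scalar.value - 48)
            if row > 1_048_576 { return nil }       // last Excel row
            sawDigit = true
        default:
            return nil
        }
    }

    guard sawLetter, sawDigit, row >= 1 else { return nil }
    return (row: row - 1, col: col - 1)
}
```

Column letters are a bijective base-26 number (there is no zero digit), which is why the accumulator uses A = 1 and subtracts one only at the end.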

Licensing footnote: libxlsxwriter is BSD-2-Clause, but bundles
third_party/tmpfileplus/tmpfileplus.c under MPL 2.0. Statically linking
is permitted. A follow-up to AcknowledgementsView should list both;
deliberately out of scope for this PR.

…te_workbook

Exposes the typed Workbook surface to folder-mode agents. Stacks on top
of the XLSX read (osaurus-ai#929) + write (osaurus-ai#936) PRs and completes the stage-4
round-trip goal: an agent can now ingest a spreadsheet, reason about
cells and formulas in their native types, and emit a modified workbook,
all without the model having to hand-roll XML.

- read_workbook: returns a compact JSON summary of every sheet
  (names, row counts, merged ranges, truncated cell sample). Capped at
  200 cells per sheet so large workbooks don't blow the context
  window; agents drop to read_workbook_cell for specific values.
- read_workbook_cell: single-cell lookup by (path, sheet, A1 ref).
  Returns value, formula source, and type in a one-line JSON payload.
- write_workbook: accepts a structured sheets array and emits the
  file via XLSXEmitter. Each cell carries its A1 ref, typed value,
  and optional formula; the schema enum guards against unknown
  types. write_workbook creates parent directories and surfaces a
  sheetCount / totalCells summary on success.
- All three plug into FolderToolFactory.buildCoreTools alongside
  file_read / file_write, so they're registered the moment a working
  folder is selected and go away when it's cleared.
- Tests: 8 tests covering sheet summary rendering, missing-file and
  out-of-root rejection, formula preservation on cell lookup, missing-
  sheet error, end-to-end write + re-parse fidelity, non-xlsx path
  refusal, and empty-sheets validation. Tests reuse the sample.xlsx
  fixture from the XLSX read PR.
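A rough sketch of how a capped sheet summary like the one read_workbook returns might be produced. The field names (`rowCount`, `truncated`, and so on) and the `summarize` / `encodeSummary` helpers are assumptions for illustration; only the 200-cell cap comes from the commit message.

```swift
import Foundation

// Hypothetical payload shapes; the tool's real JSON keys may differ.
struct SketchCellSample: Codable {
    let ref: String
    let value: String
}

struct SketchSheetSummary: Codable {
    let name: String
    let rowCount: Int
    let mergedRanges: [String]
    let cells: [SketchCellSample]
    let truncated: Bool
}

/// Builds a per-sheet summary that includes at most `cellCap` cells.
func summarize(
    name: String,
    rows: [[SketchCellSample]],
    mergedRanges: [String],
    cellCap: Int = 200
) -> SketchSheetSummary {
    let total = rows.reduce(0) { $0 + $1.count }
    // Take cells in row order until the cap is reached.
    let sample = Array(rows.joined().prefix(cellCap))
    return SketchSheetSummary(
        name: name,
        rowCount: rows.count,
        mergedRanges: mergedRanges,
        cells: sample,
        truncated: total > cellCap
    )
}

/// Encodes the summaries as compact JSON with stable key order.
func encodeSummary(_ sheets: [SketchSheetSummary]) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    let data = try encoder.encode(sheets)
    return String(decoding: data, as: UTF8.self)
}
```

An explicit `truncated` flag is one way to tell the agent that it should switch to single-cell lookups instead of assuming the sample is complete.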

Replaces the legacy 'CSV as plain text' ingestion with a typed
CSVTable representation that preserves encoding, delimiter, line-ending
style, and per-row cell boundaries. Pairs a batch adapter for chat
attachments with a streaming variant for multi-GB exports.

- CSVTable / CSVRecord: typed representation + one streamed row shape.
- CSVParser: shared RFC-4180-ish state machine. Handles quoted fields,
  '""' quote escapes, embedded newlines in quoted cells, CRLF / LF /
  bare-CR line endings.
- CSVAdapter: eager, in-memory. Delimiter defaults per extension
  ('.csv' -> ',', '.tsv' -> '\t'). UTF-8 BOM stripping + ISO-Latin-1
  fallback decode. Conservative header heuristic (first row is a
  header when at least one cell is non-numeric and there's a body
  row below). Renders a markdown-style text fallback for chat display.
- CSVStreamer: row-at-a-time AsyncThrowingStream for large files.
  Reads 64 KB chunks, splits at the last complete UTF-8 scalar so
  multi-byte scalars never cross a chunk boundary, feeds bytes
  through the same CSVParser.Machine so quoting / newline semantics
  match the batch path exactly. Honours Task cancellation so the
  agent tool surface can back-pressure.
- Registers both in DocumentAdaptersBootstrap after PlainText so
  later-wins routing picks the typed adapter for '.csv' / '.tsv'.
- Tests: 10 adapter tests (header split, TSV delimiter, quoted commas
  + newlines, '""' escape, UTF-8 BOM, numeric-only header rejection,
  size-limit refusal, empty-file emptyContent, CRLF, canHandle) + 7
  streamer tests (in-order yield, 1-based line numbering, TSV,
  quoted newlines across chunks, cancellation mid-file, UTF-8
  boundary helper coverage).
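The quoting and line-ending rules listed for CSVParser can be illustrated with a small batch scanner. This is a simplified sketch, not the PR's CSVParser.Machine: `parseCSV` is a hypothetical function, it uses one scalar of lookahead instead of a resumable state machine, and skipping blank lines is an assumption.

```swift
/// Hypothetical batch scanner for RFC-4180-style input. Handles quoted fields,
/// '""' escapes, newlines inside quotes, and CRLF / LF / bare-CR row endings.
func parseCSV(_ text: String, delimiter: Unicode.Scalar = ",") -> [[String]] {
    // Work on scalars: Swift treats "\r\n" as a single Character.
    let scalars = Array(text.unicodeScalars)
    var rows: [[String]] = []
    var row: [String] = []
    var field = String.UnicodeScalarView()
    var inQuotes = false
    var rowHasContent = false

    func endField() {
        row.append(String(field))
        field.removeAll()
    }
    func endRow() {
        endField()
        rows.append(row)
        row.removeAll()
        rowHasContent = false
    }

    var i = 0
    while i < scalars.count {
        let s = scalars[i]
        if inQuotes {
            if s == "\"" {
                if i + 1 < scalars.count, scalars[i + 1] == "\"" {
                    field.append("\"")      // '""' is an escaped quote
                    i += 1
                } else {
                    inQuotes = false        // closing quote
                }
            } else {
                field.append(s)             // delimiters and newlines are literal
            }
        } else if s == "\"" {
            inQuotes = true
            rowHasContent = true
        } else if s == delimiter {
            endField()
            rowHasContent = true
        } else if s == "\r" || s == "\n" {
            if s == "\r", i + 1 < scalars.count, scalars[i + 1] == "\n" {
                i += 1                      // CRLF ends the row once
            }
            if rowHasContent { endRow() }   // assumption: blank lines are skipped
        } else {
            field.append(s)
            rowHasContent = true
        }
        i += 1
    }
    if rowHasContent { endRow() }           // final row without a trailing newline
    return rows
}
```

A streaming variant would have to carry `inQuotes` and the pending field across chunk boundaries, which is presumably why the commit shares one machine between the batch and streaming paths.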

Upgrades the PDFAdapter from flat text to a typed PDFDocumentRepresentation
that carries per-page text PLUS a list of detected tables. Turns invoices,
bank statements, and 10-Ks from 'run-together numeric columns' into
proper cell grids without changing the flat-text contract other consumers
rely on.

Detection strategy:
  1. Walk each page's characters and capture (scalar, rect) from
     PDFPage.characterBounds(at:). Whitespace glyphs are dropped up
     front because PDFKit reports their bounds as spanning the visual
     gap they introduce, which would hide column boundaries.
  2. Cluster glyphs into rows by y-coordinate tolerance (3pt).
  3. Within each row, split into cells wherever the inter-glyph gap
     exceeds 8pt — clearly above word-space (~3pt at 12pt body) but
     well below intentional column gaps (>20pt typical).
  4. Collect runs of multi-cell rows into PDFTable regions. Isolated
     single-tabular rows are dropped so form lines like
     'Invoice No. 1234' don't masquerade as tables.
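Steps 2 to 4 of the strategy above can be sketched as pure functions over synthetic glyphs. The function names echo the stages listed in this commit, but the bodies, the `SketchGlyph` type, and the choice to require at least two consecutive multi-cell rows are assumptions, not the PR's PDFTableDetector.

```swift
// Hypothetical glyph shape: whitespace glyphs are assumed to be dropped already.
struct SketchGlyph {
    let character: Character
    let x: Double       // left edge in points
    let y: Double       // vertical position in PDF coordinates (origin bottom-left)
    let width: Double
}

/// Step 2: cluster glyphs into rows by y tolerance, top of page first.
func clusterRows(_ glyphs: [SketchGlyph], yTolerance: Double = 3) -> [[SketchGlyph]] {
    let sorted = glyphs.sorted { $0.y > $1.y }
    var rows: [[SketchGlyph]] = []
    for glyph in sorted {
        if let anchor = rows.last?.first, abs(anchor.y - glyph.y) <= yTolerance {
            rows[rows.count - 1].append(glyph)
        } else {
            rows.append([glyph])
        }
    }
    return rows.map { row in row.sorted { $0.x < $1.x } }
}

/// Step 3: split a row into cells wherever the inter-glyph gap exceeds the threshold.
func cellsForRow(_ row: [SketchGlyph], gapThreshold: Double = 8) -> [String] {
    var cells: [String] = []
    var current = ""
    var previousRight: Double? = nil
    for glyph in row {
        if let right = previousRight, glyph.x - right > gapThreshold {
            cells.append(current)
            current = ""
        }
        current.append(glyph.character)
        previousRight = glyph.x + glyph.width
    }
    if !current.isEmpty { cells.append(current) }
    return cells
}

/// Step 4: keep runs of two or more consecutive multi-cell rows; drop isolated ones.
func groupConsecutiveTabularRows(_ rows: [[String]]) -> [[[String]]] {
    var tables: [[[String]]] = []
    var run: [[String]] = []
    for row in rows {
        if row.count >= 2 {
            run.append(row)
            continue
        }
        if run.count >= 2 { tables.append(run) }
        run.removeAll()
    }
    if run.count >= 2 { tables.append(run) }
    return tables
}

func detectTables(_ glyphs: [SketchGlyph]) -> [[[String]]] {
    groupConsecutiveTabularRows(clusterRows(glyphs).map { cellsForRow($0) })
}
```

Because every stage takes plain values, the heuristic can be exercised with hand-built glyph grids and no PDF rendering at all.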

- PDFDocumentRepresentation / PDFPageRepresentation / PDFTable: the new
  typed shape emitted by the adapter.
- PDFAdapter now emits PDFDocumentRepresentation instead of
  PlainTextRepresentation. textFallback stays the flat concatenation of
  page text so chat attachments render unchanged.
- PDFTableDetector: pure-function stages (clusterRows, cellsForRow,
  groupConsecutiveTabularRows, detect(glyphs:)) exposed internally for
  unit testing without PDFKit, so the heuristic can be pinned against
  synthetic glyph grids that aren't subject to Core Graphics' habit of
  reporting character bounds that span trailing whitespace.
- Image-only PDFs still throw emptyContent so the DocumentParser shim
  can fall through to the legacy image-render path.

Test suite (16 new):
  - Row clustering by y (including descending PDF-coord sort).
  - Cell splitting for wide gaps, word-in-cell, single glyph.
  - Tabular row grouping: multi-row collection, single-cell row split,
    drop-isolated-single-row, empty input.
  - Full detect(glyphs:) on a 3x3 synthetic grid.
  - End-to-end adapter integration (emits PDFDocumentRepresentation,
    preserves text fallback; blank PDF still throws emptyContent).

Adds the host-side bridge between the plugin ABI and the document
format registry so plugin-provided parsers and emitters plug into
DocumentFormatRegistry the same way the in-tree adapters do. A plugin
that registers a parser through this surface ends up as a regular
adapter consumers can look up via registry.adapter(for:) — no plugin-
specific branch in the consumer.

The plugin-side invocation (how a plugin's invoke pointer gets wired
back into the shim adapter) is structured around a PluginDocumentInvoker
protocol so the host-to-plugin callback is a single seam. This PR
wires the Swift side end-to-end and tests it with a fake invoker;
the PluginManager plumbing that threads each plugin's real invoke
pointer into PluginDocumentInvoker lands with a follow-up since it
needs access to PluginManager internals.

- osaurus_plugin.h: adds osr_register_parser_fn / register_emitter_fn /
  unregister_format_fn signatures and the trailing struct fields,
  with full request/response JSON contract documented inline.
  Because the new fields are appended at the end, older plugins
  compiled against the pre-PR v2 layout keep loading: the host
  allocates the struct and zero-inits the new tail.
- PluginBackedDocumentAdapter.swift: Swift shims implementing
  DocumentFormatAdapter and DocumentFormatEmitter by forwarding to a
  plugin via PluginDocumentInvoker.invoke(type:id:payload:). Surfaces
  only the textFallback representation today; richer representations
  (Workbook, PDFDocumentRepresentation) come with a response-schema
  extension once a first plugin needs them.
- PluginDocumentRegistry.swift: tracks format_id -> plugin_id
  ownership so one plugin can't unregister another's format (or
  overwrite an in-tree built-in). Returns JSON envelopes matching the
  C-header contract.
- Tests: 8 scenarios covering happy-path registration, adapter →
  plugin invocation threading, plugin error propagation, emitter
  routing, another-plugin-cannot-overwrite, reject-unregister-by-
  other, unregisterAll teardown on plugin unload, and malformed-JSON
  rejection.
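The ownership rules described for PluginDocumentRegistry can be sketched as a small map from format id to plugin id. `SketchOwnershipRegistry` and its error cases are hypothetical; the real type returns JSON envelopes rather than throwing, and its method names are not shown in this PR text.

```swift
// Hypothetical sketch of format_id -> plugin_id ownership tracking.
enum SketchRegistrationError: Error, Equatable {
    case builtInFormat
    case ownedByAnotherPlugin(String)
    case notRegistered
}

final class SketchOwnershipRegistry {
    private let builtIns: Set<String>
    private var owners: [String: String] = [:]   // format_id -> plugin_id

    init(builtIns: Set<String>) {
        self.builtIns = builtIns
    }

    /// A plugin may claim a format unless it is built in or already owned by another plugin.
    func register(formatID: String, pluginID: String) throws {
        if builtIns.contains(formatID) {
            throw SketchRegistrationError.builtInFormat
        }
        if let owner = owners[formatID], owner != pluginID {
            throw SketchRegistrationError.ownedByAnotherPlugin(owner)
        }
        owners[formatID] = pluginID
    }

    /// Only the owning plugin may release a format.
    func unregister(formatID: String, pluginID: String) throws {
        guard let owner = owners[formatID] else {
            throw SketchRegistrationError.notRegistered
        }
        guard owner == pluginID else {
            throw SketchRegistrationError.ownedByAnotherPlugin(owner)
        }
        owners[formatID] = nil
    }

    /// Teardown on plugin unload: releases every format the plugin owns.
    func unregisterAll(pluginID: String) -> [String] {
        let removed = owners.filter { $0.value == pluginID }.map { $0.key }.sorted()
        for formatID in removed { owners[formatID] = nil }
        return removed
    }

    func owner(of formatID: String) -> String? {
        owners[formatID]
    }
}
```

Checking ownership on both register and unregister is what keeps one plugin from displacing another's parser, independent of load order.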

Closes the stage-4 arc: chat attachments now carry the typed
StructuredDocument (Workbook / CSVTable / PDFDocumentRepresentation)
for formats whose adapter emits more than a text fallback. Every
existing consumer that reads filename / documentContent / fileSize
still works unchanged — the accessors transparently bridge both cases.
Agent tools that know how to consume a Workbook downcast
attachment.structuredDocument?.representation.underlying.

- Attachment.Kind gains .structuredDocument(StructuredDocument) beside
  the existing .image / .document cases. Codable encodes the new case
  as the legacy .document wire shape so persisted chat history stays
  readable by older builds; the typed representation is rebuilt on
  every new file ingest rather than round-tripped through storage.
- Accessors (isDocument, filename, documentContent, fileSizeFormatted,
  fileIcon, estimatedTokens) handle both cases; a new
  attachment.structuredDocument accessor returns the typed value only
  for the new case so agent tools can downcast safely.
- fileIcon now maps tabular extensions (csv / tsv / xlsx / xls / ods)
  to tablecells so the chat chip reflects the format.
- DocumentParser.parseAll routes XLSX / CSV / PDF-with-tables through
  the new case; plaintext-family adapters (PlainTextRepresentation)
  keep emitting the legacy .document tuple since there's no typed
  structure worth surfacing.
- FloatingInputCard's inline chip row: exhaustive switch now pairs
  .document + .structuredDocument under the same DocumentChip path.
- Tests: 9 scenarios covering accessor bridging, typed-only
  structuredDocument accessor, fileIcon dispatch, tokens from
  textFallback, Codable downgrade on encode, legacy-shape decode, XLSX
  emits structured end-to-end, plaintext stays on .document.
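The "encode as the legacy wire shape" behaviour can be illustrated with a hand-written Codable conformance. Everything here is a sketch under assumptions: the `Sketch…` types, the JSON keys (`type`, `filename`, `content`, `fileSize`), and the associated-value layout are hypothetical and are not the PR's Attachment.Kind.

```swift
import Foundation

// Hypothetical typed payload; `representation` stands in for Workbook, CSVTable, etc.
struct SketchStructuredDocument {
    let filename: String
    let fileSize: Int
    let textFallback: String
    let representation: Any     // never persisted in this sketch
}

enum SketchAttachmentKind {
    case image(Data)
    case document(filename: String, content: String, fileSize: Int)
    case structuredDocument(SketchStructuredDocument)
}

extension SketchAttachmentKind: Codable {
    private enum CodingKeys: String, CodingKey {
        case type, data, filename, content, fileSize
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        switch self {
        case .image(let data):
            try container.encode("image", forKey: .type)
            try container.encode(data, forKey: .data)
        case .document(let filename, let content, let fileSize):
            try container.encode("document", forKey: .type)
            try container.encode(filename, forKey: .filename)
            try container.encode(content, forKey: .content)
            try container.encode(fileSize, forKey: .fileSize)
        case .structuredDocument(let doc):
            // Downgrade on encode: persist the legacy document shape only.
            try container.encode("document", forKey: .type)
            try container.encode(doc.filename, forKey: .filename)
            try container.encode(doc.textFallback, forKey: .content)
            try container.encode(doc.fileSize, forKey: .fileSize)
        }
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        switch try container.decode(String.self, forKey: .type) {
        case "image":
            self = .image(try container.decode(Data.self, forKey: .data))
        case "document":
            self = .document(
                filename: try container.decode(String.self, forKey: .filename),
                content: try container.decode(String.self, forKey: .content),
                fileSize: try container.decode(Int.self, forKey: .fileSize)
            )
        default:
            throw DecodingError.dataCorruptedError(
                forKey: .type, in: container, debugDescription: "unknown attachment kind")
        }
    }
}

extension SketchAttachmentKind {
    /// Bridges both document cases for consumers that only need text.
    var documentContent: String? {
        switch self {
        case .image: return nil
        case .document(_, let content, _): return content
        case .structuredDocument(let doc): return doc.textFallback
        }
    }

    /// Returns the typed value only for the structured case.
    var structuredDocument: SketchStructuredDocument? {
        if case .structuredDocument(let doc) = self { return doc }
        return nil
    }
}
```

In this sketch a structured attachment that is saved and reloaded comes back as a plain `.document`, which matches the commit's note that the typed representation is rebuilt on ingest rather than round-tripped through storage.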

Business rationale: Structured chat attachments are only ready for review if their branch passes the same local gate as the rest of the document stack; keeping the rebase lint-clean preserves trust in the file-fidelity harness without asking reviewers to sort through unrelated style failures.

Coding rationale: This commit is limited to strict-lint cleanup in files already touched by the stacked document branch. It keeps behavior unchanged while aligning optional initialization, local helper visibility, closure parameters, regex wrapping, and formatter-compatible multiline conditions with the repo gate.
mimeding force-pushed the feat/structured-document-attachment branch from 3cf598d to be61b0e on May 4, 2026 00:25
mimeding marked this pull request as draft May 10, 2026 14:57