feat(documents): XLSX read via CoreXLSX into a typed Workbook #929

Draft
mimeding wants to merge 4 commits into osaurus-ai:main from mimeding:feat/xlsx-read-adapter

@mimeding
Contributor

@mimeding mimeding commented Apr 23, 2026

Stacked on #927 while #927 awaits merge. This branch has been rebased onto current origin/main (fd2ffece, #1015). Because #927 is not merged yet, this PR still carries the adapter-wrapper commit; reviewers should focus on the XLSX read adapter commit and the small rebase cleanup commit.

Business rationale

This is the first spreadsheet-format slice that delivers file fidelity instead of reducing a workbook to unsupported bytes or a hand-copied table. Business users attach XLSX files constantly; preserving sheet names, formulas, booleans, merged ranges, and shared strings lets osaurus answer questions about real workbooks while keeping the typed data available for later agent tools. It advances the harness through file fidelity: the local agent can reason over the user's actual business document without sending it elsewhere or losing the structure that makes the file trustworthy.

Coding rationale

CoreXLSX 0.14.2 is used only at the adapter boundary because it already exposes the read-side OOXML structures we need and resolves cleanly on current SwiftPM. The osaurus-owned Workbook model is the public intermediate instead of leaking CoreXLSX types, which keeps future emitter, tool, or vendor-fork work insulated from library churn. Styles are intentionally not parsed in this slice because CoreXLSX crashes on common style defaults; value and formula fidelity can ship independently, while number formats and explicit date typing stay out until a tolerant styles parser exists. The text fallback remains deliberately lossy and readable, while high-fidelity consumers use the structured representation.

What changed

  • Added CoreXLSX to Packages/OsaurusCore/Package.swift and declared the XLSX fixture resource.
  • Added Workbook, Sheet, Row, Cell, CellValue, and CellRange as Sendable document models.
  • Added XLSXAdapter with format detection, workbook parsing, formula preservation, merged-range capture, and readable text fallback.
  • Registered the XLSX adapter through DocumentAdaptersBootstrap so DocumentParser.parseAll routes .xlsx through the registry.
  • Added the small XLSX fixture and adapter/bootstrap tests.
  • Kept the rebase cleanup limited to strict-lint shape fixes in touched files.
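
For reviewers who want a picture of the typed intermediate, here is a minimal sketch of what the Sendable models could look like. The type names come from the list above; every field name and enum case is an assumption for illustration and may not match the code in this PR.

```swift
// Hypothetical shapes: the PR names these types but the fields below are assumed.
enum CellValue: Sendable, Equatable {
    case number(Double)
    case string(String)
    case boolean(Bool)
    case empty
}

struct Cell: Sendable, Equatable {
    let reference: String   // A1-style reference, e.g. "B3"
    let value: CellValue
    let formula: String?    // formula source text, if the cell has one
}

struct Row: Sendable, Equatable {
    let index: Int          // 1-based row number
    let cells: [Cell]
}

struct CellRange: Sendable, Equatable {
    let start: String       // top-left cell, e.g. "A5"
    let end: String         // bottom-right cell, e.g. "B5"
}

struct Sheet: Sendable, Equatable {
    let name: String
    let rows: [Row]
    let mergedRanges: [CellRange]
}

struct Workbook: Sendable, Equatable {
    let sheets: [Sheet]
}

// Build a one-sheet workbook with a formula cell and a merged range.
let total = Cell(reference: "B3", value: .number(30), formula: "SUM(B1:B2)")
let sheet = Sheet(
    name: "Summary",
    rows: [Row(index: 3, cells: [total])],
    mergedRanges: [CellRange(start: "A5", end: "B5")]
)
let workbook = Workbook(sheets: [sheet])
```

Because the sketch uses only value types with Sendable members, a workbook built this way can cross actor boundaries without extra annotations, which is presumably the reason the PR declares the models Sendable.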

Validation

  • git fetch origin && git rebase origin/main — clean after resolving Package.resolved to the CoreXLSX dependency set.
  • swift build --package-path Packages/OsaurusCore — passed.
  • swift build --package-path Packages/OsaurusCore -c release — passed.
  • swift test --package-path Packages/OsaurusCore — passed, 1461 tests in 197 suites, with the sandbox integration suite skipped by its normal env gate.
  • xcrun swift-format lint --strict on every touched Swift file — passed.
  • swiftlint lint --strict on every touched Swift file — passed.
  • git diff --check origin/main...HEAD — passed.
  • CLI gate skipped because this slice does not touch Packages/OsaurusCLI.

Non-scope

  • No XLSX writing; the emitter remains a later slice.
  • No number formats, column widths, explicit date typing, or other style-derived fidelity.
  • No legacy .xls or ODS support.
  • No workbook agent tools in this PR.
  • No attachment-kind changes in this PR.

Residual risks

CoreXLSX is effectively unmaintained, so a future Swift release may require a fork or hand-rolled parser. The Workbook wrapper is the mitigation: downstream osaurus code depends on the local model, not the vendor type. The styles gap is also intentional rather than solved; any future feature that needs formatted dates or number formats must land a tolerant styles parser first.

@mimeding
Contributor Author

CI test-core note for maintainers — same upstream failure as #928, not caused by this PR.

Error on every test-core run here:

error: Unable to find module dependency: 'CAsyncHTTPClient' (in target 'EventSource' from project 'EventSource')

Identical stack to #921 and #928 (CNIOLLHTTP, CNIOExtrasZlib, CNIOPosix, _NumericsShims). The break is environmental: it started between 19:14 UTC (when #927 was green) and 19:27 UTC (first failure across every PR). The likely culprit is an Xcode ↔ SwiftPM-traits bug around the opt-in AsyncHTTPClient trait in mattt/eventsource 1.1+; the trait is never enabled in osaurus, but Xcode pulls in CAsyncHTTPClient as if it were. See my note on #928 for the full diagnosis.

Local verification for this PR's diff:

  • swift test --filter XLSXAdapter — 7/7 pass.
  • swift test --filter 'DocumentParserShim|DocumentFormatRegistry|PlainTextAdapter|PDFAdapter|RichDocumentAdapter' — all pass.
  • Full swift test on OsaurusCore — 915/915 pass.
  • xcrun swift-format lint --strict and swiftlint — clean on every new file.

The one chore commit on this branch (chore: re-trigger CI…) was me trying to force a rebuild; it's pollution and should be squashed at merge time.

@mimeding
Contributor Author

Realignment update: I am not taking this forward as a Core PR. Based on the Core/plugin split discussed on #926, XLSX read fidelity belongs in osaurus-xlsx. I will preserve the useful behavior and tests here as plugin work: sheet names, formulas, merged ranges, booleans, shared strings, numeric cells, and compact text fallback. Core should stay limited to the neutral adapter/registry/plugin bridge.

mimeding force-pushed the feat/xlsx-read-adapter branch 5 times, most recently from b1b8524 to 36499a0 on April 29, 2026 19:13
mimeding force-pushed the feat/xlsx-read-adapter branch from ef50e2b to 42a0f45 on May 1, 2026 02:54
mimeding added a commit to mimeding/osaurus that referenced this pull request May 1, 2026
…te_workbook

Exposes the typed Workbook surface to folder-mode agents. Stacks on top
of the XLSX read (osaurus-ai#929) + write (osaurus-ai#936) PRs and completes the stage-4
round-trip goal: an agent can now ingest a spreadsheet, reason about
cells and formulas in their native types, and emit a modified workbook,
all without the model having to hand-roll XML.

- read_workbook: returns a compact JSON summary of every sheet
  (names, row counts, merged ranges, truncated cell sample). Capped at
  200 cells per sheet so large workbooks don't blow the context
  window; agents drop to read_workbook_cell for specific values.
- read_workbook_cell: single-cell lookup by (path, sheet, A1 ref).
  Returns value, formula source, and type in a one-line JSON payload.
- write_workbook: accepts a structured sheets array and emits the
  file via XLSXEmitter. Each cell carries its A1 ref, typed value,
  and optional formula; the schema enum guards against unknown
  types. write_workbook creates parent directories and surfaces a
  sheetCount / totalCells summary on success.
- All three plug into FolderToolFactory.buildCoreTools alongside
  file_read / file_write, so they're registered the moment a working
  folder is selected and go away when it's cleared.
- Tests: 8 tests covering sheet summary rendering, missing-file and
  out-of-root rejection, formula preservation on cell lookup,
  missing-sheet error, end-to-end write + re-parse fidelity, non-xlsx
  path refusal, and empty-sheets validation. Tests reuse the sample.xlsx
  fixture from the XLSX read PR.
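
The single-cell lookup described above takes an A1 reference. As a rough illustration of what resolving one involves, the sketch below splits a reference into a 1-based column and row. The helper name `parseA1` and its return shape are assumptions, not code from the referenced commit.

```swift
/// Hypothetical helper: splits an A1-style reference such as "AB12" into a
/// 1-based column and row. Returns nil for malformed input.
func parseA1(_ ref: String) -> (column: Int, row: Int)? {
    var column = 0
    var rowDigits = ""
    for ch in ref.uppercased() {
        guard let ascii = ch.asciiValue else { return nil }
        switch ascii {
        case 65...90:  // "A"..."Z"
            // Letters must come before any digit.
            if !rowDigits.isEmpty { return nil }
            column = column * 26 + Int(ascii) - 64
            // XLSX sheets stop at column XFD (16384); this also prevents overflow.
            if column > 16_384 { return nil }
        case 48...57:  // "0"..."9"
            rowDigits.append(ch)
        default:
            return nil
        }
    }
    guard column > 0, let row = Int(rowDigits), row > 0 else { return nil }
    return (column, row)
}

let parsed = parseA1("AB12")
let lowercase = parseA1("b3")
let malformed = parseA1("3B")
```

Column letters are treated as a bijective base-26 number, so "A" maps to 1, "Z" to 26, and "AB" to 28.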
@mimeding mimeding marked this pull request as ready for review May 1, 2026 03:22
mimeding force-pushed the feat/xlsx-read-adapter branch from 42a0f45 to 47417f6 on May 1, 2026 04:36
mimeding force-pushed the feat/xlsx-read-adapter branch from 47417f6 to e4ea0c4 on May 1, 2026 20:17
mimeding and others added 4 commits May 3, 2026 19:12
…ntParser through the registry

Migrates the three ingress paths already handled by DocumentParser onto
the adapter surface introduced in the foundations PR, without changing
any user-observable behaviour. parseAll now consults the registry first
and falls back to its existing switch for anything an adapter hasn't
claimed or has declined — specifically image-only PDFs, which continue
to render via the legacy fallback until the layout-aware PDF rework
lands.

- PlainTextAdapter wraps the existing UTF-8 / ISO-Latin-1 retry path
  and the 500K-character truncation marker so the legacy behaviour
  stays byte-identical.
- PDFAdapter wraps PDFKit text extraction; it throws emptyContent when
  there is no text layer so the shim falls through to the legacy
  image-render path rather than claiming a result it cannot produce.
- RichDocumentAdapter wraps NSAttributedString across docx/doc/rtf/html;
  a single adapter for all four because they share the framework call
  today, splitting when high-fidelity DOCX lands.
- DocumentAdaptersBootstrap registers the three on the shared registry
  from AppDelegate.applicationDidFinishLaunching exactly once so the
  shim sees adapters on the first file ingress.
- PlainTextRepresentation is the neutral text shape for adapters that
  cannot yet publish a format-native representation; replaced per-format
  by Workbook / WordDocument / etc. in later PRs.
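
The registry-first, legacy-second flow in this commit message can be sketched roughly as follows. Only the `emptyContent` error and the ordering come from the text above; the protocol, the stub adapter, and the function names are hypothetical, and a plain String stands in for real file data.

```swift
// Hypothetical names throughout; only `emptyContent` and the
// registry-then-legacy ordering come from the commit message.
enum AdapterError: Error {
    case emptyContent
}

protocol DocumentAdapter {
    var id: String { get }
    func canHandle(fileExtension: String) -> Bool
    func extractText(from input: String) throws -> String
}

struct StubPDFAdapter: DocumentAdapter {
    let id = "pdf"
    func canHandle(fileExtension: String) -> Bool { fileExtension == "pdf" }
    func extractText(from input: String) throws -> String {
        // An empty input stands in for a PDF with no text layer.
        guard !input.isEmpty else { throw AdapterError.emptyContent }
        return input
    }
}

func parseViaRegistry(
    _ input: String,
    fileExtension: String,
    adapters: [any DocumentAdapter],
    legacy: (String) -> String
) -> String {
    // Ask each claiming adapter first; any thrown error counts as "declined".
    for adapter in adapters where adapter.canHandle(fileExtension: fileExtension) {
        if let text = try? adapter.extractText(from: input) {
            return text
        }
    }
    // Nothing claimed the file, or every claimant declined: use the old path.
    return legacy(input)
}

let adapters: [any DocumentAdapter] = [StubPDFAdapter()]
let withTextLayer = parseViaRegistry("page text", fileExtension: "pdf", adapters: adapters) { _ in "legacy" }
let imageOnly = parseViaRegistry("", fileExtension: "pdf", adapters: adapters) { _ in "legacy" }
let unclaimed = parseViaRegistry("a,b", fileExtension: "csv", adapters: adapters) { _ in "legacy" }
```

Throwing instead of returning an empty result keeps the adapter honest: the shim can tell "this adapter produced nothing" apart from "this document is empty" and route the former to the fallback.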
First real-fidelity document adapter. Reads .xlsx into a typed Workbook
representation carrying sheet names, cells with formula source strings,
merged-range references, shared strings, and cell types (number, shared
string, inline string, boolean). The text fallback renders each sheet
as a tab-separated table so callers still on the legacy
Attachment.Kind.document path see something readable.
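
A minimal sketch of the tab-separated fallback described above, assuming cells have already been reduced to display strings. The function name and the "# name" header line are assumptions; the commit message only specifies one tab-separated table per sheet.

```swift
// Hypothetical helper: the "# name" header line is an assumption; the
// commit message only specifies one tab-separated table per sheet.
func renderSheetAsText(name: String, rows: [[String]]) -> String {
    // One line per row, cells joined by tabs.
    let lines = rows.map { $0.joined(separator: "\t") }
    return (["# \(name)"] + lines).joined(separator: "\n")
}

let rendered = renderSheetAsText(
    name: "Summary",
    rows: [["Item", "Qty"], ["Apples", "3"]]
)
```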

The adapter deliberately does NOT call CoreXLSX's parseStyles() — that
entry point crashes on openpyxl-generated workbooks because the
library's PatternFill.patternType is non-optional while Excel's default
empty pattern omits the attribute. Everything we surface today is
style-independent; lifting that limitation (number formats, column
widths, dates stored as styled numbers) lives in a follow-up slice
behind a hand-rolled styles fallback.
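
The failure mode described above, a non-optional field meeting input that omits it, can be reproduced in miniature with Foundation's Codable. This is an analogy only: it uses JSON rather than XML, the two struct names are hypothetical, and in this sketch the result is a thrown decoding error rather than whatever CoreXLSX does internally.

```swift
import Foundation

// Hypothetical types that mirror the shape of the problem, not CoreXLSX's own.
struct StrictFill: Decodable {
    let patternType: String   // non-optional: the key must be present
}

struct TolerantFill: Decodable {
    let patternType: String?  // optional: a missing key decodes as nil
}

// An "empty pattern" that omits the attribute entirely.
let emptyFill = Data("{}".utf8)

// Decoding the strict shape throws, so `try?` yields nil.
let strict = try? JSONDecoder().decode(StrictFill.self, from: emptyFill)

// Decoding the tolerant shape succeeds with patternType == nil.
let tolerant = try? JSONDecoder().decode(TolerantFill.self, from: emptyFill)
```

A hand-rolled styles fallback of the kind mentioned above would presumably take the tolerant route, treating every style attribute as optional.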

- Package.swift: CoreXLSX 0.14.2 dependency for the core target,
  testTarget resource declaration for the xlsxwriter-produced fixture.
- Workbook / Sheet / Row / Cell / CellValue / CellRange: the typed
  intermediate that both the XLSX read path and the eventual XLSX write
  emitter round-trip through.
- XLSXAdapter: the actual CoreXLSX → Workbook translator +
  markdown-style text fallback.
- DocumentAdaptersBootstrap: registers XLSXAdapter alongside PlainText /
  PDF / RichDocument, so DocumentParser.parseAll now routes .xlsx
  through the registry instead of throwing unsupportedFormat.
- Tests/Documents/Fixtures/xlsx/sample.xlsx: 5.9 KB fixture with two
  sheets, a SUM formula, a merged range (A5:B5), shared strings, and
  explicit booleans. Exercises the parse paths for each fidelity feature.
- XLSXAdapterTests: 7 tests pinning format routing, sheet/cell
  structure, formulas, merged ranges, shared strings, booleans, text
  fallback formatting, and size-limit refusal.
- DocumentParserShimTests: expands the bootstrap assertion to include
  "xlsx" alongside the three existing adapter ids.
Business rationale: XLSX ingestion is a core file-fidelity path for business users, and the branch needs to stay reviewable and CI-clean after rebasing onto the stabilized main gate.

Coding rationale: This keeps the rebase cleanup scoped to SwiftLint-only shape fixes in touched files, preserving existing parser and app-delegate behavior while satisfying the repo's touched-file lint rule.

Co-authored-by: Codex <codex@openai.com>
mimeding force-pushed the feat/xlsx-read-adapter branch from e4ea0c4 to 8210022 on May 3, 2026 22:26
mimeding added a commit to mimeding/osaurus that referenced this pull request May 3, 2026
…te_workbook

Exposes the typed Workbook surface to folder-mode agents. Stacks on top
of the XLSX read (osaurus-ai#929) + write (osaurus-ai#936) PRs and completes the stage-4
round-trip goal: an agent can now ingest a spreadsheet, reason about
cells and formulas in their native types, and emit a modified workbook
— all without the model having to handroll XML.

- read_workbook: returns a compact JSON summary of every sheet
  (names, row counts, merged ranges, truncated cell sample). Capped at
  200 cells per sheet so large workbooks don't blow the context
  window; agents drop to read_workbook_cell for specific values.
- read_workbook_cell: single-cell lookup by (path, sheet, A1 ref).
  Returns value, formula source, and type in a one-line JSON payload.
- write_workbook: accepts a structured sheets array and emits the
  file via XLSXEmitter. Each cell carries its A1 ref, typed value,
  and optional formula; the schema enum guards against unknown
  types. write_workbook creates parent directories and surfaces a
  sheetCount / totalCells summary on success.
- All three plug into FolderToolFactory.buildCoreTools alongside
  file_read / file_write, so they're registered the moment a working
  folder is selected and go away when it's cleared.
- Tests: 8 tests covering sheet summary rendering, missing-file and
  out-of-root rejection, formula preservation on cell lookup, missing-
  sheet error, end-to-end write + re-parse fidelity, non-xlsx path
  refusal, and empty-sheets validation. Tests reuse the sample.xlsx
  fixture from the XLSX read PR.
mimeding added a commit to mimeding/osaurus that referenced this pull request May 3, 2026
…te_workbook

Exposes the typed Workbook surface to folder-mode agents. Stacks on top
of the XLSX read (osaurus-ai#929) + write (osaurus-ai#936) PRs and completes the stage-4
round-trip goal: an agent can now ingest a spreadsheet, reason about
cells and formulas in their native types, and emit a modified workbook
— all without the model having to handroll XML.

- read_workbook: returns a compact JSON summary of every sheet
  (names, row counts, merged ranges, truncated cell sample). Capped at
  200 cells per sheet so large workbooks don't blow the context
  window; agents drop to read_workbook_cell for specific values.
- read_workbook_cell: single-cell lookup by (path, sheet, A1 ref).
  Returns value, formula source, and type in a one-line JSON payload.
- write_workbook: accepts a structured sheets array and emits the
  file via XLSXEmitter. Each cell carries its A1 ref, typed value,
  and optional formula; the schema enum guards against unknown
  types. write_workbook creates parent directories and surfaces a
  sheetCount / totalCells summary on success.
- All three plug into FolderToolFactory.buildCoreTools alongside
  file_read / file_write, so they're registered the moment a working
  folder is selected and go away when it's cleared.
- Tests: 8 tests covering sheet summary rendering, missing-file and
  out-of-root rejection, formula preservation on cell lookup, missing-
  sheet error, end-to-end write + re-parse fidelity, non-xlsx path
  refusal, and empty-sheets validation. Tests reuse the sample.xlsx
  fixture from the XLSX read PR.
mimeding added a commit to mimeding/osaurus that referenced this pull request May 4, 2026
…te_workbook

@mimeding mimeding marked this pull request as draft May 10, 2026 14:57