feat(documents): CSV / TSV streaming adapter with typed CSVTable#939
Draft
mimeding wants to merge 8 commits into
This was referenced Apr 24, 2026
…ntParser through the registry

Migrates the three ingress paths already handled by DocumentParser onto the adapter surface introduced in the foundations PR, without changing any user-observable behaviour. parseAll now consults the registry first and falls back to its existing switch for anything an adapter hasn't claimed or has declined — specifically image-only PDFs, which continue to render via the legacy fallback until the layout-aware PDF rework lands.
- PlainTextAdapter wraps the existing UTF-8 / ISO-Latin-1 retry path and the 500K-character truncation marker so the legacy behaviour stays byte-identical.
- PDFAdapter wraps PDFKit text extraction; it throws emptyContent when there is no text layer so the shim falls through to the legacy image-render path rather than claiming a result it cannot produce.
- RichDocumentAdapter wraps NSAttributedString across docx/doc/rtf/html; a single adapter for all four because they share the framework call today, splitting when high-fidelity DOCX lands.
- DocumentAdaptersBootstrap registers the three on the shared registry from AppDelegate.applicationDidFinishLaunching exactly once so the shim sees adapters on the first file ingress.
- PlainTextRepresentation is the neutral text shape for adapters that cannot yet publish a format-native representation; replaced per-format by Workbook / WordDocument / etc. in later PRs.
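The registry-first routing with a legacy fallback described above can be sketched as follows. This is a minimal illustration, not the committed API: the protocol shape, method signatures, and the `parseAll` free function are assumptions; only the names `DocumentAdapter`, `DocumentFormatRegistry`, and `parseAll` come from the PR text.

```swift
import Foundation

// Illustrative sketch of registry-first routing with a legacy fallback.
protocol DocumentAdapter {
    var id: String { get }
    /// True when this adapter claims the file extension.
    func canHandle(fileExtension: String) -> Bool
    /// Throws (e.g. on an empty text layer) to decline, letting the
    /// caller fall through to the legacy path.
    func parse(data: Data, fileExtension: String) throws -> String
}

final class DocumentFormatRegistry {
    private var adapters: [DocumentAdapter] = []
    func register(_ adapter: DocumentAdapter) { adapters.append(adapter) }

    /// Later-wins: the most recently registered adapter that claims
    /// the extension is consulted first.
    func adapter(for fileExtension: String) -> DocumentAdapter? {
        adapters.reversed().first { $0.canHandle(fileExtension: fileExtension) }
    }
}

func parseAll(data: Data, fileExtension: String,
              registry: DocumentFormatRegistry,
              legacy: (Data, String) -> String) -> String {
    if let adapter = registry.adapter(for: fileExtension),
       let text = try? adapter.parse(data: data, fileExtension: fileExtension) {
        return text
    }
    // Anything unclaimed or declined (e.g. image-only PDFs) falls back
    // to the existing switch-based path.
    return legacy(data, fileExtension)
}
```

The later-wins lookup is what lets a typed adapter registered after PlainText take over an extension without the shim changing.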
First real-fidelity document adapter. Reads .xlsx into a typed Workbook representation carrying sheet names, cells with formula source strings, merged-range references, shared strings, and cell types (number, shared string, inline string, boolean). The text fallback renders each sheet as a tab-separated table so callers still on the legacy Attachment.Kind.document path see something readable.

The adapter deliberately does NOT call CoreXLSX's parseStyles() — that entry point crashes on openpyxl-generated workbooks because the library's PatternFill.patternType is non-optional while Excel's default empty pattern omits the attribute. Everything we surface today is style-independent; lifting that limitation (number formats, column widths, dates stored as styled numbers) lives in a follow-up slice behind a hand-rolled styles fallback.
- Package.swift: CoreXLSX 0.14.2 dependency for the core target, testTarget resource declaration for the xlsxwriter-produced fixture.
- Workbook / Sheet / Row / Cell / CellValue / CellRange: the typed intermediate that both the XLSX read path and the eventual XLSX write emitter round-trip through.
- XLSXAdapter: the actual CoreXLSX → Workbook translator + markdown-style text fallback.
- DocumentAdaptersBootstrap: registers XLSXAdapter alongside PlainText / PDF / RichDocument, so DocumentParser.parseAll now routes .xlsx through the registry instead of throwing unsupportedFormat.
- Tests/Documents/Fixtures/xlsx/sample.xlsx: 5.9 KB fixture with two sheets, a SUM formula, a merged range (A5:B5), shared strings, and explicit booleans. Exercises the parse paths for each fidelity feature.
- XLSXAdapterTests: 7 tests pinning format routing, sheet/cell structure, formulas, merged ranges, shared strings, booleans, text fallback formatting, and size-limit refusal.
- DocumentParserShimTests: expands the bootstrap assertion to include "xlsx" alongside the three existing adapter ids.
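The typed intermediate named above (Workbook / Sheet / Row / Cell / CellValue) could look roughly like this; field names and exact shapes here are assumptions for illustration, not the committed types.

```swift
import Foundation

// Hypothetical shape of the typed intermediate that the XLSX read path
// produces and the write emitter consumes. Field names are assumptions.
enum CellValue: Equatable {
    case number(Double)
    case sharedString(String)
    case inlineString(String)
    case boolean(Bool)
}

struct Cell: Equatable {
    let reference: String        // A1-style, e.g. "B2"
    let value: CellValue
    let formula: String?         // formula source, e.g. "SUM(A1:A4)"
}

struct Row: Equatable { let cells: [Cell] }

struct Sheet: Equatable {
    let name: String
    let rows: [Row]
    let mergedRanges: [String]   // e.g. ["A5:B5"]
}

struct Workbook: Equatable { let sheets: [Sheet] }
```

Keeping formulas as source strings and shared strings as a distinct case is what lets the later write emitter round-trip a workbook without re-deriving anything.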
Pairs with XLSXAdapter so agents can ingest a workbook, modify the typed Workbook in-process, and emit it back as a fresh .xlsx attachment. libxlsxwriter ships a first-party Swift Package as a pure C SwiftPM target, so no XCFramework / vendored C source is needed in osaurus itself — it's just a dependency add.
- Package.swift: libxlsxwriter 1.2.4 dependency for the core target.
- XLSXEmitter: Workbook -> .xlsx via libxlsxwriter. Parses A1 cell references into 0-indexed row/col, dispatches strings / numbers / booleans / formulas to the right write_* function, handles merged ranges via worksheet_merge_range with a nil string so the top-left cell's already-written content is preserved. Cleans up a partial .xlsx on any emit error so a failed round trip never masquerades as a readable file.
- DocumentAdaptersBootstrap: registers XLSXEmitter alongside XLSXAdapter.
- XLSXEmitterTests: 7 tests pinning the round trip end-to-end. Builds a Workbook in memory, writes via XLSXEmitter, reads via XLSXAdapter, asserts sheet names / formulas / merged ranges / strings / numbers / booleans all survive.

Licensing footnote: libxlsxwriter is BSD-2-Clause, but bundles third_party/tmpfileplus/tmpfileplus.c under MPL 2.0. Statically linking is permitted. A follow-up to AcknowledgementsView should list both; deliberately out of scope for this PR.
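The A1 → 0-indexed row/col conversion the emitter needs before calling libxlsxwriter's write_* functions can be sketched as below. The helper name is ours, not the real XLSXEmitter internals; only the 0-indexing requirement comes from the PR text.

```swift
import Foundation

// Sketch: parse an A1-style reference ("B2", "AA10") into 0-indexed
// (row, col), the convention libxlsxwriter's C API uses.
func parseA1(_ reference: String) -> (row: Int, col: Int)? {
    var col = 0
    var digits = ""
    var seenDigit = false
    for ch in reference.uppercased() {
        if let v = ch.asciiValue, v >= 65, v <= 90, !seenDigit {
            col = col * 26 + Int(v - 64)   // column letters are base-26 with A = 1
        } else if ch.isNumber {
            seenDigit = true
            digits.append(ch)
        } else {
            return nil                     // letters after digits, or junk
        }
    }
    guard col > 0, let row = Int(digits), row > 0 else { return nil }
    return (row - 1, col - 1)              // convert 1-based A1 to 0-indexed
}
```

Note the base-26 arithmetic has no zero digit: "Z" is 26 and "AA" is 27, which is why the accumulator multiplies before adding and only subtracts one at the end.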
…te_workbook

Exposes the typed Workbook surface to folder-mode agents. Stacks on top of the XLSX read (osaurus-ai#929) + write (osaurus-ai#936) PRs and completes the stage-4 round-trip goal: an agent can now ingest a spreadsheet, reason about cells and formulas in their native types, and emit a modified workbook — all without the model having to handroll XML.
- read_workbook: returns a compact JSON summary of every sheet (names, row counts, merged ranges, truncated cell sample). Capped at 200 cells per sheet so large workbooks don't blow the context window; agents drop to read_workbook_cell for specific values.
- read_workbook_cell: single-cell lookup by (path, sheet, A1 ref). Returns value, formula source, and type in a one-line JSON payload.
- write_workbook: accepts a structured sheets array and emits the file via XLSXEmitter. Each cell carries its A1 ref, typed value, and optional formula; the schema enum guards against unknown types. write_workbook creates parent directories and surfaces a sheetCount / totalCells summary on success.
- All three plug into FolderToolFactory.buildCoreTools alongside file_read / file_write, so they're registered the moment a working folder is selected and go away when it's cleared.
- Tests: 8 tests covering sheet summary rendering, missing-file and out-of-root rejection, formula preservation on cell lookup, missing-sheet error, end-to-end write + re-parse fidelity, non-xlsx path refusal, and empty-sheets validation. Tests reuse the sample.xlsx fixture from the XLSX read PR.
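The structured arguments write_workbook accepts could decode into something like the Codable types below. This is a hedged sketch: the field names mirror the description (sheets array, A1 ref, typed value, optional formula, a guarding type enum), but the exact wire schema is an assumption.

```swift
import Foundation

// Hypothetical argument shape for the write_workbook tool.
struct WriteWorkbookArgs: Codable {
    enum CellType: String, Codable { case string, number, boolean }

    struct CellSpec: Codable {
        let ref: String          // A1 reference, e.g. "B2"
        let type: CellType       // enum rejects unknown type strings at decode
        let value: String        // stringly-typed; converted per `type`
        let formula: String?     // optional formula source
    }

    struct SheetSpec: Codable {
        let name: String
        let cells: [CellSpec]
    }

    let path: String
    let sheets: [SheetSpec]
}
```

Because `CellType` is a raw-value enum, an unrecognised `type` string fails decoding up front instead of silently writing a wrong cell, which matches the "schema enum guards against unknown types" behaviour described above.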
Replaces the legacy 'CSV as plain text' ingestion with a typed
CSVTable representation that preserves encoding, delimiter, line-ending
style, and per-row cell boundaries. Pairs a batch adapter for chat
attachments with a streaming variant for multi-GB exports.
- CSVTable / CSVRecord: typed representation + one streamed row shape.
- CSVParser: shared RFC-4180-ish state machine. Handles quoted fields,
'""' quote escapes, embedded newlines in quoted cells, CRLF / LF /
bare-CR line endings.
- CSVAdapter: eager, in-memory. Delimiter defaults per extension
('.csv' -> ',', '.tsv' -> '\t'). UTF-8 BOM stripping + ISO-Latin-1
fallback decode. Conservative header heuristic (first row is a
header when at least one cell is non-numeric and there's a body
row below). Renders a markdown-style text fallback for chat display.
- CSVStreamer: row-at-a-time AsyncThrowingStream for large files.
Reads 64 KB chunks, splits at the last complete UTF-8 scalar so
multi-byte scalars never cross a chunk boundary, feeds bytes
through the same CSVParser.Machine so quoting / newline semantics
match the batch path exactly. Honours Task cancellation so the
agent tool surface can apply back-pressure.
- Registers both in DocumentAdaptersBootstrap after PlainText so
later-wins routing picks the typed adapter for '.csv' / '.tsv'.
- Tests: 10 adapter tests (header split, TSV delimiter, quoted commas
+ newlines, '""' escape, UTF-8 BOM, numeric-only header rejection,
size-limit refusal, empty-file emptyContent, CRLF, canHandle) + 7
streamer tests (in-order yield, 1-based line numbering, TSV,
quoted newlines across chunks, cancellation mid-file, UTF-8
boundary helper coverage).
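The quoting and line-ending rules above can be compressed into a small sketch. The real CSVParser.Machine is byte-feedable for the streaming path; this whole-string version is illustrative only, and the function name is ours.

```swift
import Foundation

// RFC-4180-ish sketch: quoted fields, "" escapes, embedded newlines in
// quoted cells, and CRLF / LF / bare-CR record breaks.
func parseCSV(_ text: String, delimiter: Character = ",") -> [[String]] {
    var rows: [[String]] = []
    var row: [String] = []
    var field = ""
    var inQuotes = false
    var i = text.startIndex

    func endField() { row.append(field); field = "" }
    func endRow() { endField(); rows.append(row); row = [] }

    while i < text.endIndex {
        let ch = text[i]
        if inQuotes {
            let next = text.index(after: i)
            if ch == "\"", next < text.endIndex, text[next] == "\"" {
                field.append("\"")             // "" escape inside quotes
                i = next                       // consume the second quote
            } else if ch == "\"" {
                inQuotes = false               // closing quote
            } else {
                field.append(ch)               // delimiters and newlines are literal here
            }
        } else if ch == "\"" {
            inQuotes = true                    // opening quote
        } else if ch == delimiter {
            endField()
        } else if ch == "\n" || ch == "\r" || ch == "\r\n" {
            endRow()                           // Swift folds CRLF into one Character
        } else {
            field.append(ch)
        }
        i = text.index(after: i)
    }
    if !field.isEmpty || !row.isEmpty { endRow() }  // flush a trailing record
    return rows
}
```

One Swift-specific wrinkle worth noting: CRLF is a single extended grapheme cluster, so iterating Characters sees "\r\n" as one element, which is why the record-break test matches it explicitly alongside bare LF and bare CR.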
Business rationale: CSV and TSV streaming make large business data files usable by agents, and the rebased branch must remain reviewable and CI-clean.

Coding rationale: This keeps the cleanup scoped to touched-file lint shape issues inherited from the lower document stack while preserving the CSV adapter behavior and the lean folder-tool model from main.

Co-authored-by: Codex <codex@openai.com>
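The chunk-boundary rule the streamer relies on (split a raw read at the last complete UTF-8 scalar and carry the remainder into the next chunk) can be sketched like this; the helper name is illustrative, not the real CSVStreamer API.

```swift
import Foundation

// Sketch: given a raw chunk of bytes, return the prefix ending on a
// complete UTF-8 scalar plus the incomplete tail to prepend to the
// next chunk, so multi-byte scalars never cross a chunk boundary.
func splitAtLastCompleteUTF8Scalar(_ bytes: [UInt8]) -> (complete: [UInt8], carry: [UInt8]) {
    var i = bytes.count
    var back = 0
    // A UTF-8 scalar is at most 4 bytes, so walk back at most 3
    // continuation bytes looking for the final scalar's lead byte.
    while i > 0 && back < 4 {
        let b = bytes[i - 1]
        if b & 0b1100_0000 != 0b1000_0000 {            // not a continuation byte
            let expected: Int
            if b & 0b1000_0000 == 0 { expected = 1 }                // ASCII
            else if b & 0b1110_0000 == 0b1100_0000 { expected = 2 } // 2-byte lead
            else if b & 0b1111_0000 == 0b1110_0000 { expected = 3 } // 3-byte lead
            else { expected = 4 }                                   // 4-byte lead
            let have = bytes.count - (i - 1)
            if have == expected {
                return (bytes, [])                     // final scalar is complete
            }
            // Split just before the incomplete lead byte.
            return (Array(bytes[..<(i - 1)]), Array(bytes[(i - 1)...]))
        }
        i -= 1
        back += 1
    }
    return (bytes, [])                                 // empty, ASCII, or malformed tail
}
```

Feeding only `complete` to the decoder and prepending `carry` to the next 64 KB read is what keeps the streamed decode byte-identical to decoding the whole file at once.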
Business rationale
CSV and TSV files are the lowest-friction interchange format for business data, but dumping entire large files into chat is slow, expensive, and liable to overwhelm small local models. A streaming adapter lets agents inspect tabular files with row and byte limits, preserve structure, and progressively work with high-volume data while keeping the harness local and file-fidelity oriented.
Coding rationale
The adapter uses a typed CSVTable representation instead of returning raw strings so future tools can reason about columns, rows, delimiters, and truncation consistently. Parsing and streaming are separated: CSVParser handles dialect-ish row parsing, while CSVStreamer owns capped iteration and preview budgeting. The adapter registers through DocumentFormatRegistry, matching the document stack introduced by #927/#929/#936/#937 rather than adding a one-off parser path. The rebase keeps main's lean folder-tool surface intact and only carries workbook tools from #937 because this branch is still stacked for review.

What changed
- CSVTable as the typed representation for delimited tables.
- CSVParser, CSVStreamer, and CSVAdapter for CSV/TSV parsing, previews, and capped streaming.
- Registration in DocumentAdaptersBootstrap.
- CSVAdapterTests and CSVStreamerTests for delimiter handling, previews, row limits, truncation, and registry integration.

Validation
- git fetch origin && git rebase origin/main - completed after resolving Package.resolved, keeping main's lean folder tool list plus workbook tools only, and dropping stale unrelated CI/TTS/tool-timeout commits.
- swift build --package-path Packages/OsaurusCore - passed.
- swift build --package-path Packages/OsaurusCore -c release - passed.
- swift test --package-path Packages/OsaurusCore - passed, 1493 tests in 201 suites, with sandbox integration tests skipped by their normal environment gate.
- xcrun swift-format lint --strict on every touched Swift file - passed.
- swiftlint lint --strict on every touched Swift file - passed.
- git diff --check origin/main...HEAD - passed.
- Packages/OsaurusCLI.

Non-scope
Residual risks
The parser is intentionally conservative and does not try to recover every malformed CSV edge case. Very large files still need downstream row/column tools for selective operations, and reviewers may see lower-stack commits in the GitHub diff until #927/#929/#936/#937 merge.