Feature Request: workbook metadata, like sheet names, possibly worksheet column metadata. #25

rpbouman · 2025-02-17T14:43:25Z

In addition to the xlsx reader and writer, it would be very useful to have some functions to get some metadata about the workbook.

Things I'd like to know:

worksheet info
- name
- position
- row offset (excel row address where the header row is found)
- worksheet size would also be very useful, even if it's just the number of bytes of the worksheets/sheetXXX.xml file (uncompressed)
column layout that would be read given a set of reader parameters
- column name
- (duckdb) data type
- excel column address (like A, B, etc)
Document Properties

I can imagine several ways in which could be implemented:

in the minimal case, a scalar function that takes the workbook filename as input and returns an array of sheet names (in order of appearance) would do. This would allow any applications to discover the sheet names and then they can do the rest of the discovery themselves using read_xlsx.
While the minimal implementation would already be great, I personally would prefer a table function that takes input parameters for each reader input parameter that would affect the type of the resultset of the reader, and which would return column level metadata for all sheets - kinda like what sniff_csv() does for csv files, but then for all sheets. The result of this table function would be one row for each column for each worksheet, with the sheet-level metadata repeated (denormalized) for each column
As for the document properties, That is not a fixed list, so I think those would best be handled separately. Ideally it should be possible to get them all as a JSON object, preserving all namespace, tag and attribute information, as well as content structure. Alternatively, a list of namespace, tagname, map of attributes (all varchar) and node value (as varchar) would probably already solve most use cases.

aszenz · 2025-02-20T12:55:04Z

In addition to the xlsx reader and writer, it would be very useful to have some functions to get some metadata about the workbook.

Do you some use cases where this information could be useful?, I can imagine this being useful for providing the user information about the file they import/export in a GUI like app.

rpbouman · 2025-02-20T13:26:48Z

Do you some use cases where this information could be useful?

I can imagine this being useful for providing the user information about the file they import/export in a GUI like app.

This is indeed the immediate use case I have in mind. The app is duckdb/wasm powered analysis. The way it works is, the user browses for a file, and right now the only thing I can do is apply the excel reader, which will just read whatever is the default sheet. I don't even know what other sheets there are, and even if I would want to read another sheet, I don't know how because I don't have the name. I would like to present the user with a list of sheets inside their workbook, so they can choose which one to analyze. Or they could choose to make all sheets available for analysis. But either way, I need a way to know what sheets are inside the workbook.

Apart from this use case I can imagine many more, including non-UI driven ones.
One can imagine building an ingestion framework where users can upload xlsx workbooks. To effectively categorize them, metadata like the user that created the workbook, creation date, and any custom properties would be invaluable. Esp. in corporate environments MS Office is setup so it adds all kinds of metadata to office documents which usually ends up as document properties. Without a way to read those the possibilities for making sense of it are greatly curtailed.

Now, I guess it might be possible to use some non-duckdb tooling to do this. But it would really be so much nicer if I could do it all with duckdb.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature Request: workbook metadata, like sheet names, possibly worksheet column metadata. #25

Feature Request: workbook metadata, like sheet names, possibly worksheet column metadata. #25

rpbouman commented Feb 17, 2025 •

edited

Loading

aszenz commented Feb 20, 2025

rpbouman commented Feb 20, 2025

Feature Request: workbook metadata, like sheet names, possibly worksheet column metadata. #25

Feature Request: workbook metadata, like sheet names, possibly worksheet column metadata. #25

Comments

rpbouman commented Feb 17, 2025 • edited Loading

aszenz commented Feb 20, 2025

rpbouman commented Feb 20, 2025

rpbouman commented Feb 17, 2025 •

edited

Loading