You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In addition to the xlsx reader and writer, it would be very useful to have some functions to get some metadata about the workbook.
Things I'd like to know:
worksheet info
name
position
row offset (excel row address where the header row is found)
worksheet size would also be very useful, even if it's just the number of bytes of the worksheets/sheetXXX.xml file (uncompressed)
column layout that would be read given a set of reader parameters
column name
(duckdb) data type
excel column address (like A, B, etc)
Document Properties
I can imagine several ways in which could be implemented:
in the minimal case, a scalar function that takes the workbook filename as input and returns an array of sheet names (in order of appearance) would do. This would allow any applications to discover the sheet names and then they can do the rest of the discovery themselves using read_xlsx.
While the minimal implementation would already be great, I personally would prefer a table function that takes input parameters for each reader input parameter that would affect the type of the resultset of the reader, and which would return column level metadata for all sheets - kinda like what sniff_csv() does for csv files, but then for all sheets. The result of this table function would be one row for each column for each worksheet, with the sheet-level metadata repeated (denormalized) for each column
As for the document properties, That is not a fixed list, so I think those would best be handled separately. Ideally it should be possible to get them all as a JSON object, preserving all namespace, tag and attribute information, as well as content structure. Alternatively, a list of namespace, tagname, map of attributes (all varchar) and node value (as varchar) would probably already solve most use cases.
The text was updated successfully, but these errors were encountered:
In addition to the xlsx reader and writer, it would be very useful to have some functions to get some metadata about the workbook.
Do you some use cases where this information could be useful?, I can imagine this being useful for providing the user information about the file they import/export in a GUI like app.
Do you some use cases where this information could be useful?
I can imagine this being useful for providing the user information about the file they import/export in a GUI like app.
This is indeed the immediate use case I have in mind. The app is duckdb/wasm powered analysis. The way it works is, the user browses for a file, and right now the only thing I can do is apply the excel reader, which will just read whatever is the default sheet. I don't even know what other sheets there are, and even if I would want to read another sheet, I don't know how because I don't have the name. I would like to present the user with a list of sheets inside their workbook, so they can choose which one to analyze. Or they could choose to make all sheets available for analysis. But either way, I need a way to know what sheets are inside the workbook.
Apart from this use case I can imagine many more, including non-UI driven ones.
One can imagine building an ingestion framework where users can upload xlsx workbooks. To effectively categorize them, metadata like the user that created the workbook, creation date, and any custom properties would be invaluable. Esp. in corporate environments MS Office is setup so it adds all kinds of metadata to office documents which usually ends up as document properties. Without a way to read those the possibilities for making sense of it are greatly curtailed.
Now, I guess it might be possible to use some non-duckdb tooling to do this. But it would really be so much nicer if I could do it all with duckdb.
In addition to the xlsx reader and writer, it would be very useful to have some functions to get some metadata about the workbook.
Things I'd like to know:
I can imagine several ways in which could be implemented:
The text was updated successfully, but these errors were encountered: