Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: workbook metadata, like sheet names, possibly worksheet column metadata. #25

Open
rpbouman opened this issue Feb 17, 2025 · 2 comments

Comments

@rpbouman
Copy link

rpbouman commented Feb 17, 2025

In addition to the xlsx reader and writer, it would be very useful to have some functions to get some metadata about the workbook.

Things I'd like to know:

  • worksheet info
    • name
    • position
    • row offset (excel row address where the header row is found)
    • worksheet size would also be very useful, even if it's just the number of bytes of the worksheets/sheetXXX.xml file (uncompressed)
  • column layout that would be read given a set of reader parameters
    • column name
    • (duckdb) data type
    • excel column address (like A, B, etc)
  • Document Properties

I can imagine several ways in which could be implemented:

  • in the minimal case, a scalar function that takes the workbook filename as input and returns an array of sheet names (in order of appearance) would do. This would allow any applications to discover the sheet names and then they can do the rest of the discovery themselves using read_xlsx.
  • While the minimal implementation would already be great, I personally would prefer a table function that takes input parameters for each reader input parameter that would affect the type of the resultset of the reader, and which would return column level metadata for all sheets - kinda like what sniff_csv() does for csv files, but then for all sheets. The result of this table function would be one row for each column for each worksheet, with the sheet-level metadata repeated (denormalized) for each column
  • As for the document properties, That is not a fixed list, so I think those would best be handled separately. Ideally it should be possible to get them all as a JSON object, preserving all namespace, tag and attribute information, as well as content structure. Alternatively, a list of namespace, tagname, map of attributes (all varchar) and node value (as varchar) would probably already solve most use cases.
@aszenz
Copy link

aszenz commented Feb 20, 2025

In addition to the xlsx reader and writer, it would be very useful to have some functions to get some metadata about the workbook.

Do you some use cases where this information could be useful?, I can imagine this being useful for providing the user information about the file they import/export in a GUI like app.

@rpbouman
Copy link
Author

Do you some use cases where this information could be useful?

I can imagine this being useful for providing the user information about the file they import/export in a GUI like app.

This is indeed the immediate use case I have in mind. The app is duckdb/wasm powered analysis. The way it works is, the user browses for a file, and right now the only thing I can do is apply the excel reader, which will just read whatever is the default sheet. I don't even know what other sheets there are, and even if I would want to read another sheet, I don't know how because I don't have the name. I would like to present the user with a list of sheets inside their workbook, so they can choose which one to analyze. Or they could choose to make all sheets available for analysis. But either way, I need a way to know what sheets are inside the workbook.

Apart from this use case I can imagine many more, including non-UI driven ones.
One can imagine building an ingestion framework where users can upload xlsx workbooks. To effectively categorize them, metadata like the user that created the workbook, creation date, and any custom properties would be invaluable. Esp. in corporate environments MS Office is setup so it adds all kinds of metadata to office documents which usually ends up as document properties. Without a way to read those the possibilities for making sense of it are greatly curtailed.

Now, I guess it might be possible to use some non-duckdb tooling to do this. But it would really be so much nicer if I could do it all with duckdb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants