Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support inferring a wider range of column types including INTEGER similar to spatial st_read #13

Open
suhas-dc opened this issue Feb 10, 2025 · 2 comments

Comments

@suhas-dc
Copy link

suhas-dc commented Feb 10, 2025

Currently duckdb-excel uses cell formats for type inferencing. But the cell styles recognized seem to have gaps.

For example, the extension cannot infer the column type to be an INTEGER, despite the cell format indicating 0 decimal places. This is supported when using the spatial extension.

When reading XLSX files, almost everything is read as either DOUBLE or VARCHAR depending on the Excel cell type. However, there are some caveats.

We try to infer TIMESTAMP, TIME, DATE and BOOLEAN types when possible based on the cell format.

Image

Example excel file with a numeric column formatted with 0 Decimal places - test.xlsx

Additional Information

There are also other issues with the cell style based inference, such as with "Date" columns as mentioned in #13 (comment).

@suhas-dc suhas-dc changed the title Support inferring column as INTEGER in read_xlsx Support inferring a wider range of column types including INTEGER Feb 10, 2025
@suhas-dc suhas-dc changed the title Support inferring a wider range of column types including INTEGER Support inferring column as INTEGER in read_xlsx Feb 10, 2025
@Maxxen
Copy link
Member

Maxxen commented Feb 10, 2025

Hello! thanks for opening this issue!
This sounds like a good idea, the only problem I could foresee with this is that Excel stores numbers as doubles under the hood, so in theory you could have integer-looking values that far surpass the range of the INTEGER BIGINT types - so the inference would have to take the cell values into account as well before narrowing a DOUBLE column to a INTEGER/BIGINT.

@suhas-dc suhas-dc changed the title Support inferring column as INTEGER in read_xlsx Support inferring a wider range of column types including INTEGER Feb 10, 2025
@suhas-dc suhas-dc changed the title Support inferring a wider range of column types including INTEGER Support inferring a wider range of column types including INTEGER similar to spatial st_read Feb 10, 2025
@suhas-dc
Copy link
Author

I noticed that if the first row of a Date column is an empty cell with "Date" formatting, duckdb-excel fails to infer the column as Date. But spatial seems to infer Date correctly.

Since you mentioned that parity with spatial's non-sample based inference is a target, I reframed the issue around that.

empty_date.xlsx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants