I tried to fetch a column with TIMESTAMP_NTZ(9) dtype; the maximum datetime in the column is '9999-12-31 00:00:00.000' and the minimum is '1987-01-30 23:59:59.000'.
I get the following error when I select from that column:
File "/home/jwyang/anaconda3/lib/python3.11/site-packages/snowflake/connector/result_batch.py", line 79, in _create_nanoarrow_iterator
else PyArrowTableIterator(
^^^^^^^^^^^^^^^^^^^^^
File "src/snowflake/connector/nanoarrow_cpp/ArrowIterator/nanoarrow_arrow_iterator.pyx", line 239, in snowflake.connector.nanoarrow_arrow_iterator.PyArrowTableIterator.__cinit__
File "pyarrow/table.pxi", line 4116, in pyarrow.lib.Table.from_batches
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 2 was different:
DT: timestamp[us]
vs
DT: timestamp[ns]
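For reference, a minimal sketch of the kind of fetch that hits this; the table and column names are made up and the connection parameters are placeholders:

import snowflake.connector

# Hypothetical table EVENTS with a TIMESTAMP_NTZ(9) column TS whose values
# range from 1987-01-30 up to 9999-12-31.
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()
cur.execute("SELECT * FROM EVENTS")
table = cur.fetch_arrow_all()  # fetch_pandas_all() goes through the same Arrow path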
Because '9999-12-31 00:00:00.000' doesn't fit in an int64 at ns precision, it seems to be downcast to us precision on a per-batch basis in snowflake-connector-python/src/snowflake/connector/nanoarrow_cpp/ArrowIterator/CArrowTableIterator.cpp (Line 562 in 6a2a5b6).
As others have pointed out, our code sees datetime.datetime(9999, 12, 31, 23, 59, 59), realizes that it will not fit into ns precision, and automatically determines that the value can safely be cast down to us precision; Arrow then refuses to mix us and ns precision in the same column.
The real problem is that the data you are requesting from Snowflake cannot be represented in Arrow at the same precision: 9999-12-31 23:59:59.000000000 is technically supported server-side (although such extreme timestamps are not recommended), but it cannot fit into an Arrow nanosecond timestamp.
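To make the precision limit concrete, a quick check in plain Python (no connector involved) shows that a year-9999 timestamp overflows int64 at nanosecond resolution but fits at microsecond resolution:

from datetime import datetime, timezone

INT64_MAX = 2**63 - 1  # Arrow stores timestamps as int64 offsets from the epoch

seconds = (datetime(9999, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
           - datetime(1970, 1, 1, tzinfo=timezone.utc)).total_seconds()

print(int(seconds * 10**9) > INT64_MAX)  # True: does not fit in timestamp[ns]
print(int(seconds * 10**6) < INT64_MAX)  # True: fits in timestamp[us]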
I got around the issue by explicitly casting the column down to us precision, e.g. SELECT id, ts::TIMESTAMP_NTZ(6) FROM table instead of SELECT * FROM table.
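Applied through the connector, the workaround looks something like this (reusing the cursor from the sketch above; names remain hypothetical):

# Casting in SQL keeps every batch at microsecond precision, so the Arrow
# schemas agree across all result batches.
cur.execute("SELECT id, ts::TIMESTAMP_NTZ(6) AS ts FROM EVENTS")
df = cur.fetch_pandas_all()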
Now I agree that automatically trying to fit the data into a lower precision only leads to issues in the long run, as it boxes us into two options (a toy sketch of both follows the list):
1. When we see a single cell that doesn't fit into nanosecond precision, go back and update every row in the current result batch to microsecond precision.
2. Drop the smart downcasting entirely, even when the data could safely be downcast, and throw this exception every time we detect data that cannot be represented.
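A toy pyarrow illustration of the two options, in Python for brevity; this is not the connector's actual C++ code and the data is made up:

from datetime import datetime
import pyarrow as pa

# Values accumulated so far in a result batch, built at nanosecond precision.
ns_so_far = pa.array([datetime(1987, 1, 30), datetime(2024, 1, 1)],
                     type=pa.timestamp("ns"))

# A new value arrives that cannot be represented at nanosecond precision.
too_big = datetime(9999, 12, 31, 23, 59, 59)

# Option 1: retroactively downcast everything accumulated so far.
us_batch = pa.concat_arrays([ns_so_far.cast(pa.timestamp("us")),
                             pa.array([too_big], type=pa.timestamp("us"))])

# Option 2: refuse to downcast and surface the problem to the caller, e.g.
# raise ValueError("timestamp does not fit into nanosecond precision; "
#                  "cast it explicitly, e.g. ts::TIMESTAMP_NTZ(6)")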
I much prefer option 2, as it makes the precision loss explicit and allows users to evaluate whether it is acceptable, or whether some computation needs to be moved into Snowflake before the data is pulled out.
However, it's important to note that both of these options are technically backwards incompatible, so a major version bump will be necessary either way.
Python version
Python 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
Operating system and processor architecture
Linux-5.4.0-165-generic-x86_64-with-glibc2.31
Installed packages
What did you do?
What did you expect to see?
See the description and traceback above. I am guessing the downcast is not applied to all batches, so different batches end up with different data types, which pyarrow does not allow.
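The mismatch is easy to reproduce in isolation; pyarrow refuses to assemble a table from batches whose schemas disagree (a minimal sketch, independent of the connector):

import pyarrow as pa

ns_batch = pa.RecordBatch.from_arrays(
    [pa.array([0], type=pa.timestamp("ns"))], names=["DT"])
us_batch = pa.RecordBatch.from_arrays(
    [pa.array([0], type=pa.timestamp("us"))], names=["DT"])

# Raises pyarrow.lib.ArrowInvalid with a schema mismatch between
# DT: timestamp[ns] and DT: timestamp[us], as in the traceback above.
pa.Table.from_batches([ns_batch, us_batch])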
Can you set logging to DEBUG and collect the logs?
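For reference, DEBUG logging for the connector can be enabled with the standard logging module before the connection is created (a minimal sketch; the log file name is arbitrary):

import logging

logging.basicConfig(filename="snowflake_debug.log", level=logging.DEBUG)
logging.getLogger("snowflake.connector").setLevel(logging.DEBUG)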