-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
64-bit integer values not properly supported #5
Comments
Addressed by #6. Comments welcome. |
First things first, thx for your detailed analysis. First Problem
As you already mentioned, Qlik seems to know only 32 Bit Integers. The problem is that Qlik (at least officially) does not document its QVD format in detail. Many insights have been derived through reverse engineering. Do you have any official sources that document that Qlik uses floats to store 64-bit integers? Although storing 64-bit integers in float could be a suitable workaround, I haven't thought of that yet. I'll check it. For example: elif isinstance(value, IntegerValue):
if value.calculation_value.bit_length() + 1 > 32:
return b"\02" + struct.pack("<d", float(value.calculation_value))
else:
return b"\01" + value.calculation_value.to_bytes(4, byteorder="little", signed=True) Disclaimer: The above example only works to a limited extent, as it can lead to a loss of precision. A 64-bit IEEE float (double) cannot store the entire bandwidth of a 64-bit integer. Second Problem
You're definitely right, this is a critical section. I didn't realize that df.values return a list of rows, while each row is consisting of a numpy array that requires a common data type. This leads to the most common super type. But it wouldn't be any different if you go through column by column, right? Because as far as I know, QVD files allow different data types within a column. But I have to check that. The problem here is, that QVD isn't really column-oriented, or? The QVD file stores the distinct possible values per column in a column-oriented format (Symbol Table) yes, but the actual records are stored in a row-oriented format (Index Table). Also the QvdFileWriter have to persist the data, at least after writing the symbol table, row by row and therefore it should be much more performant to iterate over an row-oriented format than an column-oriented format, right? |
@jamesoliverh I saw that you have already looked into this and implemented a solution with tests for each of my comments.😀 Even without changing the Writer logic. Thanks for that. |
@MuellerConstantin thanks for your detailed and insightful response. Indeed I agree with you about my statement about QV's "column-orientedness". It was likely either a bit strong or completely wrong - I'm not really familiar with QV / QVD files like you are. In any case, I think the column-oriented vectorised approach for btw, for what it's worth I checked (in a single very simple case) that encoding 64-bit ints as Happy for the PR to be reviewed now and receive feedback / make any desired changes. Many thanks in advance! |
The issue was resolved by pull request #6. The change will take effect with the next minor/patch release. The issue will remain open until the release. |
Description
64-bit integer values in dataframes / QvdTables are not properly supported when writing to qvd files, despite such values being supported in QlikView / QlikSense.
To reproduce
Consider the following dataframe containing the largest permissible int32:
This dataframe can be converted to a QvdTable and written to a qvd file just fine:
Consider now the following dataframe containing the largest permissible int32 plus one:
This dataframe can be converted to a QvdTable, but there is now a failure when writing the QvdTable to a qvd file:
The piece of code that fails is attempting to encode the 64-bit integer as 4 bytes. Figures.
Note that as far as I currently understand, in the case of bona fide integers, Qlik indeed only supports 32-bit values. However, it seems that 64-bit integers are stored in Qlik via what is called
DualDoubleValue
s in PyQvd. Therefore they should be treated as such.Discussion
Note that a particular choice in the current implementation of the
from_pandas()
method can obscure this issue in some cases. Consider adding another column to with a float value into the dataframe:This can be written to a qvd file without issue. This is because use of the
.values
attribute in the following line:will coerce the integer value to a float when the dataframe is converted to a numpy array. If any of the columns in the dataframe were object columns, the numpy array will be one of objects, the ints will remain as they are, and the problematic failure to encode will again occur. Imho, this behaviour is inconsistent and a bit weird.
Probably iterating over columns with
.items()
would make more sense:both because it leads to consistent behaviour and is consistent with the column-oriented specification of both pandas and QlikView / QlikSense. But then, given a column-oriented approach, the next obvious step would be to implement vectorisation for an easy speedup.
The text was updated successfully, but these errors were encountered: