64-bit integer values not properly supported #5

jamesoliverh · 2024-12-20T12:27:00Z

Description

64-bit integer values in dataframes / QvdTables are not properly supported when writing to qvd files, despite such values being supported in QlikView / QlikSense.

To reproduce

Consider the following dataframe containing the largest permissible int32:

largest_int32 = 2147483647
df = pd.DataFrame([[largest_int32]], columns=['a']))
df

	a
0	2147483647

This dataframe can be converted to a QvdTable and written to a qvd file just fine:

QvdTable.from_pandas(df).to_qvd('./test.qvd')

Consider now the following dataframe containing the largest permissible int32 plus one:

largest_int32 = 2147483647
df = pd.DataFrame([[largest_int32 + 1]], columns=['a']))
df

	a
0	2147483648

This dataframe can be converted to a QvdTable, but there is now a failure when writing the QvdTable to a qvd file:

QvdTable.from_pandas(df).to_qvd('./test.qvd')

C:\dev/pyqvd\pyqvd\io\writer.py in _get_symbol_byte_representation(self, value)
    452                     str.encode(display_value, encoding="utf-8") + b"\0")
    453         elif isinstance(value, IntegerValue):
--> 454             return b"\01" + value.calculation_value.to_bytes(4, byteorder="little", signed=True)
    455         elif isinstance(value, DoubleValue):
    456             return b"\02" + struct.pack("<d", value.calculation_value)

OverflowError: int too big to convert

The piece of code that fails is attempting to encode the 64-bit integer as 4 bytes. Figures.

Note that as far as I currently understand, in the case of bona fide integers, Qlik indeed only supports 32-bit values. However, it seems that 64-bit integers are stored in Qlik via what is called DualDoubleValues in PyQvd. Therefore they should be treated as such.

Discussion

Note that a particular choice in the current implementation of the from_pandas() method can obscure this issue in some cases. Consider adding another column to with a float value into the dataframe:

df = pd.DataFrame([[largest_int32, 1.0]], columns=['a', 'b'])
df

	a	b
0	2147483647	1.0

This can be written to a qvd file without issue. This is because use of the .values attribute in the following line:

data = [[_get_symbol_from_pandas_value(value) for value in row] for row in df.values]

will coerce the integer value to a float when the dataframe is converted to a numpy array. If any of the columns in the dataframe were object columns, the numpy array will be one of objects, the ints will remain as they are, and the problematic failure to encode will again occur. Imho, this behaviour is inconsistent and a bit weird.

Probably iterating over columns with .items() would make more sense:

data = [[_get_symbol_from_pandas_value(value) for value in col] for (_, col) in df.items()]

both because it leads to consistent behaviour and is consistent with the column-oriented specification of both pandas and QlikView / QlikSense. But then, given a column-oriented approach, the next obvious step would be to implement vectorisation for an easy speedup.

The text was updated successfully, but these errors were encountered:

jamesoliverh · 2024-12-20T13:35:50Z

Addressed by #6.

Comments welcome.

MuellerConstantin · 2024-12-20T13:43:55Z

First things first, thx for your detailed analysis.

First Problem

Note that as far as I currently understand, in the case of bona fide integers, Qlik indeed only supports 32-bit values. However, it seems that 64-bit integers are stored in Qlik via what is called DualDoubleValues in PyQvd. Therefore they should be treated as such.

As you already mentioned, Qlik seems to know only 32 Bit Integers. The problem is that Qlik (at least officially) does not document its QVD format in detail. Many insights have been derived through reverse engineering. Do you have any official sources that document that Qlik uses floats to store 64-bit integers? Although storing 64-bit integers in float could be a suitable workaround, I haven't thought of that yet. I'll check it.

For example:

elif isinstance(value, IntegerValue):
    if value.calculation_value.bit_length() + 1 > 32:
        return b"\02" + struct.pack("<d", float(value.calculation_value))
    else:
        return b"\01" + value.calculation_value.to_bytes(4, byteorder="little", signed=True)

Disclaimer: The above example only works to a limited extent, as it can lead to a loss of precision. A 64-bit IEEE float (double) cannot store the entire bandwidth of a 64-bit integer.

Second Problem

will coerce the integer value to a float when the dataframe is converted to a numpy array. If any of the columns in the dataframe were object columns, the numpy array will be one of objects, the ints will remain as they are, and the problematic failure to encode will again occur. Imho, this behaviour is inconsistent and a bit weird.

You're definitely right, this is a critical section. I didn't realize that df.values return a list of rows, while each row is consisting of a numpy array that requires a common data type. This leads to the most common super type. But it wouldn't be any different if you go through column by column, right? Because as far as I know, QVD files allow different data types within a column. But I have to check that.

The problem here is, that QVD isn't really column-oriented, or? The QVD file stores the distinct possible values per column in a column-oriented format (Symbol Table) yes, but the actual records are stored in a row-oriented format (Index Table). Also the QvdFileWriter have to persist the data, at least after writing the symbol table, row by row and therefore it should be much more performant to iterate over an row-oriented format than an column-oriented format, right?

MuellerConstantin · 2024-12-20T14:13:18Z

@jamesoliverh I saw that you have already looked into this and implemented a solution with tests for each of my comments.😀 Even without changing the Writer logic. Thanks for that.

jamesoliverh · 2024-12-22T13:34:30Z

@MuellerConstantin thanks for your detailed and insightful response.

Indeed I agree with you about my statement about QV's "column-orientedness". It was likely either a bit strong or completely wrong - I'm not really familiar with QV / QVD files like you are.

In any case, I think the column-oriented vectorised approach for from_pandas() in the PR still makes sense, and is optional anyway. Let me know if you disagree.

btw, for what it's worth I checked (in a single very simple case) that encoding 64-bit ints as DualDoubleValues results in the same entry in the QVD file generated by PyQvd as what QlikView itself generates when exporting such a large int to QVD.

Happy for the PR to be reviewed now and receive feedback / make any desired changes. Many thanks in advance!

MuellerConstantin · 2024-12-22T19:29:01Z

The issue was resolved by pull request #6. The change will take effect with the next minor/patch release. The issue will remain open until the release.

jamesoliverh added the bug Something isn't working label Dec 20, 2024

MuellerConstantin self-assigned this Dec 20, 2024

jamesoliverh mentioned this issue Dec 20, 2024

Feature/implement 64 bit ints and vectorization #6

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

64-bit integer values not properly supported #5

64-bit integer values not properly supported #5

jamesoliverh commented Dec 20, 2024

jamesoliverh commented Dec 20, 2024

MuellerConstantin commented Dec 20, 2024

MuellerConstantin commented Dec 20, 2024

jamesoliverh commented Dec 22, 2024 •

edited

Loading

MuellerConstantin commented Dec 22, 2024

64-bit integer values not properly supported #5

64-bit integer values not properly supported #5

Comments

jamesoliverh commented Dec 20, 2024

jamesoliverh commented Dec 20, 2024

MuellerConstantin commented Dec 20, 2024

MuellerConstantin commented Dec 20, 2024

jamesoliverh commented Dec 22, 2024 • edited Loading

MuellerConstantin commented Dec 22, 2024

jamesoliverh commented Dec 22, 2024 •

edited

Loading