trailing null bytes in query result are trimmed when strings_as_bytes=True and use_numpy=True #280

xinstein · 2022-01-07T12:05:54Z

Describe the bug
As the title says

To Reproduce

import pandas as pd
from clickhouse_driver import Client

# Assumption: a ClickHouse server is reachable on localhost
client = Client("localhost")

client.execute("create table testtable (a String) ENGINE=Log()")
client.insert_dataframe(
    "insert into table testtable values",
    pd.DataFrame(
        data={'a': [
            b'\x00\x00\x00\x00\x00\x00\x00\x002\xc0\xcf>\x00\x00\x00\x00',
            b'\x00\x00\x00\x00\x00\x00\x00\x002\xc0\xcf>\x00\x00\x00\x00']}),
    settings={"use_numpy": True},
)
# Same query, with and without the NumPy code path
numpy_result = client.query_dataframe("select * from testtable", settings={'strings_as_bytes': True, "use_numpy": True})
normal_result = client.query_dataframe("select * from testtable", settings={'strings_as_bytes': True, "use_numpy": False})

Expected behavior
numpy_result and normal_result should match

Versions

clickhouse-driver-0.2.2
python 3.7.12

I've dug through the code for quite a while, but the trimming seems to happen somewhere deeper.


xinstein · 2022-01-07T13:55:34Z

This matters because use_numpy makes data fetching 10+ times faster, and in my case fetching accounts for a large portion of the program's running time.

xinstein · 2022-01-09T14:18:58Z

This is expected behaviour in numpy (numpy/numpy#3878), but somewhat unexpected in ClickHouse.

ClickHouse has no raw or bytes type, so any custom serialized object has to use String (as hinted in the ClickHouse documentation).

I've found the deepest place where this numpy behaviour is triggered:

clickhouse-driver/clickhouse_driver/columns/numpy/stringcolumn.py

Line 29 in e66fe4a

return np.array(buf.read_strings(n_items), dtype=self.dtype)


I checked that self.dtype is None, and buf.read_strings(n_items) has trailing zeros retained, but np.array(buf.read_strings(n_items), dtype=self.dtype) has trailing zeros removed.
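The trimming can be reproduced without ClickHouse at all. Here is a minimal sketch, assuming only NumPy; the payloads in `raw` are made up for illustration and stand in for what `buf.read_strings(n_items)` returns:

```python
import numpy as np

# Illustrative payloads (not from the driver): bytes with trailing nulls.
raw = [b"ab\x00\x00", b"\x00cd\x00"]

# dtype=None lets NumPy infer a fixed-width bytes dtype from the input.
arr = np.array(raw, dtype=None)

# Reading elements back drops trailing null bytes; leading ones survive.
first, second = arr[0], arr[1]

print(arr.dtype)
print(len(raw[0]), len(first))
print(raw[1], second)
```

The source list is untouched, so the loss happens at the point where the fixed-width array hands its elements back.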
~~I tried using np.void as dtype as suggested in the numpy issue thread, it seems to solve the issue. But I don't know how to properly replace self.dtype with np.void, if I do I'm glad to fire a PR~~

np.void is not a good solution, since the result behaves differently from plain bytes. Using object does the job, but I'm not sure whether that affects performance. I've raised a PR using object.
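To show the difference between the two options, here is a rough sketch assuming only NumPy. The `raw` payloads are illustrative, and this is not the code from the PR:

```python
import numpy as np

# Illustrative payloads (not from the driver).
raw = [b"ab\x00\x00", b"cd\x00\x00"]

# np.void keeps every byte, but the scalar is not a bytes instance,
# so code expecting plain bytes needs an explicit conversion.
void_value = np.void(raw[0])
void_is_bytes = isinstance(void_value, bytes)
void_roundtrip = void_value.tobytes()

# dtype=object stores references to the original Python bytes objects,
# so trailing nulls are kept and elements are still plain bytes.
obj_arr = np.array(raw, dtype=object)
obj_is_bytes = isinstance(obj_arr[0], bytes)

print(void_is_bytes, void_roundtrip)
print(obj_is_bytes, obj_arr[0])
```

An object array gives up the contiguous fixed-width layout, so the performance question stays open; this sketch does not measure it.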

This was referenced Jan 9, 2022

fixes the trailing zero bytes trimming problem #282

Closed

fixes the trailing zero bytes trimming problem #283

Closed
