
trailing null bytes in query result are trimmed when strings_as_bytes=True and use_numpy=True #280

Open
xinstein opened this issue Jan 7, 2022 · 2 comments


xinstein commented Jan 7, 2022

Describe the bug
As the title says

To Reproduce

import pandas as pd
from clickhouse_driver import Client

# Assumption: a local ClickHouse server; adjust the host as needed.
client = Client("localhost")

client.execute("create table testtable (a String) ENGINE=Log()")
# Both values end in four null bytes.
client.insert_dataframe(
    "insert into table testtable values",
    pd.DataFrame(
        data={'a': [
            b'\x00\x00\x00\x00\x00\x00\x00\x002\xc0\xcf>\x00\x00\x00\x00',
            b'\x00\x00\x00\x00\x00\x00\x00\x002\xc0\xcf>\x00\x00\x00\x00']}),
    settings={"use_numpy": True},
)
# Same query, with and without the numpy code path.
numpy_result = client.query_dataframe("select * from testtable", settings={'strings_as_bytes': True, "use_numpy": True})
normal_result = client.query_dataframe("select * from testtable", settings={'strings_as_bytes': True, "use_numpy": False})

Expected behavior
numpy_result and normal_result should match

Versions

  • clickhouse-driver-0.2.2
  • python 3.7.12

I've dug through the code for quite a while, but the trimming seems to happen somewhere deeper.


xinstein commented Jan 7, 2022

This is important because use_numpy makes data fetching 10+ times faster, and in my case fetching accounts for a large portion of the program's running time.


xinstein commented Jan 9, 2022

This is expected behaviour in numpy (numpy/numpy#3878), but it is somewhat unexpected in ClickHouse.
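The numpy behaviour can be reproduced without ClickHouse at all. The following is a minimal sketch using the byte string from the reproduction above; the variable names are illustrative only. It shows that a fixed-width bytes array keeps the full bytes in its buffer but strips trailing nulls when an element is read back.

```python
import numpy as np

# Same 16-byte value as in the reproduction: the last four bytes are nulls.
raw = b"\x00\x00\x00\x00\x00\x00\x00\x002\xc0\xcf>\x00\x00\x00\x00"

# With dtype=None numpy infers a fixed-width bytes dtype for a list of bytes.
arr = np.array([raw, raw], dtype=None)

# Reading an element back drops the trailing null bytes (leading ones stay).
first = arr[0]

# The underlying buffer still holds every byte of both items.
buffer_bytes = arr.tobytes()

print(arr.dtype, len(raw), len(first), len(buffer_bytes))
```

Only trailing nulls are affected, which is why the eight leading null bytes in the example survive.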

ClickHouse has no raw or bytes type, so any custom serialized object has to be stored as String (as hinted in the ClickHouse documentation).
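To illustrate why this matters for serialized objects, here is a small sketch with a hypothetical fixed-size record layout (the `RECORD` format is an assumption for the example, not something from my real data). Once the trailing nulls are trimmed, the payload is too short to decode.

```python
import struct

import numpy as np

# Hypothetical record layout: int64, float32, int32, little-endian, no padding.
RECORD = struct.Struct("<qfi")

# A trailing integer field of 0 serializes to four null bytes.
payload = RECORD.pack(0, 1.5, 0)

# Round-trip through a numpy bytes array, as the use_numpy path appears to do.
trimmed = bytes(np.array([payload])[0])

try:
    RECORD.unpack(trimmed)
    decoded_ok = True
except struct.error:
    # The trimmed payload no longer has the 16 bytes the layout requires.
    decoded_ok = False

print(len(payload), len(trimmed), decoded_ok)
```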

I've found the deepest place where this numpy behaviour is triggered:

return np.array(buf.read_strings(n_items), dtype=self.dtype)


I checked that self.dtype is None and that buf.read_strings(n_items) retains the trailing zeros, but np.array(buf.read_strings(n_items), dtype=self.dtype) removes them.
I tried np.void as the dtype, as suggested in the numpy issue thread, and it seems to solve the issue. But I don't know how to properly replace self.dtype with np.void; if I find out, I'll be glad to open a PR.

np.void is not a good solution, since the result behaves differently from plain bytes. Using object does the job, but I'm not sure whether that affects performance. I've raised a PR using object.
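The difference between the two options can be sketched in plain numpy. This is an illustration under my own assumptions, not the code from the PR: it builds one array from np.void scalars and one with dtype=object, then compares what comes back out.

```python
import numpy as np

raw = b"\x00\x00\x00\x00\x00\x00\x00\x002\xc0\xcf>\x00\x00\x00\x00"

# Option 1: np.void keeps all 16 bytes, but elements are np.void scalars,
# so callers need .tobytes() to get plain bytes back.
void_arr = np.array([np.void(raw), np.void(raw)])
void_item = void_arr[0]
void_is_bytes = isinstance(void_item, bytes)
void_roundtrip = void_item.tobytes()

# Option 2: dtype=object stores references to the original bytes objects,
# so nothing is trimmed and elements are ordinary Python bytes.
obj_arr = np.array([raw, raw], dtype=object)
obj_item = obj_arr[0]

print(void_arr.dtype, void_is_bytes, type(obj_item).__name__, len(obj_item))
```

The object array gives up the compact fixed-width layout, so it may cost some speed or memory; I have not measured that here.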
