Strings are in base64 encoding after conversion. #15

Open
ciqle opened this issue Aug 15, 2023 · 1 comment · May be fixed by #16

Comments

@ciqle

ciqle commented Aug 15, 2023

  • What's the issue
    In the Parquet files generated by the conversion, strings are encoded in base64. This affects every string field, which is probably not what users intend.
    Take RelationWriteSupport.java as an example:
    memberRoleType = new PrimitiveType(REQUIRED, BINARY, "role");
    Calling this PrimitiveType constructor leaves the field's logicalTypeAnnotation set to null. The Parquet converter therefore knows nothing about the field's actual type and falls back to its default handling for binary, which is base64.

  • How to fix
    To fix this, we can set the logicalTypeAnnotation parameter to the string type. We know the tags are actually strings, so this should be safe, and the Parquet converter will then be aware that the field is a string and convert it as UTF-8 instead of base64.
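To make the symptom concrete, here is a minimal sketch using only the Java standard library (no Parquet classes involved). It shows the same stored bytes rendered two ways: as base64, which is what a reader falls back to when it only knows the column is binary, and as UTF-8 text, which is what we want once the column is marked as a string. The class name, method names, and the sample value "highway" are made up for illustration and do not come from osm-parquetizer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BinaryRendering {

    // Rendering used when the reader has no hint that the bytes are text.
    public static String renderAsBase64(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Rendering used when the reader knows the bytes are a UTF-8 string.
    public static String renderAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical tag key, stored as raw UTF-8 bytes in a BINARY column.
        byte[] stored = "highway".getBytes(StandardCharsets.UTF_8);

        System.out.println(renderAsBase64(stored)); // aGlnaHdheQ==
        System.out.println(renderAsUtf8(stored));   // highway
    }
}
```

The bytes on disk are identical in both cases; only the interpretation differs, which is why adding the annotation should be enough. In parquet-java the annotation would presumably come from LogicalTypeAnnotation.stringType(), but how it gets wired into RelationWriteSupport.java is up to the linked PR.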

@jimmyzzxhlh

+1 on this. After running osm-parquetizer, the keys and values in the tag column currently look like the following:
image
