-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2474: Add FIXED_SIZE_LIST logical type #241
base: master
Are you sure you want to change the base?
Changes from 4 commits
41fca3f
4f12dd3
cb93b27
83481f6
471efc3
77651fd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -256,6 +256,24 @@ The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`. | |
|
||
The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`. | ||
|
||
### FIXED_SIZE_LIST | ||
|
||
The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements | ||
of a primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive type. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As written, the elements can themselves be arrays. Is this intended? Or should it be "non-array primitive data type"? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I didn't really consider the possibility of elements being arrays and I think non-array limitation makes sense. Changed to: The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
of a non-array primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive type. |
||
|
||
The `FIXED_LEN_BYTE_ARRAY` data is interpreted as a fixed size sequence of | ||
elements of the same primitive data type. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should the encoding be defined as well, for instance the elements of the array are encoded in the same manner as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, that seems like a thing to specify. Changed to: The `FIXED_LEN_BYTE_ARRAY` data is interpreted as a fixed size sequence of
elements of the same primitive data type encoded with plain encoding. |
||
|
||
The sort order used for `FIXED_SIZE_LIST` is undefined. | ||
|
||
### VARIABLE_SIZE_LIST | ||
rok marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
The `VARIABLE_SIZE_LIST` annotation represents a variable-size list of elements | ||
of a primitive data type. It must annotate a `BYTE_ARRAY` primitive type. | ||
|
||
The `BYTE_ARRAY` data is interpreted as a variable size sequence of elements of | ||
the same primitive data type. | ||
|
||
## Temporal Types | ||
|
||
### DATE | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting choice to annotate a binary primitive field instead of a repeated group field. I see pros and cons with this design:
PROs:
CONs:
I think the PROs outweigh the CONs here, so I think this is fine with me. I just want everyone to be aware about the ramifications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @tustvold, as you also brought up this point. I agree that having a new property of a repeated group would be more flexible, but it also comes at some cost, as outlined above. Also, it couldn't be just a logical type in this case, as a logical type cannot change the handling of R-Levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now feeling that maybe wrapping a
Vector[PrimitiveType, Size]
is also ok, but currently representing this is a bitweird in the model. May I ask would aVector
having data below?And would vector contains a "nested" vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main reason I'd like to propose this type, see apache/arrow#34510.
Lack of composability is a downside, but I think it's still worth the compromise. I've not seen need for fixed_size_list(struct) in tensor computing, but that's probably just because it's not available.
In tensor computation this is usually addressed with bitmasks, which can be stored as a
fixed_size_list(binary, num_values)
.Perhaps we should call this type
FixedSizeArray
to disambiguate?I think case 2. is ok, but case 1. should be expressed with a separate null bitmask that's not part of the type.