PARQUET-2474: Add FIXED_SIZE_LIST logical type #241

18 changes: 18 additions & 0 deletions LogicalTypes.md
@@ -256,6 +256,24 @@ The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.

The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.

### FIXED_SIZE_LIST
Contributor

@JFinis JFinis May 16, 2024

Interesting choice to annotate a binary primitive field instead of a repeated group field. I see pros and cons with this design:

PROs:

  • Guarantees zero-copy, as the layout is defined to be just bytes. In contrast, if this annotated a group, a writer could decide to use a fancy per-value encoding (e.g., dictionary) and thus create a list that first has to be "decoded" before it can be used.
  • Guarantees that a list is always contained on one page instead of being split over multiple pages. Again, this helps in keeping decoders easy and guaranteeing zero copy.
  • This solves the problem of redundant R-Levels. Since it's just a primitive column, no r-level considerations have to be taken into account.

CONs:

  • Cannot create fixed size lists of nested types (e.g., list of structs). I see that this isn't necessary for tensors or embedding vectors, but shouldn't the feature be extensible for other scenarios as well? This limits the composability of the feature. I can now create a struct of fixed size lists, but not a fixed size list of structs.
  • Cannot have null elements in fixed size lists. This might not be desired for all lists, but there can be use cases where having null values in them is preferable.
  • Parquet has a concept for (non-fixed size) lists. It is conceptually weird that fixed size lists are totally different from (non-fixed size) lists.

I think the PROs outweigh the CONs here, so this is fine with me. I just want everyone to be aware of the ramifications.
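
To make the zero-copy argument above concrete, here is a minimal Python sketch. It assumes a hypothetical `FIXED_LEN_BYTE_ARRAY(12)` column whose values each hold three little-endian `FLOAT` elements laid out back to back; the names `page_bytes` and `view_row` and the sample data are invented for illustration and are not part of the proposal or of any Parquet library.

```python
import struct
import sys

NUM_VALUES = 3                  # hypothetical num_values of the annotation
WIDTH = 4                       # byte width of one FLOAT element
ROW_BYTES = NUM_VALUES * WIDTH  # type_length of the FIXED_LEN_BYTE_ARRAY

# Two rows as they might sit back to back in a page (assumed layout).
page_bytes = struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)


def view_row(buf, row):
    """Return a float view over one row without copying the underlying bytes."""
    start = row * ROW_BYTES
    return memoryview(buf)[start:start + ROW_BYTES].cast("f")


# memoryview.cast() uses native byte order, so only reinterpret on
# little-endian hosts; other hosts would need a byte swap (and thus a copy).
if sys.byteorder == "little":
    print(view_row(page_bytes, 1).tolist())  # [4.0, 5.0, 6.0]
```

With a group-annotated list, a reader would first have to decode values and repetition levels before it could hand out such a view.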

Contributor

cc @tustvold, as you also brought up this point. I agree that having a new property of a repeated group would be more flexible, but it also comes at some cost, as outlined above. Also, it couldn't be just a logical type in this case, as a logical type cannot change the handling of R-Levels.

Member

I'm now feeling that maybe wrapping a Vector[PrimitiveType, Size] is also ok, but currently representing this is a bit weird in the model. May I ask whether a Vector could hold data like the following?

1. [1, 1, 1], [null, 1, 1] <-- data with null
2. null, [1, 1, 1] <-- null vector

And could a vector contain a "nested" vector?

Member Author

> • This solves the problem of redundant R-Levels. Since it's just a primitive column, no r-level considerations have to be taken into account.

This is the main reason I'd like to propose this type, see apache/arrow#34510.

> • Cannot create fixed size lists of nested types (e.g., list of structs). I see that this isn't necessary for tensors or embedding vectors, but shouldn't the feature be extensible for other scenarios as well? This limits the composability of the feature. I can now create a struct of fixed size lists, but not a fixed size list of structs.

Lack of composability is a downside, but I think it's still worth the compromise. I've not seen a need for fixed_size_list(struct) in tensor computing, but that's probably just because it's not available.

> • Cannot have null elements in fixed size lists. This might not be desired for all lists, but there can be use cases where having null values in them is preferable.

In tensor computation this is usually addressed with bitmasks, which can be stored as a fixed_size_list(binary, num_values).

> • Parquet has a concept for (non-fixed size) lists. It is conceptually weird that fixed size lists are totally different from (non-fixed size) lists.

Perhaps we should call this type FixedSizeArray to disambiguate?

> I'm now feeling that maybe wrapping a Vector[PrimitiveType, Size] is also ok, but currently representing this is a bit weird in the model. May I ask whether a Vector could hold data like the following?
>
> 1. [1, 1, 1], [null, 1, 1] <-- data with null
> 2. null, [1, 1, 1] <-- null vector
>
> And could a vector contain a "nested" vector?

I think case 2. is ok, but case 1. should be expressed with a separate null bitmask that's not part of the type.
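
A minimal sketch of the separate-bitmask idea for case 1, in Python. The bit order (least significant bit first, one bit per element, 1 = present), the placeholder value `0`, and the helper names `pack_validity` and `is_valid` are assumptions made for illustration; the proposal itself does not define a mask layout.

```python
def pack_validity(values):
    """Pack one validity bit per element (1 = present), least significant bit first."""
    mask = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            mask[i // 8] |= 1 << (i % 8)
    return bytes(mask)


def is_valid(mask, i):
    """Check whether element i is marked present in the mask."""
    return bool(mask[i // 8] >> (i % 8) & 1)


row = [None, 1, 1]
mask = pack_validity(row)                     # would live in its own column
data = [0 if v is None else v for v in row]   # null slot filled with a placeholder
print(mask.hex(), data)                       # 06 [0, 1, 1]
```

The data column stays a dense, fixed-width sequence, and readers that do not care about nulls can ignore the mask column entirely.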


The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
of a primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive type.
Contributor

As written, the elements can themselves be arrays. Is this intended? Or should it be "non-array primitive data type"?

Member Author

I didn't really consider the possibility of elements being arrays, and I think the non-array limitation makes sense. Changed to:

The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
of a non-array primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive type.


The `FIXED_LEN_BYTE_ARRAY` data is interpreted as a fixed size sequence of
elements of the same primitive data type.
Contributor

Should the encoding be defined as well, for instance the elements of the array are encoded in the same manner as PLAIN encoding?

Member Author

Yes, that seems like a thing to specify. Changed to:

The `FIXED_LEN_BYTE_ARRAY` data is interpreted as a fixed size sequence of
elements of the same primitive data type encoded with plain encoding.


The sort order used for `FIXED_SIZE_LIST` is undefined.
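
As an illustration of the wording discussed above, the following Python sketch round-trips a fixed-size list through a byte string, assuming the elements are laid out as in PLAIN encoding (fixed-width, little-endian) and that the `FIXED_LEN_BYTE_ARRAY` length equals `num_values * width of type`. The `PLAIN` table and both function names are hypothetical helpers, not an existing API.

```python
import struct

# struct format code and byte width for a few fixed-width physical types.
PLAIN = {"INT32": ("i", 4), "INT64": ("q", 8), "FLOAT": ("f", 4), "DOUBLE": ("d", 8)}


def encode_fixed_size_list(values, element_type, num_values):
    """Serialize exactly num_values elements as little-endian fixed-width bytes."""
    if len(values) != num_values:
        raise ValueError(f"expected {num_values} elements, got {len(values)}")
    code, _ = PLAIN[element_type]
    return struct.pack(f"<{num_values}{code}", *values)


def decode_fixed_size_list(data, element_type, num_values):
    """Interpret a FIXED_LEN_BYTE_ARRAY value as num_values elements."""
    code, width = PLAIN[element_type]
    if len(data) != num_values * width:
        raise ValueError(f"expected {num_values * width} bytes, got {len(data)}")
    return list(struct.unpack(f"<{num_values}{code}", data))


raw = encode_fixed_size_list([1, 2, 3], "INT32", 3)
print(raw.hex())                                # 010000000200000003000000
print(decode_fixed_size_list(raw, "INT32", 3))  # [1, 2, 3]
```

The length check mirrors the constraint that a reader could validate against the schema before touching any data.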

### VARIABLE_SIZE_LIST
rok marked this conversation as resolved.

The `VARIABLE_SIZE_LIST` annotation represents a variable-size list of elements
of a primitive data type. It must annotate a `BYTE_ARRAY` primitive type.

The `BYTE_ARRAY` data is interpreted as a variable size sequence of elements of
the same primitive data type.
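
For the variable-size case, the element count is not stored in the type, so a reader would presumably derive it from the value's byte length. The Python sketch below assumes fixed-width, little-endian elements (as in PLAIN encoding, which the text above does not state for this type); `decode_variable_size_list` and its parameters are hypothetical.

```python
import struct


def decode_variable_size_list(data, code="d", width=8):
    """Interpret a BYTE_ARRAY value as len(data) // width elements.

    code/width default to DOUBLE; both are illustrative parameters.
    """
    if len(data) % width != 0:
        raise ValueError(f"{len(data)} bytes is not a multiple of width {width}")
    count = len(data) // width
    return list(struct.unpack(f"<{count}{code}", data))


raw = struct.pack("<2d", 0.5, 1.5)
print(decode_variable_size_list(raw))   # [0.5, 1.5]
print(decode_variable_size_list(b""))   # []
```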

## Temporal Types

### DATE
22 changes: 16 additions & 6 deletions src/main/thrift/parquet.thrift
@@ -289,6 +289,13 @@ struct ListType {} // see LogicalTypes.md
struct EnumType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8
struct DateType {} // allowed for INT32
struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
struct FixedSizeListType { // allowed for FIXED_LEN_BYTE_ARRAY[num_values * width of type],
  1: required Type type; // see LogicalTypes.md
  2: required i32 num_values;
}
struct VariableSizeListType { // allowed for BYTE_ARRAY, see LogicalTypes.md
  1: required Type type;
}

/**
* Logical type to annotate a column that is always null.
@@ -397,12 +404,15 @@ union LogicalType {
   8: TimestampType TIMESTAMP

   // 9: reserved for INTERVAL
-  10: IntType INTEGER // use ConvertedType INT_* or UINT_*
-  11: NullType UNKNOWN // no compatible ConvertedType
-  12: JsonType JSON // use ConvertedType JSON
-  13: BsonType BSON // use ConvertedType BSON
-  14: UUIDType UUID // no compatible ConvertedType
-  15: Float16Type FLOAT16 // no compatible ConvertedType
+  10: IntType INTEGER // use ConvertedType INT_* or UINT_*
+  11: NullType UNKNOWN // no compatible ConvertedType
+  12: JsonType JSON // use ConvertedType JSON
+  13: BsonType BSON // use ConvertedType BSON
+  14: UUIDType UUID // no compatible ConvertedType
+  15: Float16Type FLOAT16 // no compatible ConvertedType
+  // 16: reserved for GEOMETRY
+  17: FixedSizeListType FIXED_SIZE_LIST // no compatible ConvertedType
+  18: VariableSizeListType VARIABLE_SIZE_LIST // no compatible ConvertedType
 }

/**