-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-756: Add Union Logical type #44
base: master
Are you sure you want to change the base?
Conversation
* A Union type | ||
* This type annotates data stored as a Group | ||
* this shows the intent to have heterogenous types under the same field name | ||
* the names of the fields in the annotated Group are not important in such a case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should include a few more details for implementing the Union type. For example, all fields of a union group must be optional and for each value, writers must set all but one option/field to null. It should also state that if more than one option is set, only one will be returned and which one is implementation-specific (because this is affected by column projection).
As we talked about in the sync-up, we should state that whether a union value can be null is determined by the union group's repetition: optional union groups can contain null and required union groups cannot. That way, we can always distinguish between a null value in the union (group level is null) and a non-null is a union option that wasn't projected.
Could you also update the logical types documentation?
@rdblue updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should clarify the handling of null values a bit more. Otherwise this is looking good.
A Union can not contain null but can be null itself if in an optional field. | ||
|
||
// Union<String, Integer, Boolean> (nullable union of either String, Integer or Boolean) | ||
optional group my_union (Union) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Union should be UNION
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will fix
If more than one is defined the behavior is undefined and may changed depending on the projection applied. | ||
A Union can not contain null but can be null itself if in an optional field. | ||
|
||
// Union<String, Integer, Boolean> (nullable union of either String, Integer or Boolean) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more clear to use "Union of null, String, Integer, or Boolean" because the intent is not to have a Union container that may itself be null, although that is how it is stored. If we're going with Java-ish types for clarity, then maybe it's more accurate to use "Void | String | Integer | Boolean".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unlike Avro, in Parquet Null is not a type. Just like Avro Records do not have optional fields.
So I think that this spec should not treat Null as a type.
It is clearer to have one-to-one mapping between the types here and the fields in the Group since this describes how it is stored.
If we want to clarify this, I can add the following:
Mapping to Avro types:
- an Avro Union that contains Null and at least 2 other types will map to an optional Parquet Union (of the remaining types).
- an Avro Union that does not contain null will map to a required Parquet Union.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not suggesting we treat null as a type. I just want this to be clear that the group is not exposed. If the group level is defined, then one (and only one) branch is non-null and that is the value. If the group level is not defined, then the value is null.
optional boolean bool; | ||
} | ||
|
||
// Union<String, Integer, Boolean> (required union of either String, Integer or Boolean) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd change the wording here, too. The union isn't required, one of the branches is required to be non-null.
* The names of the fields in the annotated Group are not important in such a case. | ||
* All fields of the Group must be optional and exactly one is defined for each instance of the group. | ||
* If more than one is defined the behavior is undefined and may changed depending on the projection applied. | ||
* A Union can not contain null but can be null itself if it's an optional field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this say "Union groups that are required must contain at least one non-null field. Union groups that are optional can be null, which is used to encode the case where none of the branches of the union are non-null."? I want to be very clear how the union group's definition level is used to encode a null value, and how implementations should use that bit to distinguish between a null union value and a missing non-null branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why "at least one non-null field" ? It should be exactly one non-null field.
"the case where none of the branches of the union are non-null." => the case where all of the branches of the union are null.
I will clarify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, it should be one non-null field, not at least one.
@rdblue I have updated the spec per your comments. |
|
||
- If the union is nullable then at most one field is non-null and the field containing the union is optional | ||
``` | ||
// Optional<Union<String, Integer, Boolean>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more clear if this was Union<String, Integer, Boolean>
and noted "the value of the union may be null" rather than adding the Optional
level. That makes it clear that it isn't actually wrapped, it is just that the value can be null in this case and that is encoded by the definition level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
optional boolean bool; | ||
} | ||
``` | ||
The union field is used to differentiate a null value (the field was null to start with) from a projection that excludes the non-null field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about "the definition level of the UNION group is used to differentiate a null value (the union was null to start with) . . .". I'd just avoid using the term "field" to refer to the whole group for clarity.
Same thing in the next couple of sentences, I think it is more clear to remove "field" and refer to whether the group is optional or required: "If the union group is null, then the value was null" / "If the union group is non-null, but all of the options within it are null, then the value was non-null but was an option that was not projected."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
I think we might need more discussion of how we want projects into unions to work. I don't think all object models can return an "empty" union (eg thrift). Additionally, because we don't store group data, only data for each primitive column, in practice in order to do projections into unions you have to do things like pick an arbitrary child field of the union and use it's definition levels to figure out if the union "wing" is null or not. We've got a lot of these workarounds in the thrift implementation, but it'd be good to iron out how we really want that to work. Another option would be to add a push down filter for any "branch" of a union that hasn't been selected in a projection. |
@isnotinvain, is the solution to the empty union simply that thrift can't project a subset of the union's branches? I don't really like the idea of filtering rows that don't have non-null values, but I do see the problem. What about requiring the repetition of a union to be |
I'm not sure if I'm being clear, let me use an example of the issues we've seen with thrift union support.
And now, the user uses projection to select only the columns One option is to return an 'empty' Dog. This has two issues:
Does that make sense? |
Oh, and, that's why filtering the "unknowns" (the is this a dog or a turtle case) away sort of makes sense? Like, the user didn't ask for any dog/turtle columns so I guess they don't want those records? (probably more confusing than helpful) |
One more thing I should probably clarify -- this would be pretty easy if thrift represented unions the way they look in the above schema definition, we could just return something like:
The problem in thrift is there will be no Animal concrete class. Just an interface called |
When selecting Cat in your example, we shouldn't create an empty Dog or Turtle, but instead return null because Cat was null. That's why I think we may want to require the union to be optional when projecting some of its branches: we will need to fill in with null for branches that aren't projected. For Thrift -- and Avro when there is no NULL option -- I think the consequence is not allowing the user to project a subset of the union's branches because it violates the contract of the requested schema to return null. It is better to fail early in this projection case than to fake the other branches. Like you said, we have a problem if we don't know which other union branch was non-null. We also can't add a filter. Say I want to know the percentage of PetOwners in California that have cats. Then I filter PetOwner on state = 'CA' and project pet.cat. The null values (not cats) are relevant. In terms of the spec for a UNION logical type, I think we end up with this: "When projecting a proper subset of the union's branches, the union itself must be optional. The union value should be null for branches that are not projected." |
Where do we want to document this?
|
I rebased and addressed @rdblue latest feedback before the union projection discussion. |
Implications in the object model should go in each model. Avro should document that it can't project unless the requested schema includes a NULL branch. However, the overall requirement that Parquet won't project a subset of the branches unless the union group is optional should be documented in the spec since that's a general requirement for how Parquet projection works. |
@rdblue Agreed except for the last bit: "Parquet won't project a subset of the branches unless the union group is optional" |
@rdblue but this constraint can be defined in avro and thrift. Just not as a global rule. |
I think you're right. I wasn't thinking that it was an artificial constraint, but there's no reason for Parquet to require this because its structure is clear. |
I'm not at a computer now but I will reply tonight. We can't return NULL On Friday, November 4, 2016, Ryan Blue [email protected] wrote:
|
I think it's for the object model to decide how to handle the projection, but that null is a reasonable option. The object model could expose the union as individual branches, or could create objects like Dog and Turtle with no fields, or have a "not project" sentinel. It's really up to the model implementers to determine. I think for Avro, we'll go with null and document that null will be returned when non-null branches aren't projected. |
@rdblue @isnotinvain I have updated the doc with more details about projecting unions.
thank you. |
I still have some reservations about this. I think an expected contract (but maybe not an explicit contract) that we have is that changing your projection shouldn't change your results, only your efficiency? Lets say a user's query is something like:
If the users selects all the columns, they'll get an accurate count of dogs and cats, and others. This seems pretty surprising to me, and is why in the thrift integration we went through a lot of hoops to try to avoid it. I think if parquet is going to support unions as a first class concept, instead of pushing these complicated decisions to each object model, it'd be nice if parquet could handle this for us for all object models. We could for example, write in parquet-core an efficient way to read on the definition levels of one child of each union branch, so that we can tell each object model which wing of the union an object belongs to. Then the object model can do w/e it wants with that info, but I think handeling this belongs in parquet-core ideally, and isn't super difficult nor inefficient (assuming we can read definition levels only). |
Another thing to consider about adding first class support for unions is efficiency. If we take advantage of knowing about unions in parquet-core, the read state machine / record assembly could skip a huge amount of asking column readers "are you null?" when it knows that they will all be null due to the fact that they are children of a union and that only one of the branches will ever contain non-nulls. Does that make sense? |
``` | ||
// Union<String, Integer, Boolean> (where the value of the union may be null) | ||
// at most one of either String, Integer or Boolean is non-null | ||
// if they are all null then the field my_union itself must be null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is already stored in the definition levels of each branch as well though right? So there's no need to check even.
``` | ||
The definition level of the UNION group is used to differentiate a null value (the union was null to start with) from a projection that excludes the non-null field. | ||
If the Union group is null then the value was null. | ||
If the Union group is non-null, but all of the options within it are null, then the value was non-null but was an option that was not projected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I understand now, yes we can use this to tell the difference, but unfortunately we can't really tell the user what kind of branch this was, which I don't think is great.
If the Union group is null then the value was null. | ||
If the Union group is non-null, but all of the options within it are null, then the value was non-null but was an option that was not projected. | ||
|
||
- If - despite the spec - a group instance contains more than one non-null field the behavior is undefined and may change depending on the projection applied. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we define this as it will throw an exception in parquet-core? It should be completely detectable, any reason to leave it undefined instead of explicitly fatal?
This PR implements 1) PARQUET-505: Column reader should automatically handle large data pages 2) Adds support for Serialization 3) Test case for Serialization and Deserialization 4) Test case for SerializedPageReader and PARQUET-505 Author: Deepak Majeti <[email protected]> Closes apache#44 from majetideepak/PARQUET-505 and squashes the following commits: 4f754ba [Deepak Majeti] changed type of page header size defaults 4345812 [Deepak Majeti] PARQUET-505: Column reader should automatically handle large data pages
@rdblue @julienledem We're looking to use union type in my company, and I found this JIRA. Wondering the status of this PR and why it's not merged in the end? Thanks! |
@dbtsai, the problem with Union is that its behavior isn't well-defined. It is difficult to decide what is correct, and support in processing engines is bad. I don't think it is a good idea to add them when you can instead get predictable behavior using optional objects. (Also, see the discussion on the Iceberg spec.) |
Hi @rdblue @julienledem, I'm working with @dbtsai on this feature. We are scoping out support of UNION through AVRO -> Spark -> Parquet and its interesting to read this spec and the concerns that come out of it.
Please let me know your thoughts! |
I'd be interested in having unions in the Parquet format. It would have to annotate a STRUCT having a first field that is an INT32 indicating which of the subsequent fields should be selected for each row in the dataset. Other fields can simply be null when they are unselected in the union |
I have not been active on this recently. If someone wants to push this to the finish line they should feel free to take over this PR |
What needs to be done to push it over the finish line? |
Hello @julienledem @xhochy @isnotinvain @rdblue, |
Any estimate of when this might merge? In general, having Union types available would be very useful, especially coming from libraries that are Rust-based, as the Rust enum type with variants containing data is very common, and right now none of the crates leveraging parquet can write out structures containing enums. They all link to apache/arrow-rs#73 which then links here. If there is anything I can do to assist in getting this feature merged, I would be happy to help. |
@Bernolt @raj-nimble Sorry for late reply. It seems this proposal has been sleeping for years. As a convention, the Parquet community requires a formal vote on the [email protected] with two reference implementations (parquet-java and another, usually parquet-cpp) to move forward. |
No description provided.