PARQUET-756: Add Union Logical type #44

julienledem · 2016-10-26T18:39:04Z

No description provided.

rdblue · 2016-10-27T19:21:20Z

src/main/thrift/parquet.thrift

+   * A Union type
+   * This type annotates data stored as a Group
+   * this shows the intent to have heterogenous types under the same field name
+   * the names of the fields in the annotated Group are not important in such a case


I think this should include a few more details for implementing the Union type. For example, all fields of a union group must be optional and for each value, writers must set all but one option/field to null. It should also state that if more than one option is set, only one will be returned and which one is implementation-specific (because this is affected by column projection).

As we talked about in the sync-up, we should state that whether a union value can be null is determined by the union group's repetition: optional union groups can contain null and required union groups cannot. That way, we can always distinguish between a null value in the union (group level is null) and a non-null value in a union option that wasn't projected.
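To make the distinction above concrete, here is a minimal sketch of how a reader could classify a union value. The class `UnionReadState`, its `Kind` enum, and the `classify` method are hypothetical names for illustration only, not part of parquet-java; the sketch assumes the reader already knows from the definition level whether the union group itself is non-null.

```java
import java.util.List;

/** Hypothetical helper: classifies one union value as read back through a projection. */
final class UnionReadState {
    enum Kind { NULL_VALUE, BRANCH_VALUE, UNPROJECTED_BRANCH }

    /**
     * @param groupDefined    true if the definition level says the union group itself is non-null
     * @param projectedValues values of the projected branches only (null means "branch not set")
     */
    static Kind classify(boolean groupDefined, List<Object> projectedValues) {
        if (!groupDefined) {
            return Kind.NULL_VALUE;          // the union value itself was null
        }
        for (Object value : projectedValues) {
            if (value != null) {
                return Kind.BRANCH_VALUE;    // one of the projected branches carries the value
            }
        }
        return Kind.UNPROJECTED_BRANCH;      // non-null, but stored in a branch that was not read
    }
}
```

Under these assumptions, a required union group can only ever produce `BRANCH_VALUE` or `UNPROJECTED_BRANCH`, while an optional one can also produce `NULL_VALUE`.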

Could you also update the logical types documentation?

julienledem · 2016-10-28T01:47:55Z

@rdblue updated

rdblue

I think we should clarify the handling of null values a bit more. Otherwise this is looking good.

rdblue · 2016-10-28T17:31:10Z

LogicalTypes.md

+A Union can not contain null but can be null itself if in an optional field.
+
+// Union<String, Integer, Boolean> (nullable union of either String, Integer or Boolean)
+optional group my_union (Union) {


Nit: Union should be UNION

rdblue · 2016-10-28T17:33:59Z

LogicalTypes.md

+If more than one is defined the behavior is undefined and may changed depending on the projection applied.
+A Union can not contain null but can be null itself if in an optional field.
+
+// Union<String, Integer, Boolean> (nullable union of either String, Integer or Boolean)


I think it would be more clear to use "Union of null, String, Integer, or Boolean" because the intent is not to have a Union container that may itself be null, although that is how it is stored. If we're going with Java-ish types for clarity, then maybe it's more accurate to use "Void | String | Integer | Boolean".

Unlike in Avro, Null is not a type in Parquet, just as Avro records do not have optional fields.
So I think that this spec should not treat Null as a type.
It is clearer to have one-to-one mapping between the types here and the fields in the Group since this describes how it is stored.

If we want to clarify this, I can add the following:
Mapping to Avro types:

an Avro Union that contains Null and at least 2 other types will map to an optional Parquet Union (of the remaining types).

an Avro Union that does not contain null will map to a required Parquet Union.
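The two mapping rules above can be sketched as follows. `AvroUnionMapping` and `fromAvroBranches` are hypothetical names, branch types are represented as plain strings rather than real Avro schemas, and the case of fewer than two non-null branches is rejected because the comment does not cover it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the proposed Avro-to-Parquet union mapping. */
final class AvroUnionMapping {
    final String repetition;      // "optional" or "required" for the Parquet UNION group
    final List<String> branches;  // the non-null Avro branches, in their original order

    private AvroUnionMapping(String repetition, List<String> branches) {
        this.repetition = repetition;
        this.branches = branches;
    }

    static AvroUnionMapping fromAvroBranches(List<String> avroBranches) {
        List<String> nonNull = new ArrayList<>();
        boolean hasNull = false;
        for (String branch : avroBranches) {
            if ("null".equals(branch)) {
                hasNull = true;           // null is not a Parquet type, so it is dropped
            } else {
                nonNull.add(branch);
            }
        }
        if (nonNull.size() < 2) {
            // Assumption: this case is outside the rules stated in the thread.
            throw new IllegalArgumentException("fewer than two non-null branches");
        }
        // A null branch turns into the repetition of the union group itself.
        return new AvroUnionMapping(hasNull ? "optional" : "required", nonNull);
    }
}
```

For example, the Avro branches `null, string, int` would map to an optional union of `string` and `int`, and `string, int, boolean` would map to a required union of all three.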

I'm not suggesting we treat null as a type. I just want this to be clear that the group is not exposed. If the group level is defined, then one (and only one) branch is non-null and that is the value. If the group level is not defined, then the value is null.

rdblue · 2016-10-28T17:34:34Z

LogicalTypes.md

+  optional boolean bool;
+}
+
+// Union<String, Integer, Boolean> (required union of either String, Integer or Boolean)


I'd change the wording here, too. The union isn't required, one of the branches is required to be non-null.

rdblue · 2016-10-28T17:37:57Z

src/main/thrift/parquet.thrift

+   * The names of the fields in the annotated Group are not important in such a case.
+   * All fields of the Group must be optional and exactly one is defined for each instance of the group.
+   * If more than one is defined the behavior is undefined and may changed depending on the projection applied.
+   * A Union can not contain null but can be null itself if it's an optional field.


Could this say "Union groups that are required must contain at least one non-null field. Union groups that are optional can be null, which is used to encode the case where none of the branches of the union are non-null."? I want to be very clear how the union group's definition level is used to encode a null value, and how implementations should use that bit to distinguish between a null union value and a missing non-null branch.

Why "at least one non-null field" ? It should be exactly one non-null field.
"the case where none of the branches of the union are non-null." => the case where all of the branches of the union are null.
I will clarify.

You're right, it should be one non-null field, not at least one.

julienledem · 2016-11-01T23:00:59Z

@rdblue I have updated the spec per your comments.

rdblue · 2016-11-02T17:08:46Z

LogicalTypes.md

+
+ - If the union is nullable then at most one field is non-null and the field containing the union is optional
+```
+// Optional<Union<String, Integer, Boolean>>


I think it would be more clear if this was Union<String, Integer, Boolean> and noted "the value of the union may be null" rather than adding the Optional level. That makes it clear that it isn't actually wrapped, it is just that the value can be null in this case and that is encoded by the definition level.

rdblue · 2016-11-02T17:13:16Z

LogicalTypes.md

+  optional boolean bool;
+}
+```
+The union field is used to differentiate a null value (the field was null to start with) from a projection that excludes the non-null field.


How about "the definition level of the UNION group is used to differentiate a null value (the union was null to start with) . . .". I'd just avoid using the term "field" to refer to the whole group for clarity.

Same thing in the next couple of sentences, I think it is more clear to remove "field" and refer to whether the group is optional or required: "If the union group is null, then the value was null" / "If the union group is non-null, but all of the options within it are null, then the value was non-null but was an option that was not projected."

isnotinvain · 2016-11-03T07:47:57Z

I think we might need more discussion of how we want projections into unions to work.

I don't think all object models can return an "empty" union (e.g. Thrift). Additionally, because we don't store group data, only data for each primitive column, in practice, to do projections into unions you have to do things like pick an arbitrary child field of the union and use its definition levels to figure out whether the union "wing" is null or not. We've got a lot of these workarounds in the Thrift implementation, but it'd be good to iron out how we really want that to work.

Another option would be to add a push down filter for any "branch" of a union that hasn't been selected in a projection.

rdblue · 2016-11-03T16:02:01Z

@isnotinvain, is the solution to the empty union simply that thrift can't project a subset of the union's branches?

I don't really like the idea of filtering rows that don't have non-null values, but I do see the problem. What about requiring the repetition of a union to be optional when projecting a subset of the branches?

isnotinvain · 2016-11-04T03:26:44Z

I'm not sure if I'm being clear, let me use an example of the issues we've seen with thrift union support.

Selecting only columns from 1 "wing" of a union
Let's say we have this schema:

union Animal {
  1: optional Dog dog
  2: optional Cat cat
  3: optional Turtle turtle
}

struct Dog {
  1: required string name
  2: required string bark
}

struct Cat {
  1: required string name
  2: required i32 numLives
}

struct Turtle {
  1: required string name
}

Now suppose the user applies a projection that selects only the column animal.cat.numLives.
In any of the type-safe implementations of Parquet, we still have to return an instance of Animal to the user. For a record that is a dog, what should we return?

One option is to return an 'empty' Dog. This has two issues:

1. Dog's fields are required, and Thrift doesn't really let you get away with not setting those fields. This is solved either by
   - putting null-like values (0, null, false, etc.) in those fields, or
   - throwing an exception on access.

2. When Parquet encounters this record, it does not know whether the record is a Dog or a Turtle. All we know is that it's not a Cat (its cat field is known to be null because the definition levels of animal.cat.numLives tell us that). But we didn't load any Dog/Turtle columns, so we don't know whether the Animal's optional dog / turtle fields are null or not (we don't store group-level data; it's only in the primitive columns). So what we've done here in the past is pick at least one arbitrary column from each "wing" of the union so that we can look at the definition levels of all "wings".
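The workaround of reading one column per "wing" can be sketched like this. `UnionBranchDetector` and its parameters are hypothetical and do not correspond to any parquet-java API; the sketch assumes the caller has already read, for the current record, the definition level of one representative leaf column from each branch.

```java
/** Hypothetical sketch: find the non-null branch from per-branch definition levels. */
final class UnionBranchDetector {
    /**
     * @param definitionLevels   definition level of one representative leaf column per branch,
     *                           for the current record
     * @param branchDefinedLevel per branch, the smallest definition level that means
     *                           "this branch group is non-null"
     * @return index of the first non-null branch, or -1 if every branch is null
     */
    static int nonNullBranch(int[] definitionLevels, int[] branchDefinedLevel) {
        for (int i = 0; i < definitionLevels.length; i++) {
            if (definitionLevels[i] >= branchDefinedLevel[i]) {
                return i;  // this wing of the union holds the value
            }
        }
        return -1;         // no wing is set: the union value is null
    }
}
```

For the Animal schema above stored as a required union, each branch's `name` column would have a maximum definition level of 1, so a Dog record would yield levels `{1, 0, 0}` and the method would return 0.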

Does that make sense?

isnotinvain · 2016-11-04T03:27:50Z

Oh, and, that's why filtering the "unknowns" (the is this a dog or a turtle case) away sort of makes sense? Like, the user didn't ask for any dog/turtle columns so I guess they don't want those records? (probably more confusing than helpful)

isnotinvain · 2016-11-04T03:37:13Z

One more thing I should probably clarify -- this would be pretty easy if thrift represented unions the way they look in the above schema definition, we could just return something like:

Animal x = new Animal()
x.dog = null
x.cat = null
x.turtle = null
return x

The problem in Thrift is that there will be no concrete Animal class, just an interface called Animal and 3 subclasses. There is an "unknown" subclass, but that is usually reserved for when we know, based on the data on disk, that we don't know what kind of union this is (because our schema is older than the writer's schema). Turning a known record into an unknown based on projections seems incorrect too.

rdblue · 2016-11-04T16:50:44Z

When selecting Cat in your example, we shouldn't create an empty Dog or Turtle, but instead return null because Cat was null. That's why I think we may want to require the union to be optional when projecting some of its branches: we will need to fill in with null for branches that aren't projected. For Thrift -- and Avro when there is no NULL option -- I think the consequence is not allowing the user to project a subset of the union's branches because it violates the contract of the requested schema to return null.

It is better to fail early in this projection case than to fake the other branches. Like you said, we have a problem if we don't know which other union branch was non-null. We also can't add a filter. Say I want to know the percentage of PetOwners in California that have cats. Then I filter PetOwner on state = 'CA' and project pet.cat. The null values (not cats) are relevant.

In terms of the spec for a UNION logical type, I think we end up with this: "When projecting a proper subset of the union's branches, the union itself must be optional. The union value should be null for branches that are not projected."

julienledem · 2016-11-04T17:55:32Z

Where do we want to document this?
I'd separate what is defined by the format and what is specific to each model integration.
To me it sounds like we'd want to formalize the details in parquet-thrift and parquet-avro, which each have their own specificities in the object model (and there could be more than one way of doing it per model).
In LogicalTypes.md I'd add the following statements:

- To know if a union is null, you need to keep at least one column from one of the branches.
- If you project out some branches of the union, the type of the union will be "unknown" for those at read time. How this is exposed depends on the object model integration (Avro, Thrift, ...). To know that type, you need to keep at least one column from each branch.
- Filtering "unknown" things out is defined by the model as well.

And add the details about Thrift and Avro in their respective directories (possibly linked from here).

julienledem · 2016-11-04T18:10:30Z

I rebased and addressed @rdblue latest feedback before the union projection discussion.
I'm planning to add the details I mention in #44 (comment)
pending feedback from @rdblue and @isnotinvain .

rdblue · 2016-11-04T18:11:22Z

Implications in the object model should go in each model. Avro should document that it can't project unless the requested schema includes a NULL branch. However, the overall requirement that Parquet won't project a subset of the branches unless the union group is optional should be documented in the spec since that's a general requirement for how Parquet projection works.

julienledem · 2016-11-04T18:19:16Z

@rdblue Agreed except for the last bit: "Parquet won't project a subset of the branches unless the union group is optional"
I don't think it should. Users should be able to project whatever they want and we should not add artificial constraints like this.
If the union is required then a missing value just means "in one of the branches that you projected out". This is a totally valid use case. Typically you could want to read all cats and filter out all other branches of the union and you don't care whether the rows you ignored were turtles or dogs.

julienledem · 2016-11-04T18:20:23Z

@rdblue but this constraint can be defined in avro and thrift. Just not as a global rule.

rdblue · 2016-11-04T20:35:54Z

> Users should be able to project whatever they want and we should not add artificial constraints like this.

I think you're right. I wasn't thinking that it was an artificial constraint, but there's no reason for Parquet to require this because its structure is clear.

isnotinvain · 2016-11-04T22:19:10Z

I'm not at a computer now but I will reply tonight. We can't return NULL even for an optional union, as that's not really correct: that union wasn't null, it was a present Cat/Turtle. Claiming it was null is not really accurate, I think.

rdblue · 2016-11-04T22:56:14Z

I think it's for the object model to decide how to handle the projection, but null is a reasonable option. The object model could expose the union as individual branches, or could create objects like Dog and Turtle with no fields, or have a "not projected" sentinel. It's really up to the model implementers to determine. I think for Avro, we'll go with null and document that null will be returned when non-null branches aren't projected.
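The options discussed in this thread could look roughly like the following in an object model. Everything here (`UnprojectedBranchPolicy`, the `Policy` enum, the `NOT_PROJECTED` sentinel) is a hypothetical illustration, not existing parquet-avro or parquet-thrift behavior. The method is assumed to be called only when the union group itself is non-null.

```java
/** Hypothetical sketch of how an object model might expose an unprojected union branch. */
final class UnprojectedBranchPolicy {
    enum Policy { RETURN_NULL, RETURN_SENTINEL, FAIL }

    /** Marker object standing in for "the value exists but its branch was not projected". */
    static final Object NOT_PROJECTED = new Object();

    /**
     * @param projectedValue the value found in the projected branches, or null if the
     *                       record's value lives in a branch that was not read
     */
    static Object expose(Object projectedValue, Policy policy) {
        if (projectedValue != null) {
            return projectedValue;               // normal case: the branch was projected
        }
        switch (policy) {
            case RETURN_NULL:
                return null;                     // the choice described for Avro above
            case RETURN_SENTINEL:
                return NOT_PROJECTED;            // keeps "unknown" distinct from null
            default:
                throw new IllegalStateException("value is in a branch that was not projected");
        }
    }
}
```

The trade-off is the one debated above: returning null is simple but makes an unprojected value indistinguishable from a null union, whereas a sentinel or a failure keeps the two cases apart.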

julienledem · 2016-11-08T23:15:10Z

@rdblue @isnotinvain I have updated the doc with more details about projecting unions.
Could you each create a PR for the following?

@rdblue: avro rules
@isnotinvain: thrift rules

thank you.

isnotinvain · 2016-11-09T00:50:45Z

I still have some reservations about this. I think an expected contract (but maybe not an explicit contract) that we have is that changing your projection shouldn't change your results, only your efficiency?

Let's say a user's query is something like:

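long dogs = 0;
long cats = 0;
long others = 0;

// parquetData is assumed to be an Iterator<Animal> over the projected records
while (parquetData.hasNext()) {
  Animal a = parquetData.next();

  if (a.isDog()) {
    dogs++;
  } else if (a.isCat()) {
    cats++;
  } else {
    others++;
  }
}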

If the user selects all the columns, they'll get an accurate count of dogs, cats, and others.
If they select only some columns from Dog, then a.isCat() will return false, and each cat will be counted as an 'other'.
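The effect can be reproduced with a small self-contained simulation. `ProjectionCountDemo` is a hypothetical stand-in that models each record as the name of its branch and treats an unprojected branch as invisible to the reader; it does not use any Parquet API.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical simulation of the counting query under different projections. */
final class ProjectionCountDemo {
    /** Returns {dogs, cats, others} when only the branches in projectedBranches are read. */
    static long[] count(List<String> recordBranches, Set<String> projectedBranches) {
        long dogs = 0;
        long cats = 0;
        long others = 0;
        for (String actual : recordBranches) {
            // A branch that was not projected reads back as absent.
            String visible = projectedBranches.contains(actual) ? actual : null;
            if ("dog".equals(visible)) {
                dogs++;
            } else if ("cat".equals(visible)) {
                cats++;
            } else {
                others++;
            }
        }
        return new long[] {dogs, cats, others};
    }
}
```

With the records dog, cat, cat, turtle, projecting all three branches gives the counts 1, 2, 1, while projecting only `dog` gives 1, 0, 3: the same data produces different answers depending on the projection.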

This seems pretty surprising to me, and is why in the Thrift integration we jumped through a lot of hoops to try to avoid it. I think if Parquet is going to support unions as a first-class concept, instead of pushing these complicated decisions to each object model, it'd be nice if Parquet could handle this for us for all object models.

We could, for example, write in parquet-core an efficient way to read only the definition levels of one child of each union branch, so that we can tell each object model which wing of the union an object belongs to. Then the object model can do whatever it wants with that info, but I think handling this ideally belongs in parquet-core, and isn't super difficult or inefficient (assuming we can read definition levels only).

isnotinvain · 2016-11-09T00:54:19Z

Another thing to consider about adding first class support for unions is efficiency.
We've found that Parquet pays a pretty high cost reading nulls, especially in large schemas made up of lots of unions (in this case, each record is fairly small, but the schema is very large because there are so many fields in total, and each record only populates a small number of them).

If we take advantage of knowing about unions in parquet-core, the read state machine / record assembly could skip a huge amount of asking column readers "are you null?" when it knows that they will all be null due to the fact that they are children of a union and that only one of the branches will ever contain non-nulls. Does that make sense?
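As a rough illustration of the saving, the sketch below counts null checks with and without union awareness. `UnionShortCircuit` is hypothetical and only models the bookkeeping; it assumes exactly one branch is non-null, that one representative column is enough to prove a branch null, and that branches are visited in schema order.

```java
/** Hypothetical cost model for null checks during record assembly of a union. */
final class UnionShortCircuit {
    /** Without union awareness, every leaf column of every branch is asked "are you null?". */
    static int checksNaive(int[] leavesPerBranch) {
        int checks = 0;
        for (int leaves : leavesPerBranch) {
            checks += leaves;
        }
        return checks;
    }

    /**
     * With union awareness: one check per null branch visited, all leaves of the live branch,
     * and nothing for the branches after it, since they must be null.
     */
    static int checksWithShortCircuit(int[] leavesPerBranch, int nonNullBranch) {
        int checks = 0;
        for (int i = 0; i < leavesPerBranch.length; i++) {
            if (i == nonNullBranch) {
                checks += leavesPerBranch[i];  // read every leaf of the branch holding the value
                break;                         // remaining branches are known to be null
            }
            checks += 1;                       // one representative column proves this branch null
        }
        return checks;
    }
}
```

For the Animal example (2, 2, and 1 leaf columns) and a Cat record, this model gives 3 checks instead of 5; the gap grows with the width of the schema.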

isnotinvain · 2016-11-09T01:08:35Z

LogicalTypes.md

+```
+// Union<String, Integer, Boolean> (where the value of the union may be null)
+// at most one of either String, Integer or Boolean is non-null
+// if they are all null then the field my_union itself must be null


this is already stored in the definition levels of each branch as well though right? So there's no need to check even.

isnotinvain · 2016-11-09T01:09:29Z

LogicalTypes.md

+```
+The definition level of the UNION group is used to differentiate a null value (the union was null to start with) from a projection that excludes the non-null field.
+If the Union group is null then the value was null.
+If the Union group is non-null, but all of the options within it are null, then the value was non-null but was an option that was not projected.


Oh I understand now, yes we can use this to tell the difference, but unfortunately we can't really tell the user what kind of branch this was, which I don't think is great.

isnotinvain · 2016-11-09T01:10:20Z

LogicalTypes.md

+If the Union group is null then the value was null.
+If the Union group is non-null, but all of the options within it are null, then the value was non-null but was an option that was not projected.
+
+ - If - despite the spec - a group instance contains more than one non-null field the behavior is undefined and may change depending on the projection applied.


can we define this as it will throw an exception in parquet-core? It should be completely detectable, any reason to leave it undefined instead of explicitly fatal?
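A minimal sketch of what such an explicit check might look like follows. `UnionInvariant` and `singleNonNullBranch` are hypothetical names; nothing like this is confirmed to exist in parquet-core, and the sketch assumes all branch values of one group instance are available at once.

```java
/** Hypothetical check that a union group instance has at most one non-null branch. */
final class UnionInvariant {
    /**
     * @return index of the single non-null branch, or -1 if all branches are null
     * @throws IllegalStateException if more than one branch is non-null
     */
    static int singleNonNullBranch(Object[] branchValues) {
        int found = -1;
        for (int i = 0; i < branchValues.length; i++) {
            if (branchValues[i] != null) {
                if (found >= 0) {
                    throw new IllegalStateException(
                        "union has non-null branches " + found + " and " + i);
                }
                found = i;
            }
        }
        return found;
    }
}
```

Note that under a projection only the projected branches are visible, so a check like this can only detect a violation when at least two of the offending branches are read.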

dbtsai · 2019-01-25T17:51:45Z

@rdblue @julienledem We're looking to use union type in my company, and I found this JIRA. Wondering the status of this PR and why it's not merged in the end? Thanks!

rdblue · 2019-01-25T18:16:18Z

@dbtsai, the problem with Union is that its behavior isn't well-defined. It is difficult to decide what is correct, and support in processing engines is bad. I don't think it is a good idea to add them when you can instead get predictable behavior using optional objects. (Also, see the discussion on the Iceberg spec.)

matlarsen · 2019-01-26T00:50:47Z

Hi @rdblue @julienledem, I'm working with @dbtsai on this feature. We are scoping out support of UNION through AVRO -> Spark -> Parquet, and it's interesting to read this spec and the concerns that come out of it.

Some concerns here seem to revolve around the user-understood semantics of querying fields in UNIONS; I'd suggest that these semantics may already be solidified via other paradigms; for instance a SparkSQL selector that selects fields from an optional nested type will return null if the parent type does not exist.
It seems a little chicken-and-egg in terms of tooling support; in relation to the above could these behaviors be codified in the spec? It seems to me that the starting point of adding UNION support in the tooling would be to ensure that the formats support it. In terms of Parquet and interop with other formats this behavior is already undefined somewhat (in that the conversion code decides the embedding of UNIONS from one format to another); moving this behavior into the spec/format I think would help to close the gap on this slightly undefined behavior.

Please let me know your thoughts!

wesm · 2019-02-01T03:20:09Z

I'd be interested in having unions in the Parquet format. It would have to annotate a STRUCT having a first field that is an INT32 indicating which of the subsequent fields should be selected for each row in the dataset. Other fields can simply be null when they are unselected in the union
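The tagged layout described here could be read back roughly as follows. `TaggedUnion` is a hypothetical name, and the sketch assumes the INT32 tag is a zero-based index into the fields that follow it, which the comment does not specify.

```java
/** Hypothetical reader for a struct whose first INT32 field selects one of the later fields. */
final class TaggedUnion {
    /**
     * @param tag      value of the leading INT32 field (assumed zero-based)
     * @param branches values of the subsequent fields; unselected ones are null
     * @return the value of the selected branch
     */
    static Object selected(int tag, Object[] branches) {
        if (tag < 0 || tag >= branches.length) {
            throw new IllegalArgumentException("tag " + tag + " does not name a branch");
        }
        return branches[tag];
    }
}
```

One consequence of this design is that the tag column alone identifies the branch of each row, so a reader would not need to load a column from every branch to answer "which kind of value is this?".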

julienledem · 2021-09-18T01:26:08Z

I have not been active on this recently. If someone wants to push this to the finish line they should feel free to take over this PR

codeinred · 2023-06-08T19:16:29Z

What needs to be done to push it over the finish line?

Bernolt · 2024-06-08T09:57:39Z

Hello @julienledem @xhochy @isnotinvain @rdblue,
Are there any plans to integrate this change?

raj-nimble · 2024-08-06T00:11:40Z

Any estimate of when this might merge? In general, having Union types available would be very useful, especially coming from libraries that are Rust-based, as the Rust enum type with variants containing data is very common, and right now none of the crates leveraging parquet can write out structures containing enums.

They all link to apache/arrow-rs#73 which then links here.

If there is anything I can do to assist in getting this feature merged, I would be happy to help.

wgtmac · 2024-08-06T09:49:26Z

@Bernolt @raj-nimble Sorry for the late reply. It seems this proposal has been dormant for years. As a convention, the Parquet community requires a formal vote on the project mailing list, with two reference implementations (parquet-java and another, usually parquet-cpp), to move forward.

rdblue reviewed Oct 27, 2016

rdblue requested changes Oct 28, 2016

xhochy approved these changes Nov 2, 2016

rdblue reviewed Nov 2, 2016

julienledem added 7 commits November 4, 2016 10:57

PARQUET-756: Add Union Logical type

73d2b06

add LogicalTypes doc

093f62a

add formating

42f10f4

clarifying description and null vs projection per review feedback

cc516de

clarifying description and null vs projection per review feedback

0345570

typo

352aded

clarified based on feedback

f852499

julienledem force-pushed the union branch from 6b1e99c to f852499 Compare November 4, 2016 18:06

add notes about projection

094c59b

isnotinvain reviewed Nov 9, 2016

rdblue mentioned this pull request Mar 6, 2018

Initial pass at adding ORC to Iceberg. Netflix/iceberg#12

Closed

loicalleyne mentioned this pull request Aug 17, 2022

genererating schemas from arbitrary map[string]interface{} (parquet, avro) redpanda-data/connect#1353

Open

tustvold mentioned this pull request Mar 30, 2023

[Donation Proposal]: OTEL Arrow Adapter open-telemetry/community#1332

Closed

mkarbo mentioned this pull request Aug 31, 2023

Feature request: Union dtype support pola-rs/polars#10827

Open

Jefffrey mentioned this pull request Nov 7, 2023

Add support for Union arrays in Parquet apache/arrow-rs#73

Open

kylebarron mentioned this pull request Jan 31, 2024

Add GeoArrow encoding as an option to the specification opengeospatial/geoparquet#189

Merged

mariosasko mentioned this pull request Feb 21, 2024

Adds parquet writer huggingface/datatrove#103

Merged

m-mohr mentioned this pull request Apr 27, 2024

Mixed concerns: Encoding + Geometry Type opengeospatial/geoparquet#207

Closed

asfimport mentioned this pull request Jun 23, 2024

Add Union Logical type #316

Open

mobiusklein mentioned this pull request Oct 21, 2024

Standardisation of cv_terms in parquet. bigbio/quantms.io#79

Open
