[HUDI-8889] Trim unnecessary columns during MoR snapshot read #12677
base: master
39f927a to 1f6e7dc
1. Trim unnecessary columns during MoR snapshot read Signed-off-by: TheR1sing3un <[email protected]>
4842151 to 18c1648
1. fix wrong embed internal schema Signed-off-by: TheR1sing3un <[email protected]>
@hudi-bot run azure
   *
   * @VisibleInTests
   */
  val mandatoryFields: Seq[String]
  lazy val optionalExtraFields: Seq[String] = Seq.empty
Not sure how the fields got set up and whether they are required or not.
Not sure how the fields got set up and whether they are required or not.
mandatory is for fields that must be read regardless of the read behavior, such as _hoodie_commit_time in incremental reads, and optional is for fields that are only needed for merging during snapshot reads, such as _hoodie_record_key and the precombine key.
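A minimal sketch of that distinction, not the code in this PR: the object ColumnTrimSketch, the function columnsToRead, and the needsMerge flag are hypothetical names, and the columns B (standing in for a precombine field) and D (the queried column) are made up for illustration. It only shows the idea that the merge-only fields can be dropped when a file does not actually need merging.

```scala
// Hypothetical sketch; none of these names are Hudi APIs.
object ColumnTrimSketch {

  /** Columns to read from a file for one query.
    *
    * @param queried    columns the query asked for
    * @param mandatory  fields that are always read
    * @param optional   fields needed only when records are merged
    * @param needsMerge whether this read actually has to merge records (assumed input)
    */
  def columnsToRead(
      queried: Seq[String],
      mandatory: Seq[String],
      optional: Seq[String],
      needsMerge: Boolean): Seq[String] = {
    // Merge-only fields are added only when a merge will really happen.
    val extras = if (needsMerge) mandatory ++ optional else mandatory
    // Extra fields first, then the queried columns, without duplicates.
    (extras ++ queried).distinct
  }

  def main(args: Array[String]): Unit = {
    // Assumed merge-only fields: the record key and a precombine field B.
    val optional = Seq("_hoodie_record_key", "B")
    println(columnsToRead(Seq("D"), Seq.empty, optional, needsMerge = true))
    println(columnsToRead(Seq("D"), Seq.empty, optional, needsMerge = false))
  }
}
```

With needsMerge set to false, only the queried column and the mandatory fields remain in the result, which is the trimming this thread is discussing.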
So why name the fields optional?
Consider the following case when we perform a snapshot read on a MoR table:
(A, B, C, D)
A
B
select D from table
(_hoodie_record_key, B, D)
(_hoodie_record_key, B, D)
(_hoodie_record_key, B, D)
(_hoodie_record_key, B, D)
(_hoodie_record_key, A, B, D)
However, except for the last two cases, we only need to read column D from the file in the other cases.
Change Logs
Impact
Improves performance of MoR snapshot reads when no merging is actually needed.
Risk level (write none, low medium or high below)
low
Documentation Update
none
Contributor's checklist