Conversation

@voonhous
Member

@voonhous voonhous commented Nov 24, 2025

Describe the issue this Pull Request addresses

This Pull Request addresses the ongoing effort to refactor the Hudi codebase to use a unified HoodieSchema type system for schema handling, migrating away from direct usage of Avro's Schema in client-side code where possible.

This specifically completes Phase 5: Java Client Core Migration as described in #14270.

NOTE: All changes made here avoid FileGroup* classes.

Summary and Changelog

This PR migrates key Hudi Java client components to use HoodieSchema for internal schema handling, enhancing maintainability without altering the Avro on-disk format.

  1. Core Migration: DeleteContext, RecordContext, and HoodieReaderContext, along with their engine-specific implementations, now use HoodieSchema (instead of Avro Schema) for internal operations such as getValue, toBinaryRow, and getOrderingValue.
  2. Utilities: HoodieSchemaUtils gains new client-side utilities, including getFieldPosition, projectSchema, and isReadCompatible.
  3. Integration: The HoodieSchema changes from the previous Column Statistics Migration phase are integrated into core utility classes (FileFormatUtils, HoodieTableMetadataUtil, etc.).

This ensures records and metadata are consistently processed using the HoodieSchema abstraction in memory.

Impact

Low

Risk Level

Low. The refactoring maintains Avro serialization compatibility via HoodieSchema.toAvroSchema().

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Nov 24, 2025
@yihua yihua added this to the release-1.2.0 milestone Nov 24, 2025
@voonhous
Member Author

voonhous commented Nov 24, 2025

Still working on this, mergeBootstrapReaders has not been migrated yet.

@voonhous voonhous force-pushed the phase_5_java_client branch 5 times, most recently from c35119a to a47d4ed Compare November 26, 2025 09:11
@voonhous voonhous force-pushed the phase_5_java_client branch 5 times, most recently from 3d9e761 to e987dbd Compare December 2, 2025 13:49
@voonhous voonhous self-assigned this Dec 2, 2025
@voonhous voonhous changed the title feat: (schema - phase 5) Perform Java Client Core Migration feat (schema): phase 5 - Perform Java Client Core Migration Dec 2, 2025
@voonhous voonhous changed the title feat (schema): phase 5 - Perform Java Client Core Migration feat(schema): phase 5 - Perform Java Client Core Migration Dec 2, 2025
@voonhous voonhous force-pushed the phase_5_java_client branch 2 times, most recently from a8998d9 to 1dad9ee Compare December 3, 2025 09:19
@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Dec 3, 2025
@voonhous voonhous force-pushed the phase_5_java_client branch 6 times, most recently from 8ff89c2 to 6d27465 Compare December 4, 2025 15:05
@voonhous
Member Author

voonhous commented Dec 4, 2025

Okay, should be ready for review now.

Contributor

@the-other-tim-brown the-other-tim-brown left a comment

I've reviewed some of the files but the scope of this PR is much larger than the ticket calls for. Is that intentional or is this combined with another branch?

@voonhous voonhous force-pushed the phase_5_java_client branch from 758cfbc to 8e39e79 Compare December 5, 2025 01:56
HoodieSchema longSchema = HoodieSchema.create(HoodieSchemaType.LONG);

List<HoodieSchemaField> fields = Arrays.asList(
HoodieSchemaField.of("precombine", HoodieSchema.createUnion(longSchema, nullSchema), null, 0),
Contributor

Instead of createUnion, we can use HoodieSchema.createNullable.

Member Author

Swapped them out. I will search for other references that I modified too.

Member Author

Just want to call this out: Avro has a strict validation rule, where the default value for a union field must match the type of the FIRST element in the union type array.

  • ["null", "int"] with default null (Valid)
  • ["null", "int"] with default 0 (Invalid)
  • ["int", "null"] with default 0 (Valid)
  • ["int", "null"] with default null (Invalid)

If we use HoodieSchema.createNullable for int fields with default 0, the following error will be thrown for the affected tests:

Caused by: org.apache.avro.AvroTypeException: Invalid default for field age: 0 not a ["null","int"]
	at org.apache.avro.Schema.validateDefault(Schema.java:1635)
	at org.apache.avro.Schema.access$500(Schema.java:94)
	at org.apache.avro.Schema$Field.<init>(Schema.java:561)
	at org.apache.avro.Schema$Field.<init>(Schema.java:607)
	at org.apache.hudi.avro.HoodieAvroUtils.createNewSchemaField(HoodieAvroUtils.java:381)
	at org.apache.hudi.common.schema.HoodieSchemaField.of(HoodieSchemaField.java:113)
	at org.apache.hudi.common.schema.HoodieSchemaField.of(HoodieSchemaField.java:92)
	at org.apache.hudi.functional.TestBufferedRecordMerger.getSchema1(TestBufferedRecordMerger.java:815)
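
The first-branch rule described above can be modeled with a small, self-contained sketch. This is not Avro's implementation: UnionDefaultRule and isValidDefault are hypothetical names, and only the "null" and "int" branch types are handled. It checks a default value against the first branch of a union, which is the check that rejects a default of 0 for ["null","int"].

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of the rule stated in the comment above; NOT Avro's code.
public class UnionDefaultRule {

  /** Returns true when defaultValue matches the FIRST branch of the union. */
  public static boolean isValidDefault(List<String> unionBranches, Object defaultValue) {
    if (unionBranches.isEmpty()) {
      return false;
    }
    return matches(unionBranches.get(0), defaultValue);
  }

  // Only "null" and "int" are modeled; every other type is treated as a mismatch.
  private static boolean matches(String type, Object value) {
    switch (type) {
      case "null":
        return value == null;
      case "int":
        return value instanceof Integer;
      default:
        return false;
    }
  }

  public static void main(String[] args) {
    List<String> nullFirst = Arrays.asList("null", "int");
    List<String> intFirst = Arrays.asList("int", "null");

    System.out.println(isValidDefault(nullFirst, null)); // true
    System.out.println(isValidDefault(nullFirst, 0));    // false
    System.out.println(isValidDefault(intFirst, 0));     // true
    System.out.println(isValidDefault(intFirst, null));  // false
  }
}
```

Under this model, a nullable field with a non-null default needs the non-null type listed first, which is why createUnion(longSchema, nullSchema) is kept for fields whose default is 0.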

Comment on lines +95 to +97
// Schema can be null in test scenarios where schemas are not registered in the RecordContext (e.g. in tests)
if (schema != null) {
record = recordContext.seal(recordContext.toBinaryRow(schema, record));
Contributor

Do we still want to seal in these cases?

Member Author

This is in the original code. I am just adding an additional check to ensure that schema is not null.

org.apache.hudi.common.engine.RecordContext#getSchemaFromBufferRecord calls
org.apache.hudi.common.engine.RecordContext#decodeAvroSchema, which is a nullable method.
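
A minimal sketch of the guard described above, using hypothetical stand-in types rather than Hudi's classes: decodeSchema plays the role of a nullable lookup, and prepare converts and seals the record only when a schema was found. Leaving the record untouched in the null case is an assumption made for illustration; the excerpt above does not show that branch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in; schemas and records are plain strings for brevity.
public class NullableSchemaGuard {

  private final Map<Integer, String> registeredSchemas = new HashMap<>();

  public void register(int schemaId, String schema) {
    registeredSchemas.put(schemaId, schema);
  }

  /** Nullable lookup: returns null when the id was never registered. */
  public String decodeSchema(int schemaId) {
    return registeredSchemas.get(schemaId);
  }

  /** Converts and seals the record only when its schema can be resolved. */
  public String prepare(int schemaId, String record) {
    String schema = decodeSchema(schemaId);
    if (schema != null) {
      return "sealed(" + record + " as " + schema + ")";
    }
    // Assumption: without a schema the record is passed through unchanged.
    return record;
  }
}
```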

@voonhous voonhous force-pushed the phase_5_java_client branch from fb1da7d to 881d8e1 Compare December 5, 2025 10:31
@voonhous
Member Author

voonhous commented Dec 5, 2025

I've reviewed some of the files but the scope of this PR is much larger than the ticket calls for. Is that intentional or is this combined with another branch?

Nope, not combined with another branch. It's large because of how many other classes are using them. It cascades very quickly in a way where:

Let's change the variable to a HoodieSchema type -> Change parameter of method signature to HoodieSchema type.

The changes mainly affect the *Handles. Once the signature of a *Handle changes, the same cascade repeats.

I am trying not to touch the *Iterators and HoodieFileGroup* classes.

Most of the changes are in the hudi-client/hudi-client-common modules, and that's where everything starts cascading quickly.

@voonhous voonhous force-pushed the phase_5_java_client branch from 881d8e1 to e964274 Compare December 5, 2025 10:39
@hudi-bot
Collaborator

hudi-bot commented Dec 5, 2025

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build
