Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iceberg: fix spec inconsistency in manifest list files_count #24602

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from

Conversation

andrwng
Copy link
Contributor

@andrwng andrwng commented Dec 18, 2024

The schema we are using was pulled some time ago and appears to be outdated. The Apache Iceberg Java implementation has since renamed added_data_files_count and friends to added_files_count, to match the documented spec.

This meant that after updating the table with an external non-Redpanda writer, Redpanda wouldn't be able to download the current manifest list when appending and get stuck, complaining about an EOF (presumably the Avro C++ library throws this when there is an unknown field).

I suspect that this may have been the cause of an EOF seen when trying to read a manifest list with BigQuery:

Error while reading data, error message: The Apache Avro failed to read data with the following error: EOF reached File: [...]/metadata/snap-[...]-0.avro

The old names are added as an alias to ensure Redpanda can still download Iceberg manifest lists from 24.3. This PR also adds an upgrade test to validate that this alias is actually required, rather than simply renaming the field. Without the alias, the upgrade test fails.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Bug Fixes

  • Fixes a bug in Redpanda's Iceberg manifest list Avro definition that previously resulted in an end-of-file (EOF) error when reading manifest list Avro files written by other engines. This could previously block Redpanda appending Iceberg data, and could also prevent certain query engines from successfully reading Iceberg data written by Redpanda.

@andrwng andrwng force-pushed the datalake-manifest-list-fix branch 2 times, most recently from b89b3ed to 68397b0 Compare December 18, 2024 01:55
The schema we are using was pulled some time ago and appears to be
outdated. The Apache Iceberg Java implementation has since renamed
added_data_files_count and friends to added_files_count, to match the
documented spec.

This meant that after updating the table with an external non-Redpanda
writer, Redpanda wouldn't be able to download the current manifest list
when appending and get stuck, complaining about an EOF (presumably the
Avro C++ library throws this when there is an unknown field).

I suspect that this may have also been the cause of an EOF seen when
trying to read a manifest list with BigQuery:

Error while reading data, error message: The Apache Avro failed to read data with the following error: EOF reached File: [...]/metadata/snap-[...]-0.avro

The old names are added as an alias to ensure Redpanda can still
download Iceberg manifest lists from 24.3.
Adds a simple upgrade test to go from 24.3 to HEAD, ensuring progress
can be made.

A recent change to our Iceberg manifest list schema changed the name of
one field, adding the old name as an alias. Without that aliasing, this
new test would fail.
@andrwng andrwng force-pushed the datalake-manifest-list-fix branch from 68397b0 to fe6155c Compare December 18, 2024 01:58
@andrwng andrwng changed the title iceberg: fix inconsistency in manifest list files_count iceberg: fix spec inconsistency in manifest list files_count Dec 18, 2024
@andrwng andrwng added this to the v24.3.x-next milestone Dec 18, 2024
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Dec 18, 2024

Retry command for Build#59894

please wait until all jobs are finished before running the slash command



/ci-repeat 1
tests/rptest/tests/datalake/partition_movement_test.py::PartitionMovementTest.test_cross_core_movements@{"cloud_storage_type":1}
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":20}

@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#59894
test_id test_kind job_url test_status passed
rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=20.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/59894#0193d7dc-498b-46b5-9e78-c05a95af3147 FAIL 0/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/59894#0193d7dc-4989-42dd-ad69-30935788b832 FAIL 0/6

@ztlpn
Copy link
Contributor

ztlpn commented Dec 18, 2024

Hmm I wonder if we should backport this. Technically this is a breaking change that will prevent downgrades to earlier 24.3 versions. Or is backward compat also working thanks to aliases?

@andrwng
Copy link
Contributor Author

andrwng commented Dec 18, 2024

Hmm I wonder if we should backport this. Technically this is a breaking change that will prevent downgrades to earlier 24.3 versions. Or is backward compat also working thanks to aliases?

Yeah good thought. You're right, this will prevent Iceberg from working following a rollback, even with the alias, because 24.3.prev won't be able to read the manifest lists with the new Avro schema.

That said, I lean towards it being justifiable -- it's a correctness fix, especially for a beta feature.

@@ -134,6 +134,31 @@ def test_avro_schema(self, cloud_storage_type, query_engine):
assert spark_describe_out == spark_expected_out, str(
spark_describe_out)

@cluster(num_nodes=4)
@matrix(cloud_storage_type=[CloudStorageType.S3])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: will the test fail or be skipped when running on e.g. azure?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants