
Conversation

@abbywh (Contributor) commented May 15, 2025

Summary

We want to unit test Iceberg via CI, as well as improve Iceberg support in the Chronon OSS package.

Why / Goal

Follow Ups

Test Plan

  • [x] Added Unit Tests
  • [x] Covered by existing CI
  • [x] Integration tested

Added CircleCI check

Checklist

  • [N/A] Documentation update

Reviewers

@abbywh mentioned this pull request May 15, 2025
@abbywh changed the title from "Iceberg unit tests, support Iceberg + nonhive catalogs" to "Iceberg unit tests, support Iceberg + nonhive catalogs, Iceberg Kryo Serializer" May 16, 2025
@abbywh marked this pull request as ready for review May 16, 2025 18:41
@nikhil-zlai (Collaborator) left a comment

very clean!

.idea/
*.jvmopts
.bloop*
.metals*
Collaborator:

How is working with Metals relative to IntelliJ? Does the debugger work as well?

@abbywh (Author):

It's really good, actually. The debugger worked out of the box, and I found it comparable to IntelliJ overall.

I'd recommend it to anyone who has remote dev boxes, since VSCode's remote integration is far better in my experience. All the tests run a lot faster, and I got in way more dev cycles. I'd probably only recommend it over IntelliJ with a dev box, though.

  sparkSession: SparkSession): Seq[Map[String, String]] = {
    sparkSession.sqlContext
-     .sql(s"SHOW PARTITIONS $tableName")
+     .sql(s"SELECT partition FROM $tableName" ++ ".partitions")


Out of curiosity, does this work for regular Hive tables?


This is for Iceberg; Hive support is here.

@abbywh (Author):

It should work for Hive tables, and the internal targets I'm hitting are more or less "regular Hive tables". Iceberg abstracts itself from the catalog implementation, so as long as Iceberg has an interface to your catalog implementation, it will work: https://iceberg.apache.org/docs/latest/spark-queries/#spark-queries
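
For illustration, a minimal sketch of the catalog-agnostic approach; the helper name and return shape here are assumptions for the example, but Iceberg's `.partitions` metadata table (a `partition` struct column, one row per partition) is from the docs linked above:

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical helper: list partitions via Iceberg's metadata table rather than
// the catalog's SHOW PARTITIONS, so any Iceberg catalog implementation works.
def icebergPartitions(sparkSession: SparkSession, tableName: String): Seq[Map[String, String]] = {
  sparkSession
    .sql(s"SELECT partition FROM $tableName.partitions")
    .collect()
    .map { row =>
      val part = row.getAs[Row]("partition")
      // Convert the partition struct (e.g. ds=20250516) into a field -> value map.
      part.schema.fieldNames.map(f => f -> String.valueOf(part.getAs[AnyRef](f))).toMap
    }
    .toSeq
}
```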

@krisnaru (Contributor):

LGTM. Let's merge this ASAP.

@abbywh (Author) commented May 27, 2025

+1. I anecdotally got some better performance out of this too, probably because we get better file pruning from the Iceberg manifests vs. directly querying Hive. I need another approval, though.

@krisnaru (Contributor):

@pengyu-hou, could you PTAL?

@pengyu-hou (Collaborator) left a comment

Thanks for the PR. Took a first pass, and I will verify that the new method to show partitions works on our end.

"spark.chronon.table_write.format" -> "delta"
)
(configMap, "ai.chronon.spark.ChrononDeltaLakeKryoRegistrator")
(configMap, "ai.chronon.spark.ChrononKryoRegistrator")
Collaborator:

Is this a duplicate? There is the same thing on line 69.
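
For context, a minimal sketch of how a (configMap, registrator) pair like this typically gets applied when building a test session; the helper name and wiring are assumptions for the example, while `spark.serializer` and `spark.kryo.registrator` are standard Spark configs:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: apply the format-specific config map plus the matching
// Kryo registrator class when constructing a local test SparkSession.
def buildTestSession(configMap: Map[String, String], registrator: String): SparkSession = {
  val base = SparkSession
    .builder()
    .master("local[*]")
    .appName("chronon-iceberg-test")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrator", registrator)
  // Fold in the remaining per-format settings, e.g. spark.chronon.table_write.format.
  configMap.foldLeft(base) { case (b, (k, v)) => b.config(k, v) }.getOrCreate()
}
```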

rdd
}

def tableExists(tableName: String): Boolean = sparkSession.catalog.tableExists(tableName)
Collaborator:

Curious, does the old method not work for Iceberg?

@abbywh (Author):

Good question! It does work for non-Iceberg tables IF your underlying catalog supports this operation. This line of code queries the catalog directly, but the more idiomatic thing to do with Iceberg is to use its built-in partition APIs, which are agnostic to your underlying catalog: https://iceberg.apache.org/docs/latest/spark-queries/#spark-queries (note that this page also documents the point I made that Iceberg doesn't support DSv1).
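
To make the contrast concrete, a hedged example against a hypothetical Iceberg table `db.events`, assuming an active SparkSession in scope as `spark` (the metadata tables and their columns are from the Iceberg docs linked above):

```scala
// Catalog-dependent: only works if the underlying catalog implements SHOW PARTITIONS.
spark.sql("SHOW PARTITIONS db.events").show()

// Catalog-agnostic: Iceberg metadata tables, resolved by Iceberg itself.
spark.sql("SELECT partition, record_count FROM db.events.partitions").show()
spark.sql("SELECT file_path, record_count FROM db.events.files").show()
```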

@abbywh (Author):

I'm sure most setups use a catalog that works with DSv1, but ours does not. I have read that there's better pushdown in V2 sources, but I can't really be a good source for that benchmark, considering my setup doesn't work with V1.

Collaborator:

OK, this should be fine; they are equivalent anyway.

partitions: Seq[String],
partitionColumn: String = partitionColumn,
subPartitionFilters: Map[String, String] = Map.empty): Unit = {
// TODO this is using datasource v1 semantics, which won't be compatible with non-hive catalogs
Collaborator:

Could you explain more on DSv1 and DSv2? Do you have a pointer?

@abbywh (Author):

Sure! It's largely historical, but the tl;dr is that at some point the way you build a datasource connector was redone to support more sink formats with better performance. This is a good article: https://blog.madhukaraphatak.com/spark-datasource-v2-part-1, and this is the design doc: https://issues.apache.org/jira/browse/SPARK-15689. The most notable reason this comes up is that Iceberg is not integrated with DataSource V1.
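
As a rough illustration (the catalog name `my_catalog` and paths are assumptions for the example; registering an Iceberg catalog via `spark.sql.catalog.*` is per the Iceberg docs), here is the same kind of read expressed through a path-based V1-style file source versus Iceberg's V2 catalog path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dsv-demo").getOrCreate()

// V1-style path read: Spark's built-in file sources still default to the V1 code path.
val v1 = spark.read.format("parquet").load("/warehouse/db/events")

// V2 read: Iceberg tables resolve through a catalog plugin registered as, e.g.,
// spark.sql.catalog.my_catalog = org.apache.iceberg.spark.SparkCatalog,
// so you address them by multipart name instead of by path.
val v2 = spark.table("my_catalog.db.events")
```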

}

@Test
def testEventsEventsTemporalLongDs(): Unit = {
Collaborator:

The test makes sense for long ds, but I am wondering whether we want to use this particular test case. It might significantly increase the CI time.

@abbywh (Author):

I do think it could be reasonable to remove this; however, we use dateints really heavily, and I like the insurance of having a high-level unit test just to make sure it works end to end. My other testing might be sufficient alone, but we did have a recent regression.
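
For a sense of what such a test guards, a minimal hedged sketch (the table and test names are hypothetical, not the actual test in this PR): a Long dateint partition column must survive a write/read round trip.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.junit.Assert.assertEquals
import org.junit.Test

class LongDsRoundTripTest {
  @Test
  def testLongDsRoundTrip(): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("long-ds").getOrCreate()
    val ds = 20250516L // dateint partition value, stored as Long rather than String

    spark.range(10).withColumn("ds", lit(ds))
      .write.mode("overwrite").partitionBy("ds").saveAsTable("events_long_ds")

    // Reading back must preserve both the value and the Long type of the partition column.
    val back = spark.table("events_long_ds").select("ds").distinct().collect().map(_.getLong(0))
    assertEquals(Seq(ds), back.toSeq)
  }
}
```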

Collaborator:

OK, let's keep it then.

@pengyu-hou (Collaborator) left a comment

Thanks so much, @abbywh!

@pengyu-hou merged commit 51e888b into airbnb:main Jun 10, 2025
9 checks passed