
WIP Iceberg integration #47

Open
wants to merge 25 commits into main
Conversation

@margon8 (Contributor) commented Sep 9, 2024

Add Iceberg support to Metabolic

@margon8 margon8 added the wip/donotmerge Do not merge label Sep 9, 2024
@braislchao braislchao self-assigned this Sep 10, 2024
@braislchao (Collaborator)

Generic Table source integration

  • A Table source can be in Iceberg or Delta format
  • We need to know its catalog and its name
  • With the catalog and the name we can locate the table in the corresponding implementation and get a generic DataFrame in both cases (see the reader sketch below)

sources: [
    {
        catalog: "data_lake.silver_events_v3"
        name: "silver_events_v3"
        format: TABLE
    },
]

@braislchao (Collaborator) commented Sep 11, 2024

Test catalog-based readStream in Iceberg reads:

data.writeStream
    .format("iceberg")
    .outputMode("append")
    .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
    .option("checkpointLocation", checkpointPath)
    .toTable("database.table_name")

Reading from local.data_lake.letters currently fails with:

org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
local.data_lake.letters
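A minimal sketch of the matching read side, with local.data_lake.letters as the source and a hypothetical target table and checkpoint path; toTable starts the streaming query, and the AnalysisException above is what Spark raises when a batch-style action (for example .show()) is applied to the streaming DataFrame instead of starting a query:

import java.util.concurrent.TimeUnit
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().getOrCreate()

// Catalog-based streaming read from the Iceberg table.
val data = spark.readStream
  .format("iceberg")
  .load("local.data_lake.letters")

// toTable(...) starts the query; without starting it, any batch action on
// `data` fails with "Queries with streaming sources must be executed with
// writeStream.start();".
val query = data.writeStream
  .format("iceberg")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
  .option("checkpointLocation", "/tmp/checkpoints/letters") // hypothetical path
  .toTable("local.data_lake.letters_sink")                   // hypothetical target table

query.awaitTermination()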

@braislchao (Collaborator) commented Sep 11, 2024

For catalog compatibility we can use Iceberg's SparkSessionCatalog to remain compatible with Delta:

Spark's built-in catalog supports existing v1 and v2 tables tracked in a Hive Metastore. This configures Spark to use Iceberg's SparkSessionCatalog as a wrapper around that session catalog. When a table is not an Iceberg table, the built-in catalog will be used to load it instead.
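A minimal sketch of that wrapper configuration, following the Iceberg documentation; the type = hive setting is the documented Hive Metastore variant, while the Glue-backed variant used later replaces it with catalog-impl:

import org.apache.spark.sql.SparkSession

// Wrap Spark's built-in session catalog with Iceberg's SparkSessionCatalog:
// Iceberg tables are handled by Iceberg, and non-Iceberg tables (e.g. Delta)
// fall back to the built-in catalog.
val spark = SparkSession.builder()
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hive")
  .getOrCreate()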

@braislchao (Collaborator) commented Sep 17, 2024

We have to find a way to standardize tests in the local environment between these two options:

  • Using local directories and catalog
  • Referencing a Glue catalog and S3 dev environments

For the moment, we are going to test it directly in the Glue dev environment with the following Spark configuration:

    implicit val spark = sparkBuilder
      .appName(s"Metabolic Mapper - $configPath")
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      // Enable both the Iceberg and Delta SQL extensions
      .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension")
      // Wrap the session catalog with Iceberg's SparkSessionCatalog so non-Iceberg (Delta) tables fall back to the built-in catalog
      .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
      .config("spark.databricks.delta.schema.autoMerge.enabled", "true")
      .config("spark.databricks.delta.optimize.repartition.enabled", "true")
      .config("spark.databricks.delta.vacuum.parallelDelete.enabled", "true")
      // Back the Iceberg session catalog with AWS Glue and S3
      .config("spark.sql.catalog.spark_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
      .config("spark.sql.catalog.spark_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
      .config("spark.sql.catalog.spark_catalog.client.region", "eu-central-1")
      .config("spark.sql.defaultCatalog", "spark_catalog")
      .getOrCreate()

Since we are using Iceberg 1.6.1, it's important to take these configuration options into account and not include iceberg in the --datalake-formats property:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html#aws-glue-programming-etl-format-iceberg-enable

@braislchao braislchao added the enhancement New feature or request label Sep 18, 2024
@braislchao (Collaborator) commented Sep 19, 2024

To make the Iceberg integration work, we needed to add the Iceberg libraries as --extra-jars.
This worked while keeping the --datalake-formats property set to delta:

"--extra-jars" = "s3://factorial-metabolic/extra-libs/metabolic-core-iceberg-assembly.jar,s3://factorial-metabolic/extra-libs/iceberg-aws-bundle-1.6.1.jar,s3://factorial-metabolic/extra-libs/iceberg-spark-runtime-3.3_2.12-1.6.1.jar"
"--datalake-formats" = "delta"

With the previous configuration we received an error about a missing warehouse location. It seems the warehouse is mandatory for Glue to be able to write to an S3 folder as the main catalog:

org.apache.iceberg.exceptions.ValidationException: Cannot derive default warehouse location, warehouse path must not be null or empty at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:49)

This property must be configured in the Metabolic Spark configuration:

.config("spark.sql.catalog.spark_catalog.warehouse", "s3://factorial-metabolic/data-lake/dev/feature_test_iceberg_write/")

@braislchao (Collaborator)

Added two changes not related to this PR but needed for test integrity:

  • Change the log4j2.properties configuration in the test environment
  • Ignore the Secret Manager test (we probably need to solve the token issues, but first I need to understand what this test covers)
