feat: Support iceberg, hudi, delta, hdfs data source. #875
Conversation
Summary of Changes (Gemini Code Assist)
Hello @Dludora, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly extends the data handling capabilities of the system by integrating support for several popular distributed data sources and data lake table formats, including Iceberg, Hudi, Delta Lake, and HDFS. It introduces dedicated loading strategies for both local and Ray-based executors for these new sources. Furthermore, it enhances the data export functionality to allow direct writing to HDFS and Iceberg tables. A key improvement is the refactoring of filesystem creation, centralizing the logic for configuring S3 and HDFS access, which makes the codebase cleaner and more maintainable. These changes collectively enable more flexible and robust data processing workflows, especially in distributed environments.
Code Review
This pull request adds support for several new data sources (Iceberg, Hudi, Delta, HDFS) and export targets (HDFS, Iceberg), which is a great enhancement. The implementation is mostly solid, with good use of utility functions for creating filesystems and robust fallback mechanisms for exporting data.
My review includes a few suggestions:
- Refactoring duplicated code for HDFS filesystem creation into a shared utility (a hedged sketch follows this list).
- Removing some commented-out code.
- Fixing an inconsistency in the configuration validation for the Hudi data loading strategy.
- Correcting the format of a new demo data file.
Overall, these are great additions to the project's capabilities.
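To make the first suggestion concrete, a minimal sketch of such a shared utility could look like the following. The module path and function name are assumptions for illustration only, not part of this pull request.

    # Hedged sketch of a shared helper, e.g. in data_juicer/utils/fs_utils.py (path assumed).
    # Loading strategies and the exporter could both call it instead of duplicating
    # the HadoopFileSystem construction logic.
    import pyarrow.fs as pa_fs


    def create_pyarrow_hdfs_filesystem(conf: dict) -> pa_fs.HadoopFileSystem:
        """Build a pyarrow HadoopFileSystem from a plain config dict."""
        kwargs = {
            # "default" lets pyarrow resolve fs.defaultFS from the Hadoop configuration.
            "host": conf.get("host", "default"),
            "user": conf.get("user"),
            "kerb_ticket": conf.get("kerb_ticket"),
            "extra_conf": conf.get("extra_conf"),
        }
        port = conf.get("port")
        if port is not None:
            # Config values may arrive as strings; HadoopFileSystem expects an int port.
            kwargs["port"] = int(port)
        return pa_fs.HadoopFileSystem(**kwargs)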
CONFIG_VALIDATION_RULES = {
    "required_fields": ["table_uri"],
    "optional_fields": [],
    "field_types": {"path": str},
The field_types in CONFIG_VALIDATION_RULES is inconsistent. It specifies "path": str, but the required field is table_uri. This should be changed to "table_uri": str to match the usage in load_data and the required_fields list.
| "field_types": {"path": str}, | |
| "field_types": {"table_uri": str}, |
def _create_hdfs_fs(self):
    import pyarrow.fs as fs

    host = self.ds_config.get("host", None)
    port = self.ds_config.get("port", None)
    if port is not None:
        port = int(port)
    user = self.ds_config.get("user", None)
    kerb_ticket = self.ds_config.get("kerb_ticket", None)
    extra_conf = self.ds_config.get("extra_conf", None)
    return fs.HadoopFileSystem(host=host, port=port, user=user, kerb_ticket=kerb_ticket, extra_conf=extra_conf)
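If a shared helper like the one sketched under the review summary were adopted, this method could shrink to a thin wrapper. The helper's module path is an assumption, not code from this PR:

    def _create_hdfs_fs(self):
        # Delegate to the shared helper instead of rebuilding the filesystem here.
        from data_juicer.utils.fs_utils import create_pyarrow_hdfs_filesystem  # hypothetical module

        return create_pyarrow_hdfs_filesystem(self.ds_config)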
# from data_juicer.utils.s3_utils import get_aws_credentials
from data_juicer.utils.model_utils import filter_arguments

# s3_config = {}
# if "s3.access_key_id" in catalog_kwargs:
#     s3_config["aws_access_key_id"] = catalog_kwargs.pop("s3.access_key_id")
# if "s3.secret_access_key" in catalog_kwargs:
#     s3_config["aws_secret_access_key"] = catalog_kwargs.pop("s3.secret_access_key")
# if "s3.session_token" in catalog_kwargs:
#     s3_config["aws_session_token"] = catalog_kwargs.pop("s3.session_token")
# if "s3.region" in catalog_kwargs:
#     s3_config["aws_region"] = catalog_kwargs.pop("s3.region")
# if "s3.endpoint" in catalog_kwargs:
#     s3_config["endpoint_url"] = catalog_kwargs.pop("s3.endpoint")
# aws_access_key_id, aws_secret_access_key, aws_session_token, aws_region = get_aws_credentials(s3_config)
HYLcool left a comment:
It's better to add new test cases for the new features in the corresponding test files.
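As a starting point, a sketch of one such test is shown below. The strategy class name, its import path, its constructor signature, and the exception types are assumptions for illustration; the real tests should target the classes this PR actually adds.

    import unittest

    # Hypothetical names: the strategy class and its import path are assumptions,
    # not taken from this PR.
    from data_juicer.core.data.load_strategy import HudiDataLoadStrategy  # assumed


    class HudiLoadStrategyConfigTest(unittest.TestCase):
        def test_missing_table_uri_is_rejected(self):
            # "table_uri" is declared as a required field, so an empty config
            # should fail validation instead of deferring the error to load time.
            with self.assertRaises((ValueError, KeyError)):
                HudiDataLoadStrategy(ds_config={})

        def test_table_uri_type_is_checked(self):
            # "field_types" maps table_uri to str, so a non-string value should be rejected.
            with self.assertRaises((TypeError, ValueError)):
                HudiDataLoadStrategy(ds_config={"table_uri": 123})


    if __name__ == "__main__":
        unittest.main()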
if path.startswith("s3://"):
    validate_s3_path(path)

    s3_keys = ["aws_access_key_id", "aws_secret_access_key", "aws_session_token", "aws_region", "endpoint_url"]
    s3_conf = {k: args.pop(k) for k in s3_keys if k in args}
    fs = create_pyarrow_s3_filesystem(s3_conf)
    logger.info(f"Detected S3 export path: {path}. S3 filesystem configured.")

elif path.startswith("hdfs://"):
    import pyarrow.fs as pa_fs

    hdfs_keys = ["host", "port", "user", "kerb_ticket", "extra_conf"]
    hdfs_conf = {k: args.pop(k) for k in hdfs_keys if k in args}
    if "port" in hdfs_conf:
        hdfs_conf["port"] = int(hdfs_conf["port"])
    fs = pa_fs.HadoopFileSystem(**hdfs_conf)
    logger.info(f"Detected HDFS export path: {path}. HDFS filesystem configured.")
Add an extra else branch to raise a warning or error for an unsupported prefix.
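One way to implement this, sketched against the variable names in the snippet above and assuming the prefix dispatch lives in create_filesystem_from_args (the helper the exporter calls later in this diff); whether to warn or raise is a design choice for the maintainers:

    from loguru import logger  # matches the logger usage in the snippet above


    def create_filesystem_from_args(path: str, args: dict):
        """Sketch of the scheme dispatch with an explicit fallback branch."""
        fs = None
        if path.startswith("s3://"):
            fs = ...  # build the S3 filesystem as in the diff above
        elif path.startswith("hdfs://"):
            fs = ...  # build the HDFS filesystem as in the diff above
        else:
            # Unsupported or local scheme: say so explicitly instead of silently
            # returning no filesystem; raise ValueError instead if such paths are invalid here.
            logger.warning(
                f"No remote filesystem configured for export path {path!r}; "
                "only s3:// and hdfs:// prefixes are handled."
            )
        return fs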
self.s3_filesystem = create_pyarrow_s3_filesystem(s3_config)
logger.info(f"Detected S3 export path: {export_path}. S3 filesystem configured.")
fs_args = copy.deepcopy(self.export_extra_args)
self.fs = create_filesystem_from_args(export_path, fs_args)
Checking if the returned fs is None might be necessary.
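A small guard along these lines would cover it; the helper name is hypothetical and the surrounding exporter code is abbreviated:

    def _ensure_filesystem(fs, export_path: str):
        # Hypothetical guard: fail fast when a remote export path did not yield
        # a filesystem object from create_filesystem_from_args.
        if fs is None and export_path.startswith(("s3://", "hdfs://")):
            raise ValueError(f"No filesystem could be configured for remote export path {export_path!r}.")
        return fs

The exporter could call it right after building the filesystem, e.g. self.fs = _ensure_filesystem(self.fs, export_path).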
Original PR description: As the title says.