
Fix HDFS extra configurations in native writer #542

Draft
yangzhg wants to merge 1 commit into bytedance:main from yangzhg:fix/hdfs-replication-extra-conf

@yangzhg (Collaborator) commented May 8, 2026

What problem does this PR solve?

Issue Number: close #541

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

This change adds explicit HDFS open-file session options for the native HDFS path:

  • bolt.io.file.buffer.size
  • bolt.dfs.replication
  • bolt.dfs.blocksize

Hive reads and writes now copy these values from connector session properties into FileOptions. The HDFS FileSystem consumes them when opening files and passes them directly to hdfsOpenFile as bufferSize, replication, and blockSize.

The write sink path now forwards FileSink::Options::fileOptions to openFileForWrite, allowing per-session HDFS write options to reach the native writer. The filesystem cache key is unchanged, since these options affect individual file-open calls rather than the cached HDFS client connection.
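
To illustrate why the cache key can stay the same, here is a minimal sketch in which the client cache is keyed only by endpoint while the open options travel with each open call. `HdfsClient`, `OpenOptions`, `ClientCache`, `OpenedFile`, and `openForWrite` are hypothetical names invented for this example; they are not Bolt's actual classes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a connected HDFS client.
struct HdfsClient {
  std::string endpoint;
};

// Hypothetical per-open settings; deliberately not part of the cache key.
struct OpenOptions {
  int bufferSize = 0;
  short replication = 0;
  int blockSize = 0;
};

class ClientCache {
 public:
  // Returns the cached client for `endpoint`, creating it on first use.
  std::shared_ptr<HdfsClient> get(const std::string& endpoint) {
    assert(!endpoint.empty());
    auto& slot = clients_[endpoint];
    if (!slot) {
      slot = std::make_shared<HdfsClient>(HdfsClient{endpoint});
    }
    return slot;
  }

  std::size_t size() const { return clients_.size(); }

 private:
  std::map<std::string, std::shared_ptr<HdfsClient>> clients_;
};

// Hypothetical result of an open call: the shared client plus this call's options.
struct OpenedFile {
  std::shared_ptr<HdfsClient> client;
  OpenOptions options;
};

// The options affect only this open call; the client lookup ignores them.
inline OpenedFile openForWrite(
    ClientCache& cache, const std::string& endpoint, const OpenOptions& options) {
  return OpenedFile{cache.get(endpoint), options};
}
```

In this sketch, two opens against the same endpoint with different replication values share one cached client, which mirrors the reasoning in the paragraph above.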

When the options are not set, Bolt continues to pass 0 for these arguments, preserving the existing libhdfs default behavior.
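
The mapping from session properties to open arguments might look roughly like the sketch below. It assumes the properties arrive as a string-to-string map and that a missing key means 0. `HdfsOpenOptions`, `SessionProperties`, `valueOrZero`, and `makeOpenOptions` are hypothetical names for illustration, not the PR's code. The field types mirror the `int bufferSize`, `short replication`, and 32-bit `blocksize` parameters of libhdfs `hdfsOpenFile`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the per-open HDFS settings; not Bolt's real FileOptions.
struct HdfsOpenOptions {
  int bufferSize = 0;            // 0 leaves the choice to libhdfs
  short replication = 0;         // 0 leaves the choice to libhdfs
  std::int32_t blockSize = 0;    // 0 leaves the choice to libhdfs
};

using SessionProperties = std::unordered_map<std::string, std::string>;

// Returns the integer stored under `key`, or 0 when the key is absent.
// std::stoll throws if the stored text is not a number.
inline long long valueOrZero(const SessionProperties& props, const std::string& key) {
  auto it = props.find(key);
  if (it == props.end()) {
    return 0;
  }
  long long value = std::stoll(it->second);
  assert(value >= 0);  // sketch-level precondition: negative sizes make no sense
  return value;
}

// Copies the three session properties named in the description into the options.
inline HdfsOpenOptions makeOpenOptions(const SessionProperties& props) {
  HdfsOpenOptions options;
  options.bufferSize = static_cast<int>(valueOrZero(props, "bolt.io.file.buffer.size"));
  options.replication = static_cast<short>(valueOrZero(props, "bolt.dfs.replication"));
  options.blockSize = static_cast<std::int32_t>(valueOrZero(props, "bolt.dfs.blocksize"));
  return options;
}
```

With an empty property map, all three fields stay at 0, matching the unset behavior described above. A caller would then pass the fields to `hdfsOpenFile` in the order bufferSize, replication, blockSize.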

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Release Note:
- Added session-level HDFS open-file options for the native HDFS path: `bolt.io.file.buffer.size`, `bolt.dfs.replication`, and `bolt.dfs.blocksize`. These options are applied per file open and preserve existing default behavior when unset.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)


Copilot AI review requested due to automatic review settings May 8, 2026 07:56
Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes Hive native HDFS writer behavior so HDFS client “extra configurations” provided via connector/session properties are applied when creating libhdfs clients, and ensures the HDFS filesystem cache is isolated across different effective extra-configuration sets.

Changes:

  • Added hive.hdfs-extra-configurations as a Hive connector property and applied these key/value pairs to the libhdfs builder via BuilderConfSetStr before connecting.
  • Updated the HDFS filesystem cache key to include a normalized digest of the effective extra configurations to avoid cross-contamination between clients with different builder configs.
  • Merged session-level hive.hdfs-extra-configurations into the connector properties used for writer FileSink creation; added unit tests for parsing/cache-key behavior.
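
The cache-key idea in the second bullet can be sketched as follows: sort the key/value pairs so ordering does not matter, hash them, and append only the digest to the endpoint so raw values never appear in the key. This is an illustration under assumptions, not the PR's implementation. `ConfigPairs`, `fnv1a`, `configDigest`, and `cacheKey` are hypothetical names, and 64-bit FNV-1a is swapped in here simply because it is deterministic and short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using ConfigPairs = std::vector<std::pair<std::string, std::string>>;

constexpr std::uint64_t kFnvOffsetBasis = 14695981039346656037ULL;
constexpr std::uint64_t kFnvPrime = 1099511628211ULL;

// 64-bit FNV-1a over `data`, continuing from `hash`.
inline std::uint64_t fnv1a(const std::string& data, std::uint64_t hash) {
  for (unsigned char c : data) {
    hash ^= c;
    hash *= kFnvPrime;
  }
  return hash;
}

// Order-insensitive digest: the pairs are sorted before hashing.
inline std::uint64_t configDigest(ConfigPairs pairs) {
  std::sort(pairs.begin(), pairs.end());
  std::uint64_t hash = kFnvOffsetBasis;
  for (const auto& kv : pairs) {
    // Length prefixes keep ("ab", "c") distinct from ("a", "bc").
    hash = fnv1a(std::to_string(kv.first.size()) + ":" + kv.first, hash);
    hash = fnv1a(std::to_string(kv.second.size()) + ":" + kv.second, hash);
  }
  return hash;
}

// Cache key made of the endpoint plus the digest; raw config values are not embedded.
inline std::string cacheKey(const std::string& endpoint, const ConfigPairs& pairs) {
  assert(!endpoint.empty());
  return endpoint + "#" + std::to_string(configDigest(pairs));
}
```

Two configuration sets that differ only in pair order produce the same key, while changing any value produces a different digest and therefore a separate cache entry.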

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| bolt/connectors/hive/storage_adapters/hdfs/tests/HdfsUtilTest.cpp | Adds tests for parsing extra configurations and for cache-key digest normalization/non-leakage of raw values. |
| bolt/connectors/hive/storage_adapters/hdfs/RegisterHdfsFileSystem.cpp | Switches the global HDFS filesystem cache to key by (endpoint identity + extra-config digest). |
| bolt/connectors/hive/storage_adapters/hdfs/HdfsUtil.h | Introduces helpers to parse extra configs and compute a normalized fingerprint/digest + cache key. |
| bolt/connectors/hive/storage_adapters/hdfs/HdfsFileSystem.cpp | Applies parsed extra configurations to the libhdfs builder before connecting. |
| bolt/connectors/hive/HiveDataSink.cpp | Overrides connector properties with session hive.hdfs-extra-configurations for writer sink creation. |
| bolt/connectors/hive/HiveConfig.h | Defines the new connector property key constant and documents its encoding format. |


Comment thread bolt/connectors/hive/storage_adapters/hdfs/HdfsUtil.h Outdated
Comment thread bolt/connectors/hive/storage_adapters/hdfs/RegisterHdfsFileSystem.cpp Outdated
Comment thread bolt/connectors/hive/HiveDataSink.cpp Outdated
@yangzhg (Collaborator, Author) commented May 8, 2026

@codex review

@yangzhg yangzhg marked this pull request as draft May 8, 2026 10:41
@yangzhg yangzhg force-pushed the fix/hdfs-replication-extra-conf branch from e4272e5 to 87aad92 Compare May 11, 2026 03:25

Development

Successfully merging this pull request may close these issues.

[Bug] Native HDFS writer does not honor HDFS client extra configurations

2 participants