@Vishwanatha-HD
Contributor

@Vishwanatha-HD Vishwanatha-HD commented Nov 21, 2025

Rationale for this change

This PR enables Parquet DB support on big-endian (s390x) systems by fixing the "util/byte_stream_split_internal" logic.

What changes are included in this PR?

The fix changes the following file:
cpp/src/arrow/util/byte_stream_split_internal.h

Are these changes tested?

Yes. The changes were tested on s390x to confirm they work correctly, and on x86 to confirm they introduce no regressions.

Are there any user-facing changes?

No.

GitHub main Issue link: #48151

@github-actions

⚠️ GitHub issue #48216 has been automatically assigned in GitHub to PR creator.

@Vishwanatha-HD Vishwanatha-HD force-pushed the fixUtilByteStreamSpltIntrnl branch from 189bd42 to 1c499e4 Compare November 22, 2025 05:05
@kou kou changed the title GH-48216 Fix Util Byte Stream Split Internal logic to enable Parquet … GH-48216: [C++][Parquet] Fix Util Byte Stream Split Internal logic to enable Parquet DB support on s390x Nov 22, 2025
Comment on lines +333 to +337
#if !ARROW_LITTLE_ENDIAN
const int src_stream = width - 1 - stream;
#else
const int src_stream = stream;
#endif
Member

Could you use

#if ARROW_LITTLE_ENDIAN
  little endian code
#else
  big endian code
#endif

instead of

#if !ARROW_LITTLE_ENDIAN
  big endian code
#else
  little endian code
#endif

for readability? In general, code with fewer ! operators is easier to read.
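Applied to the snippet under review, the suggested ordering might look like the sketch below. The wrapper function `SourceStream` is a hypothetical name used only for illustration, and the fallback `#define` is a stand-in so the snippet compiles outside Arrow (inside Arrow the macro comes from the endian utilities).

```cpp
// Stand-in so this sketch compiles on its own; this fallback is an
// assumption and simply pretends the host is little-endian.
#ifndef ARROW_LITTLE_ENDIAN
#define ARROW_LITTLE_ENDIAN 1
#endif

// Hypothetical wrapper around the expression from the diff: maps an output
// stream index to the byte offset read from each value.
inline int SourceStream(int stream, int width) {
#if ARROW_LITTLE_ENDIAN
  // Little-endian host: byte k of a value already sits at offset k.
  return stream;
#else
  // Big-endian host: byte k of the little-endian form sits at the mirrored offset.
  return width - 1 - stream;
#endif
}
```

The positive branch comes first, so a reader sees the common little-endian case before the big-endian special case.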

@k8ika0s

k8ika0s commented Nov 23, 2025

@Vishwanatha-HD

Something I’ve seen on s390x is that ByteStreamSplit behaves most predictably when the data feeding into the split is already in a well-defined byte order before the interleaving happens. When values arrive in native order on BE, the shuffling pattern can produce different byte layouts than what downstream readers or stats logic expect on LE hosts.

Looking at this patch, the swap + reversed-stream approach inside DoSplitStreams makes sense mechanically. I was wondering, though, how this interacts with callers that assume the inputs are already LE-normalized. In particular, mixed Arrow/non-Arrow inputs sometimes reveal subtle differences because Arrow arrays tend to carry scalars in canonical LE format even on BE machines.

On the merge side, I’m also curious whether the current stream reversal covers the cases where BE decoding would otherwise lean on helpers that expect the shuffled bytes to correspond to LE-origin data.

Not raising any correctness objections here — just sharing a few behaviors I’ve run into while testing BSS more broadly on BE systems.
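For readers following the byte-order discussion, here is a minimal, self-contained sketch of a split and merge whose encoded bytes do not depend on host endianness. It is not the code in this PR and not Arrow's API: `HostIsLittleEndian`, `SplitStreamsSketch`, and `MergeStreamsSketch` are hypothetical names. It assumes the input values are in native byte order and that stream k should hold byte k of each value's little-endian representation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: true when the host stores the least significant byte first.
inline bool HostIsLittleEndian() {
  const std::uint16_t probe = 1;
  std::uint8_t first_byte = 0;
  std::memcpy(&first_byte, &probe, 1);
  return first_byte == 1;
}

// Split: `raw` holds num_values native-order values of `width` bytes each.
// Output stream k receives byte k of each value's little-endian form.
inline std::vector<std::uint8_t> SplitStreamsSketch(const std::uint8_t* raw,
                                                    std::size_t num_values,
                                                    std::size_t width) {
  std::vector<std::uint8_t> out(num_values * width);
  const bool le = HostIsLittleEndian();
  for (std::size_t stream = 0; stream < width; ++stream) {
    // On a big-endian host the little-endian byte k lives at the mirrored offset.
    const std::size_t src_byte = le ? stream : width - 1 - stream;
    for (std::size_t i = 0; i < num_values; ++i) {
      out[stream * num_values + i] = raw[i * width + src_byte];
    }
  }
  return out;
}

// Merge: inverse of the split, producing native-order values again.
inline std::vector<std::uint8_t> MergeStreamsSketch(const std::uint8_t* streams,
                                                    std::size_t num_values,
                                                    std::size_t width) {
  std::vector<std::uint8_t> raw(num_values * width);
  const bool le = HostIsLittleEndian();
  for (std::size_t stream = 0; stream < width; ++stream) {
    const std::size_t dst_byte = le ? stream : width - 1 - stream;
    for (std::size_t i = 0; i < num_values; ++i) {
      raw[i * width + dst_byte] = streams[stream * num_values + i];
    }
  }
  return raw;
}
```

Under these assumptions, a caller that passes already LE-normalized bytes on a big-endian host would have its bytes mirrored a second time, which is the kind of interaction the questions above are probing.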
