[SPARK-51001][SQL] Refine arrayEquals #49568

cyb70289 · 2025-01-20T06:17:19Z

This is a trivial change to replace the loop index from int to long. Surprisingly, microbenchmark shows more than double performance uplift.

Analysis

The hot loop of arrayEquals method is simplifed as below. Loop index i is defined as int, it's compared with length, which is a long, to determine if the loop should end.

public static boolean arrayEquals(
    Object leftBase, long leftOffset, Object rightBase, long rightOffset, final long length) {
  ......
  int i = 0;
  while (i <= length - 8) {
    if (Platform.getLong(leftBase, leftOffset + i) !=
        Platform.getLong(rightBase, rightOffset + i)) {
          return false;
    }
    i += 8;
  }
  ......
}

Strictly speaking, there's a code bug here. If length is greater than 2^31 + 8, this loop will never end because i as a 32 bit integer is at most 2^31 - 1. But compiler must consider this behaviour as intentional and generate code strictly match the logic. It prevents compiler from generating optimal code.

Defining loop index i as long corrects this issue. Besides more accurate code logic, JIT is able to optimize this code much more aggressively. From microbenchmark, this trivial change improves performance significantly on both Arm and x86 platforms.

Benchmark

Source code:
https://gist.github.com/cyb70289/258e261f388e22f47e4d961431786d1a

Result on Arm Neoverse N2:

Benchmark                             Mode  Cnt    Score   Error  Units
ArrayEqualsBenchmark.arrayEqualsInt   avgt   10  674.313 ± 0.213  ns/op
ArrayEqualsBenchmark.arrayEqualsLong  avgt   10  313.563 ± 2.338  ns/op

Result on Intel Cascake Lake:

Benchmark                             Mode  Cnt     Score   Error  Units
ArrayEqualsBenchmark.arrayEqualsInt   avgt   10  1130.695 ± 0.168  ns/op
ArrayEqualsBenchmark.arrayEqualsLong  avgt   10   461.979 ± 0.097  ns/op

Deep dive

Dive deep to the machine code level, we can see why the big gap. Listed below are arm64 assembly generated by Openjdk-17 C2 compiler.

For int i, the machine code is similar to source code, no deep optimization. Safepoint polling is expensive in this short loop.

// jit c2 machine code snippet
  0x0000ffff81ba8904:   mov        w15, wzr              // int i = 0
  0x0000ffff81ba8908:   nop
  0x0000ffff81ba890c:   nop
loop:
  0x0000ffff81ba8910:   ldr        x10, [x13, w15, sxtw] // Platform.getLong(leftBase, leftOffset + i)
  0x0000ffff81ba8914:   ldr        x14, [x12, w15, sxtw] // Platform.getLong(rightBase, rightOffset + i)
  0x0000ffff81ba8918:   cmp        x10, x14
  0x0000ffff81ba891c:   b.ne       0x0000ffff81ba899c    // return false if not equal
  0x0000ffff81ba8920:   ldr        x14, [x28, #848]      // x14 -> safepoint
  0x0000ffff81ba8924:   add        w15, w15, #0x8        // i += 8
  0x0000ffff81ba8928:   ldr        wzr, [x14]            // safepoint polling
  0x0000ffff81ba892c:   sxtw       x10, w15              // extend i to long
  0x0000ffff81ba8930:   cmp        x10, x11
  0x0000ffff81ba8934:   b.le       0x0000ffff81ba8910    // if (i <= length - 8) goto loop

For long i, JIT is able to do much more aggressive optimization. E.g, below code snippet unrolls the loop by four.

// jit c2 machine code snippet
unrolled_loop:
  0x0000ffff91de6fe0:   sxtw       x10, w7
  0x0000ffff91de6fe4:   add        x23, x22, x10
  0x0000ffff91de6fe8:   add        x24, x21, x10
  0x0000ffff91de6fec:   ldr        x13, [x23]          // unroll-1
  0x0000ffff91de6ff0:   ldr        x14, [x24]
  0x0000ffff91de6ff4:   cmp        x13, x14
  0x0000ffff91de6ff8:   b.ne       0x0000ffff91de70a8
  0x0000ffff91de6ffc:   ldr        x13, [x23, #8]      // unroll-2
  0x0000ffff91de7000:   ldr        x14, [x24, #8]
  0x0000ffff91de7004:   cmp        x13, x14
  0x0000ffff91de7008:   b.ne       0x0000ffff91de70b4
  0x0000ffff91de700c:   ldr        x13, [x23, #16]     // unroll-3
  0x0000ffff91de7010:   ldr        x14, [x24, #16]
  0x0000ffff91de7014:   cmp        x13, x14
  0x0000ffff91de7018:   b.ne       0x0000ffff91de70a4
  0x0000ffff91de701c:   ldr        x13, [x23, #24]     // unroll-4
  0x0000ffff91de7020:   ldr        x14, [x24, #24]
  0x0000ffff91de7024:   cmp        x13, x14
  0x0000ffff91de7028:   b.ne       0x0000ffff91de70b0
  0x0000ffff91de702c:   add        w7, w7, #0x20
  0x0000ffff91de7030:   cmp        w7, w11
  0x0000ffff91de7034:   b.lt       0x0000ffff91de6fe0

What changes were proposed in this pull request?

A trivial change to replace loop index i of method arrayEquals from int to long.

Why are the changes needed?

To improve performance and fix a possible bug.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

the-sakthi · 2025-01-23T17:42:55Z

LGTM

cyb70289 · 2025-01-26T08:26:41Z

hi @srowen , @Yikun , will you have look at this pr? thanks.

srowen · 2025-01-26T12:58:23Z

I think I might make a JIRA for this, just because it is touching an important method. Just for completeness.
But I buy the change, and that is an interesting analysis

This is a trivial change to replace the loop index from `int` to `long`. Surprisingly, microbenchmark shows more than double performance uplift. Analysis -------- The hot loop of `arrayEquals` method is simplifed as below. Loop index `i` is defined as `int`, it's compared with `length`, which is a `long`, to determine if the loop should end. ``` public static boolean arrayEquals( Object leftBase, long leftOffset, Object rightBase, long rightOffset, final long length) { ...... int i = 0; while (i <= length - 8) { if (Platform.getLong(leftBase, leftOffset + i) != Platform.getLong(rightBase, rightOffset + i)) { return false; } i += 8; } ...... } ``` Strictly speaking, there's a code bug here. If `length` is greater than 2^31 + 8, this loop will never end because `i` as a 32 bit integer is at most 2^31 - 1. But compiler must consider this behaviour as intentional and generate code strictly match the logic. It prevents compiler from generating optimal code. Defining loop index `i` as `long` corrects this issue. Besides more accurate code logic, JIT is able to optimize this code much more aggressively. From microbenchmark, this trivial change improves performance significantly on both Arm and x86 platforms. Benchmark --------- Source code: https://gist.github.com/cyb70289/258e261f388e22f47e4d961431786d1a Result on Arm Neoverse N2: Benchmark Mode Cnt Score Error Units ArrayEqualsBenchmark.arrayEqualsInt avgt 10 674.313 ± 0.213 ns/op ArrayEqualsBenchmark.arrayEqualsLong avgt 10 313.563 ± 2.338 ns/op Result on Intel Cascake Lake: Benchmark Mode Cnt Score Error Units ArrayEqualsBenchmark.arrayEqualsInt avgt 10 1130.695 ± 0.168 ns/op ArrayEqualsBenchmark.arrayEqualsLong avgt 10 461.979 ± 0.097 ns/op Deep dive --------- Dive deep to the machine code level, we can see why the big gap. Listed below are arm64 assembly generated by Openjdk-17 C2 compiler. For `int i`, the machine code is similar to source code, no deep optimization. Safepoint polling is expensive in this short loop. ``` // jit c2 machine code snippet 0x0000ffff81ba8904: mov w15, wzr // int i = 0 0x0000ffff81ba8908: nop 0x0000ffff81ba890c: nop loop: 0x0000ffff81ba8910: ldr x10, [x13, w15, sxtw] // Platform.getLong(leftBase, leftOffset + i) 0x0000ffff81ba8914: ldr x14, [x12, w15, sxtw] // Platform.getLong(rightBase, rightOffset + i) 0x0000ffff81ba8918: cmp x10, x14 0x0000ffff81ba891c: b.ne 0x0000ffff81ba899c // return false if not equal 0x0000ffff81ba8920: ldr x14, [x28, apache#848] // x14 -> safepoint 0x0000ffff81ba8924: add w15, w15, #0x8 // i += 8 0x0000ffff81ba8928: ldr wzr, [x14] // safepoint polling 0x0000ffff81ba892c: sxtw x10, w15 // extend i to long 0x0000ffff81ba8930: cmp x10, x11 0x0000ffff81ba8934: b.le 0x0000ffff81ba8910 // if (i <= length - 8) goto loop ``` For `long i`, JIT is able to do much more aggressive optimization. E.g, below code snippet unrolls the loop by four. Safepoint polling does not impact much in this long loop. ``` // jit c2 machine code snippet unrolled_loop: 0x0000ffff91de6fe0: sxtw x10, w7 0x0000ffff91de6fe4: add x23, x22, x10 0x0000ffff91de6fe8: add x24, x21, x10 0x0000ffff91de6fec: ldr x13, [x23] // unroll-1 0x0000ffff91de6ff0: ldr x14, [x24] 0x0000ffff91de6ff4: cmp x13, x14 0x0000ffff91de6ff8: b.ne 0x0000ffff91de70a8 0x0000ffff91de6ffc: ldr x13, [x23, apache#8] // unroll-2 0x0000ffff91de7000: ldr x14, [x24, apache#8] 0x0000ffff91de7004: cmp x13, x14 0x0000ffff91de7008: b.ne 0x0000ffff91de70b4 0x0000ffff91de700c: ldr x13, [x23, apache#16] // unroll-3 0x0000ffff91de7010: ldr x14, [x24, apache#16] 0x0000ffff91de7014: cmp x13, x14 0x0000ffff91de7018: b.ne 0x0000ffff91de70a4 0x0000ffff91de701c: ldr x13, [x23, apache#24] // unroll-4 0x0000ffff91de7020: ldr x14, [x24, apache#24] 0x0000ffff91de7024: cmp x13, x14 0x0000ffff91de7028: b.ne 0x0000ffff91de70b0 0x0000ffff91de702c: add w7, w7, #0x20 0x0000ffff91de7030: cmp w7, w11 0x0000ffff91de7034: b.lt 0x0000ffff91de6fe0 ```

cyb70289 · 2025-01-27T02:31:53Z

Thanks @srowen , I've created a JIRA card.
https://issues.apache.org/jira/browse/SPARK-51001

srowen

Could anyone else review too and merge? I'm having trouble merging.

srowen · 2025-01-28T14:32:47Z

Nevermind I got it

github-actions bot added the SQL label Jan 20, 2025

cyb70289 changed the title ~~common/unsafe: refine arrayEquals~~ [MINOR] common/unsafe: refine arrayEquals Jan 20, 2025

cyb70289 force-pushed the arrayEquals branch from c1588dc to 8ca6eab Compare January 20, 2025 06:20

cyb70289 force-pushed the arrayEquals branch from 8ca6eab to 8a7f338 Compare January 27, 2025 02:30

cyb70289 changed the title ~~[MINOR] common/unsafe: refine arrayEquals~~ [SPARK-51001][SQL] Refine arrayEquals Jan 27, 2025

srowen approved these changes Jan 27, 2025

View reviewed changes

srowen closed this in 80f70a5 Jan 28, 2025

cyb70289 deleted the arrayEquals branch January 29, 2025 05:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-51001][SQL] Refine arrayEquals #49568

[SPARK-51001][SQL] Refine arrayEquals #49568

cyb70289 commented Jan 20, 2025

the-sakthi commented Jan 23, 2025

cyb70289 commented Jan 26, 2025

srowen commented Jan 26, 2025

cyb70289 commented Jan 27, 2025

srowen left a comment

srowen commented Jan 28, 2025

[SPARK-51001][SQL] Refine arrayEquals #49568

[SPARK-51001][SQL] Refine arrayEquals #49568

Conversation

cyb70289 commented Jan 20, 2025

Analysis

Benchmark

Deep dive

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

the-sakthi commented Jan 23, 2025

cyb70289 commented Jan 26, 2025

srowen commented Jan 26, 2025

cyb70289 commented Jan 27, 2025

srowen left a comment

Choose a reason for hiding this comment

srowen commented Jan 28, 2025