
Conversation

@pettyjamesm (Member) commented Dec 8, 2025

Description

Extends the SIMD support in block encoding introduced by #26919 by adding SIMD logic for deserializing block values with nulls present.

This requires splitting the SIMD support detection logic for compress and expand operations from one another, since the ARM intrinsics for expand only land in JDK 26+ as part of openjdk/jdk#26740.

This PR also enables vectorized serialization on Graviton 4 CPUs, which previously would not have it enabled because they only have 128-bit vector registers.
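
As a rough illustration of the approach (not the PR's actual code, and with made-up class and method names), the Vector API's expand operation lets the densely packed non-null values be scattered back to their original positions under a mask built from the null flags:

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Illustrative sketch only; the PR expands in place, this version writes to a separate array.
final class ExpandWithNullsSketch
{
    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // packedNonNulls holds the densely packed non-null values read from the wire;
    // valueIsNull marks which output positions are null.
    static int[] expand(int[] packedNonNulls, boolean[] valueIsNull)
    {
        int positionCount = valueIsNull.length;
        int[] expanded = new int[positionCount];
        int nonNullIndex = 0;
        int position = 0;
        // Vectorized body: each step must be able to load a full vector of packed values
        for (; position <= positionCount - SPECIES.length()
                && nonNullIndex + SPECIES.length() <= packedNonNulls.length;
                position += SPECIES.length()) {
            // mask is set where the output position is NOT null
            VectorMask<Integer> nonNull = VectorMask.fromArray(SPECIES, valueIsNull, position).not();
            IntVector.fromArray(SPECIES, packedNonNulls, nonNullIndex)
                    .expand(nonNull)                  // scatter packed lanes to non-null positions, zeros elsewhere
                    .intoArray(expanded, position);
            nonNullIndex += nonNull.trueCount();      // consume only as many packed values as were placed
        }
        // Scalar tail for the remaining positions
        for (; position < positionCount; position++) {
            if (!valueIsNull[position]) {
                expanded[position] = packedNonNulls[nonNullIndex++];
            }
        }
        return expanded;
    }
}
```

The in-place version in this PR additionally has to guard against the write pointer catching up with the read pointer, which is why its loop bound looks different from a plain SPECIES.loopBound (see the review discussion further down). Running the sketch requires --add-modules jdk.incubator.vector, as in the benchmark command below.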

Summary of Changes for Common CPUs:

  • amd64 with AVX512F only (e.g.: Intel Skylake): Vectorized deserialization of int and long enabled
  • amd64 with AVX512F and AVX512VBMI2 (e.g.: Intel Sapphire Rapids, AMD Zen 4 and 5): Vectorized deserialization enabled for byte, short, int, and long
  • aarch64 with NEON support only (e.g.: Graviton 1 and 2 or Apple Silicon): No change
  • aarch64 with SVE 1 support only (e.g.: Graviton 3): No change
  • aarch64 with SVE 2 support (e.g.: Graviton 4): Vectorized serialization of byte, short, and int (but not long) enabled. Vectorized deserialization of int enabled.

Benchmarks

Switched to throughput (full benchmark iterations per second) instead of nanoseconds per position to avoid rounding issues in the benchmark UI. Benchmark command used:
./mvnw exec:exec -pl :trino-main -Dair.check.skip-all=true -Dexec.classpathScope=test -Dexec.executable="java" -Dexec.args="-cp %classpath org.openjdk.jmh.Main -tu s -opi 1 -bm thrpt -f 1 -wi 10 -i 5 -rf json -rff ./deserialize_all_vectorized_throughput.json -jvmArgsPrepend '-Xms8g -Xmx8g -XX:+AlwaysPreTouch -XX:+EnableDynamicAgentLoading --add-modules jdk.incubator.vector -XX:ReservedCodeCacheSize=256M -Djdk.attach.allowAttachSelf' io.trino.execution.buffer.BenchmarkBlockSerde.deserialize"

Intel Skylake (c5 instance): BenchmarkBlockSerde.deserialize*

  • Only supports AVX512F, which enables vectorized deserialization for int and long, but not short or byte
  • Minor to significant speedups for long and integer array deserialization with nulls after special-casing null-heavy handling.

Intel Sapphire Rapids (c7i instance): BenchmarkBlockSerde.deserialize*

  • Supports AVX512F and AVX512VBMI2, which enables vectorized deserialization for int, long, short, and byte
  • Significant deserialization speedups for all vectorized types, up to 3.5x throughput increase

Graviton 4 (c8g instance): BenchmarkBlockSerde

  • Enabled support for vectorized serialization for byte, short, and int but not long
  • Enabled support for vectorized deserialization of int only
  • Significant speedups for all vectorized operations, with a slight regression for int deserialization at the 99% null rate

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## General
* Extend experimental performance improvements for remote data exchanges on newer CPU architectures ({issue}`27586`)

@cla-bot cla-bot bot added the cla-signed label Dec 8, 2025
@pettyjamesm pettyjamesm force-pushed the vectorize-block-deserialize branch 4 times, most recently from e5ce7fe to 47bf928 on December 8, 2025 21:20
@pettyjamesm pettyjamesm changed the title from "Vectorize block deserialize" to "Vectorize block deserialization" on Dec 8, 2025

starburstdata-automation commented Dec 10, 2025

Started benchmark workflow for this PR with test type = iceberg/sf1000_parquet_part.

Building Trino finished with status: success
Benchmark finished with status: failure
Comparing results to the static baseline values, follow above workflow link for more details/logs.
Status message: Found regressions for: (presto/tpcds, q09, totalCpuTime, over by 14.1%)
Benchmark Comparison to the closest run from Master: Report


starburstdata-automation commented Dec 10, 2025

Started benchmark workflow for this PR with test type = iceberg/sf1000_parquet_unpart.

Building Trino finished with status: success
Benchmark finished with status: success
Comparing results to the static baseline values, follow above workflow link for more details/logs.
Status message: NO Regression found.
Benchmark Comparison to the closest run from Master: Report

@pettyjamesm pettyjamesm force-pushed the vectorize-block-deserialize branch from 47bf928 to 20f372c on December 10, 2025 15:05
@pettyjamesm (Member, Author) commented:

The regression on TPCDS q09 seems like an ongoing issue, since it keeps getting flagged on multiple PRs. @raunaqmorarka - did we ever figure out when that regression was introduced?

@pettyjamesm pettyjamesm force-pushed the vectorize-block-deserialize branch 3 times, most recently from 40b89e2 to 2339568 on December 10, 2025 21:06

private static IntArrayBlock expandIntsWithNullsVectorized(SliceInput sliceInput, int positionCount, boolean[] valueIsNull)
A reviewer (Member) commented:

Probably we want to put expandIntsWithNullsVectorized and compactIntsWithNullsVectorized in the same place: either in EncoderUtil or in the encoders for individual types.

sliceInput.readBytes(values, nonNullIndex, nonNullPositionCount);

int position = 0;
for (; position < nonNullIndex && nonNullIndex + BYTE_SPECIES.length() < values.length; position += BYTE_SPECIES.length()) {
A reviewer (Member) commented:

Why don't you use BYTE_SPECIES.loopBound(values.length) here?

The author (Member) replied:

The check is a little different. For instance, SPECIES.loopBound(16) typically returns 16, which is a perfectly valid result on its own, but we're loading a whole vector of non-null values starting from nonNullIndex, which (see the sketch after this list):

  1. doesn't start from 0
  2. advances in arbitrary step sizes, not SPECIES.length() steps
  3. must be able to load a full vector from the current index
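
To make the distinction concrete, here is a hypothetical side-by-side of the two bound checks (illustrative names, not the PR's code):

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical illustration of the bound-check distinction; assumes valueIsNull.length == values.length.
final class LoopBoundSketch
{
    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    static void show(int[] values, boolean[] valueIsNull, int nonNullIndex)
    {
        // loopBound suffices when loads start at 0 and advance by SPECIES.length() each step:
        for (int i = 0; i < SPECIES.loopBound(values.length); i += SPECIES.length()) {
            IntVector.fromArray(SPECIES, values, i); // always in bounds
        }

        // The expansion loop instead loads full vectors at nonNullIndex, which starts past 0
        // and advances by a data-dependent trueCount(), so the bound must track that offset:
        for (int position = 0;
                position < nonNullIndex && nonNullIndex + SPECIES.length() <= values.length;
                position += SPECIES.length()) {
            IntVector.fromArray(SPECIES, values, nonNullIndex); // full-vector load at a moving offset
            nonNullIndex += VectorMask.fromArray(SPECIES, valueIsNull, position).not().trueCount();
        }
    }
}
```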

import static io.trino.spi.block.EncoderUtil.retrieveNullBits;
import static java.lang.System.arraycopy;

public class ByteArrayBlockEncoding
A reviewer (Member) commented:

Probably you can add tests to TestEncoderUtil for the expand code path.

@pettyjamesm pettyjamesm force-pushed the vectorize-block-deserialize branch 4 times, most recently from 836dc77 to 1c452f6 on December 11, 2025 20:57
Adds vectorized null expansion during deserialization for x86 CPUs and for
aarch64 CPUs with SVE 2 or higher support. JDK 25 currently has no
intrinsics to support "expand" on aarch64 CPUs with SVE 1 support only.
Previously, Graviton 4 CPUs would not enable vectorized null suppression
during serialization or deserialization because they only have 128-bit
vector registers. However, vectorizing block serialization and
deserialization is still beneficial on Graviton 4 instances when the
corresponding vector intrinsics are present, for all types except long
(which only processes 2 values per instruction at 128-bit widths).

This PR enables vectorized serialization for byte, short, and int types,
but not long values.

This PR also enables vectorized deserialization for int values, but not
byte or short, since no intrinsics are present as of JDK 25.
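
For the serialization side mentioned above, a minimal sketch of compress-based null suppression, the mirror image of the expand step shown earlier; the names are illustrative and not taken from the PR:

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Illustrative sketch of null suppression (compress) on the serialization path.
final class CompressNullsSketch
{
    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // Packs the non-null entries of values to the front of output and returns how many were written.
    // output must be at least values.length long because intoArray writes a full vector per step.
    static int compressNonNulls(int[] values, boolean[] valueIsNull, int[] output)
    {
        int outputIndex = 0;
        int position = 0;
        for (; position < SPECIES.loopBound(values.length); position += SPECIES.length()) {
            VectorMask<Integer> nonNull = VectorMask.fromArray(SPECIES, valueIsNull, position).not();
            IntVector.fromArray(SPECIES, values, position)
                    .compress(nonNull)                // move non-null lanes to the front, zero the rest
                    .intoArray(output, outputIndex);  // zero lanes past trueCount() are overwritten later or ignored
            outputIndex += nonNull.trueCount();
        }
        // Scalar tail
        for (; position < values.length; position++) {
            if (!valueIsNull[position]) {
                output[outputIndex++] = values[position];
            }
        }
        return outputIndex;
    }
}
```
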
@pettyjamesm pettyjamesm force-pushed the vectorize-block-deserialize branch from 1c452f6 to 8766cf4 on December 11, 2025 21:51
@pettyjamesm pettyjamesm marked this pull request as ready for review December 11, 2025 22:57