
Conversation

@pettyjamesm (Member) commented Dec 8, 2025

Description

Extends the SIMD support in block encoding introduced by #26919 by adding SIMD logic for deserializing block values with nulls present.

This requires splitting the SIMD support detection logic for compress and expand operations from one another, since the ARM intrinsics for expand only land in JDK 26+ as part of openjdk/jdk#26740.

This PR also enables vectorized serialization on Graviton 4 CPUs, which previously would not have it enabled because they only have 128-bit vector registers.
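
As a rough illustration of the approach (not the PR's actual code, and with made-up class and method names), the Vector API's expand operation lets the densely packed non-null values be scattered back to their original positions under a mask built from the null flags:

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Illustrative sketch only; the PR expands in place, this version writes to a separate array.
final class ExpandWithNullsSketch
{
    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // packedNonNulls holds the densely packed non-null values read from the wire;
    // valueIsNull marks which output positions are null.
    static int[] expand(int[] packedNonNulls, boolean[] valueIsNull)
    {
        int positionCount = valueIsNull.length;
        int[] expanded = new int[positionCount];
        int nonNullIndex = 0;
        int position = 0;
        // Vectorized body: each step must be able to load a full vector of packed values
        for (; position <= positionCount - SPECIES.length()
                && nonNullIndex + SPECIES.length() <= packedNonNulls.length;
                position += SPECIES.length()) {
            // mask is set where the output position is NOT null
            VectorMask<Integer> nonNull = VectorMask.fromArray(SPECIES, valueIsNull, position).not();
            IntVector.fromArray(SPECIES, packedNonNulls, nonNullIndex)
                    .expand(nonNull)                  // scatter packed lanes to non-null positions, zeros elsewhere
                    .intoArray(expanded, position);
            nonNullIndex += nonNull.trueCount();      // consume only as many packed values as were placed
        }
        // Scalar tail for the remaining positions
        for (; position < positionCount; position++) {
            if (!valueIsNull[position]) {
                expanded[position] = packedNonNulls[nonNullIndex++];
            }
        }
        return expanded;
    }
}
```

The in-place version in this PR additionally has to guard against the write pointer catching up with the read pointer, which is why its loop bound looks different from a plain SPECIES.loopBound (see the review discussion further down). Running the sketch requires --add-modules jdk.incubator.vector, as in the benchmark command below.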

Summary of Changes for Common CPUs:

  • amd64 with AVX512F only (e.g.: Intel Skylake): Vectorized deserialization of int and long enabled
  • amd64 with AVX512F and AVX512VBMI2 (e.g.: Intel Sapphire Rapids, AMD Zen 4 and 5): Vectorized deserialization enabled for byte, short, int, and long
  • aarch64 with NEON support only (e.g.: Graviton 1 and 2 or Apple Silicon): No change
  • aarch64 with SVE 1 support only (e.g.: Graviton 3): No change
  • aarch64 with SVE 2 support (e.g.: Graviton 4): Vectorized serialization of byte, short, and int (but not long) enabled. Vectorized deserialization of int enabled.

Benchmarks

Switched to throughput (full benchmark iterations per second) instead of nanoseconds per position to avoid rounding issues in the benchmark UI. Benchmark command used:
./mvnw exec:exec -pl :trino-main -Dair.check.skip-all=true -Dexec.classpathScope=test -Dexec.executable="java" -Dexec.args="-cp %classpath org.openjdk.jmh.Main -tu s -opi 1 -bm thrpt -f 1 -wi 10 -i 5 -rf json -rff ./deserialize_all_vectorized_throughput.json -jvmArgsPrepend '-Xms8g -Xmx8g -XX:+AlwaysPreTouch -XX:+EnableDynamicAgentLoading --add-modules jdk.incubator.vector -XX:ReservedCodeCacheSize=256M -Djdk.attach.allowAttachSelf' io.trino.execution.buffer.BenchmarkBlockSerde.deserialize"

Intel Skylake (c5 instance): BenchmarkBlockSerde.deserialize*

  • Only supports AVX512F, which enables vectorized deserialization for int and long, but not short or byte
  • Minor to significant speedups for long and integer array deserialization with nulls after special-casing null-heavy handling.

Intel Sapphire Rapids (c7i instance): BenchmarkBlockSerde.deserialize*

  • Supports AVX512F and AVX512VBMI2, which enables vectorized deserialization for int, long, short, and byte
  • Significant deserialization speedups for all vectorized types, up to 3.5x throughput increase

Graviton 4 (c8g instance): BenchmarkBlockSerde

  • Enabled support for vectorized serialization for byte, short, and int but not long
  • Enabled support for vectorized deserialization of int only
  • Significant speedups for all vectorized operations, with a slight regression for int deserialization at the 99% null rate

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## General
* Extend experimental performance improvements for remote data exchanges on newer CPU architectures ({issue}`27586`)

@cla-bot cla-bot bot added the cla-signed label Dec 8, 2025
@pettyjamesm pettyjamesm force-pushed the vectorize-block-deserialize branch 4 times, most recently from e5ce7fe to 47bf928 on December 8, 2025 21:20
@pettyjamesm pettyjamesm changed the title from "Vectorize block deserialize" to "Vectorize block deserialization" on Dec 8, 2025

starburstdata-automation commented Dec 10, 2025

Started benchmark workflow for this PR with test type = iceberg/sf1000_parquet_part.

Building Trino finished with status: success
Benchmark finished with status: failure
Comparing results to the static baseline values, follow above workflow link for more details/logs.
Status message: Found regressions for: (presto/tpcds, q09, totalCpuTime, over by 14.1%)
Benchmark Comparison to the closest run from Master: Report


starburstdata-automation commented Dec 10, 2025

Started benchmark workflow for this PR with test type = iceberg/sf1000_parquet_unpart.

Building Trino finished with status: success
Benchmark finished with status: success
Comparing results to the static baseline values, follow above workflow link for more details/logs.
Status message: NO Regression found.
Benchmark Comparison to the closest run from Master: Report

@pettyjamesm pettyjamesm force-pushed the vectorize-block-deserialize branch from 47bf928 to 20f372c on December 10, 2025 15:05
@pettyjamesm (Member, Author) commented:

The regression on TPCDS q09 seems like an ongoing issue, since it keeps getting flagged on multiple PRs. @raunaqmorarka - did we ever figure out when that regression was introduced?

@pettyjamesm pettyjamesm force-pushed the vectorize-block-deserialize branch 3 times, most recently from 40b89e2 to 2339568 on December 10, 2025 21:06

private static IntArrayBlock expandIntsWithNullsVectorized(SliceInput sliceInput, int positionCount, boolean[] valueIsNull)
A reviewer (Member) commented:

Probably we want to put expandIntsWithNullsVectorized and compactIntsWithNullsVectorized in the same place: either in EncoderUtil or in the encoders for individual types.

sliceInput.readBytes(values, nonNullIndex, nonNullPositionCount);

int position = 0;
for (; position < nonNullIndex && nonNullIndex + BYTE_SPECIES.length() < values.length; position += BYTE_SPECIES.length()) {
A reviewer (Member) commented:

Why don't you use BYTE_SPECIES.loopBound(values.length) here?

The author (Member) replied:

The check is a little different. For instance, SPECIES.loopBound(16) typically returns 16, which is a perfectly valid result on its own, but we're loading a whole vector of non-null values starting from nonNullIndex, which (see the sketch after this list):

  1. doesn't start from 0
  2. advances in arbitrary step sizes, not SPECIES.length() steps
  3. must be able to load a full vector from the current index
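
To make the distinction concrete, here is a hypothetical side-by-side of the two bound checks (illustrative names, not the PR's code):

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical illustration of the bound-check distinction; assumes valueIsNull.length == values.length.
final class LoopBoundSketch
{
    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    static void show(int[] values, boolean[] valueIsNull, int nonNullIndex)
    {
        // loopBound suffices when loads start at 0 and advance by SPECIES.length() each step:
        for (int i = 0; i < SPECIES.loopBound(values.length); i += SPECIES.length()) {
            IntVector.fromArray(SPECIES, values, i); // always in bounds
        }

        // The expansion loop instead loads full vectors at nonNullIndex, which starts past 0
        // and advances by a data-dependent trueCount(), so the bound must track that offset:
        for (int position = 0;
                position < nonNullIndex && nonNullIndex + SPECIES.length() <= values.length;
                position += SPECIES.length()) {
            IntVector.fromArray(SPECIES, values, nonNullIndex); // full-vector load at a moving offset
            nonNullIndex += VectorMask.fromArray(SPECIES, valueIsNull, position).not().trueCount();
        }
    }
}
```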

import static io.trino.spi.block.EncoderUtil.retrieveNullBits;
import static java.lang.System.arraycopy;

public class ByteArrayBlockEncoding
A reviewer (Member) commented:

Probably you can add tests to TestEncoderUtil for the expand code path.

@pettyjamesm pettyjamesm force-pushed the vectorize-block-deserialize branch 4 times, most recently from 836dc77 to 1c452f6 on December 11, 2025 20:57
Adds vectorized null expansion during deserialization for x86 CPUs and for
aarch64 CPUs with SVE 2 or higher support. JDK 25 currently has no
intrinsics to support "expand" on aarch64 CPUs with SVE 1 support only.
Previously, Graviton 4 CPUs would not enable vectorized null suppression
during serialization or deserialization because they only have 128-bit
vector registers. However, vectorizing block serialization and
deserialization is still beneficial on Graviton 4 instances when the
corresponding vector intrinsics are present, for all types except long
(which only processes 2 values per instruction at 128-bit widths).

This PR enables vectorized serialization for byte, short, and int types,
but not long values.

This PR also enables vectorized deserialization for int values, but not
byte or short, since no intrinsics are present as of JDK 25.
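
For the serialization side mentioned above, a minimal sketch of compress-based null suppression, the mirror image of the expand step shown earlier; the names are illustrative and not taken from the PR:

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Illustrative sketch of null suppression (compress) on the serialization path.
final class CompressNullsSketch
{
    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // Packs the non-null entries of values to the front of output and returns how many were written.
    // output must be at least values.length long because intoArray writes a full vector per step.
    static int compressNonNulls(int[] values, boolean[] valueIsNull, int[] output)
    {
        int outputIndex = 0;
        int position = 0;
        for (; position < SPECIES.loopBound(values.length); position += SPECIES.length()) {
            VectorMask<Integer> nonNull = VectorMask.fromArray(SPECIES, valueIsNull, position).not();
            IntVector.fromArray(SPECIES, values, position)
                    .compress(nonNull)                // move non-null lanes to the front, zero the rest
                    .intoArray(output, outputIndex);  // zero lanes past trueCount() are overwritten later or ignored
            outputIndex += nonNull.trueCount();
        }
        // Scalar tail
        for (; position < values.length; position++) {
            if (!valueIsNull[position]) {
                output[outputIndex++] = values[position];
            }
        }
        return outputIndex;
    }
}
```
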
@pettyjamesm pettyjamesm force-pushed the vectorize-block-deserialize branch from 1c452f6 to 8766cf4 on December 11, 2025 21:51
@pettyjamesm pettyjamesm marked this pull request as ready for review December 11, 2025 22:57