-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Vectorize block deserialization #27586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Vectorize block deserialization #27586
Conversation
e5ce7fe to
47bf928
Compare
|
Started benchmark workflow for this PR with test type =
|
|
Started benchmark workflow for this PR with test type =
|
47bf928 to
20f372c
Compare
|
The regression on TPCDS q09 seems like an ongoing issue since that keeps getting flagged on multiple PR's. @raunaqmorarka - did we ever figure out when that regression was introduced? |
40b89e2 to
2339568
Compare
| } | ||
| } | ||
|
|
||
| private static IntArrayBlock expandIntsWithNullsVectorized(SliceInput sliceInput, int positionCount, boolean[] valueIsNull) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we want to put the expandIntsWithNullsVectorized and compactIntsWithNullsVectorized in the same place place: Either in EncoderUtil or in the encoders for individual types.
| sliceInput.readBytes(values, nonNullIndex, nonNullPositionCount); | ||
|
|
||
| int position = 0; | ||
| for (; position < nonNullIndex && nonNullIndex + BYTE_SPECIES.length() < values.length; position += BYTE_SPECIES.length()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you use BYTE_SPECIES.loopBound(values.length) here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check is a little different. For instance, the following is a potentially valid result: SPECIES.loopBound(16) # => 16 is typical result- but we're loading a whole vector of non-null values starting from the nonNullIndex which:
- doesn't start from
0 - advances in arbitrary step sizes, not
SPECIES.length()steps - must be able to load a full vector from the current index
core/trino-spi/src/main/java/io/trino/spi/block/ByteArrayBlockEncoding.java
Show resolved
Hide resolved
core/trino-spi/src/main/java/io/trino/spi/block/IntArrayBlockEncoding.java
Show resolved
Hide resolved
| import static io.trino.spi.block.EncoderUtil.retrieveNullBits; | ||
| import static java.lang.System.arraycopy; | ||
|
|
||
| public class ByteArrayBlockEncoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably you cna add tests to TestEncoderUtil on the expand code path.
836dc77 to
1c452f6
Compare
Adds vectorized null expansion during deserialization for x86 and aarch64 CPUs with support for SVE 2 or higher. JDK 25 currently has no intrinsics to support "expand" on aarch64 CPUs with SVE 1 support only.
Previously, graviton 4 CPUs would not enable vectorized null suppression during serialization or deserialization because they only have 128-bit vector registers. However, vectorizing block serialization and deserialization is still beneficial on graviton 4 instances when the corresponding vector intrinsics are present for all types except for long (which only processes 2 values per instruction at 128 bit widths). This PR enables vectorized serialization for byte, short, and int types, but not long values. This PR also enabled vectorized deserialization for int values, but not byte or short since no intrinsics are present as of JDK 25.
1c452f6 to
8766cf4
Compare
Description
Extends the SIMD support in block encoding introduced by #26919 by adding SIMD logic for deserializing block values with nulls present.
This requires splitting the SIMD support detection logic for compress / expand operations from one another since ARM intrinsics for expand only land in JDK 26+ as part of openjdk/jdk#26740
This PR also enables vectorized serialization on graviton 4 CPU's that previously would not have it enabled due to only having 128-bit vector registers.
Summary of Changes for Common CPU's:
intandlongenabledbyte,short,int,longbyte,short, andint(but notlong) enabled. Vectorized deserialization ofintenabled.Benchmarks
Switched to throughput (full benchmark iterations per second) instead of nano seconds per position to avoid rounding issues in the benchmark UI. Benchmark command used:
./mvnw exec:exec -pl :trino-main -Dair.check.skip-all=true -Dexec.classpathScope=test -Dexec.executable="java" -Dexec.args="-cp %classpath org.openjdk.jmh.Main -tu s -opi 1 -bm thrpt -f 1 -wi 10 -i 5 -rf json -rff ./deserialize_all_vectorized_throughput.json -jvmArgsPrepend '-Xms8g -Xmx8g -XX:+AlwaysPreTouch -XX:+EnableDynamicAgentLoading --add-modules jdk.incubator.vector -XX:ReservedCodeCacheSize=256M -Djdk.attach.allowAttachSelf' io.trino.execution.buffer.BenchmarkBlockSerde.deserialize"Intel Skylake (c5 instance): BenchmarkBlockSerDe.deserialize*
Intel Sapphire Rapids (c7i instance): BenchmarkBlockSerDe.deserialize*
Graviton 4 (c8g instance): BenchmarkBlockSerDe
intRelease notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: