A Middle Layer for Offloading JVM-based SQL Engines' Execution to Native Engines
Apache Spark is a mature and stable project that has been under continuous development for many years. It is one of the most widely used frameworks for scaling out the processing of petabyte-scale datasets. Over time, the Spark community has had to address significant performance challenges, which required a variety of optimizations. A major milestone came with Spark 2.0, where Whole-Stage Code Generation replaced the Volcano Model, delivering up to a 2× speedup. Since then, most subsequent improvements have focused on the query plan level, while the performance of individual operators has almost stopped improving.
In recent years, several native SQL engines have been developed, such as ClickHouse and Velox. With features like native execution, columnar data formats, and vectorized data processing, these engines can outperform Spark’s JVM-based SQL engine. However, they currently don't directly support Spark SQL execution.
“Gluten” is Latin for "glue". The main goal of the Gluten project is to glue native engines to Spark SQL. Thus, we can benefit from the high performance of native engines and the high scalability enabled by the Spark ecosystem.
The basic design principle is to reuse Spark’s control flow, while offloading compute-intensive data processing to the native side. More specifically:
- Transform Spark's physical plan into a Substrait plan, then convert it into the native engine's plan.
- Offload performance-critical data processing to the native engine.
- Define clear JNI interfaces for native SQL engines (see the sketch after this list).
- Allow easy switching between available native backends.
- Reuse Spark’s distributed control flow.
- Manage data sharing between the JVM and native code.
- Provide extensibility to support more native engines.
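To make the JNI boundary concrete, below is a minimal sketch of what such an interface could look like. The class and method names are hypothetical and are not Gluten's actual API; the point is only that the JVM hands a serialized Substrait plan to the native side and then iterates over the columnar batches it produces, while all scheduling stays in Spark.

// Hypothetical sketch of a JNI bridge; these are not Gluten's real classes.
// The JVM serializes a Substrait plan, the native engine executes it, and the
// JVM pulls back columnar batches through opaque handles.
class NativePlanBridge {
  // Build a native pipeline from a serialized Substrait plan; returns a handle.
  @native def createExecutor(substraitPlan: Array[Byte]): Long

  // Advance the native pipeline; returns the address of the next columnar
  // batch, or -1 when the pipeline is exhausted.
  @native def nextBatch(executorHandle: Long): Long

  // Release all native resources owned by the pipeline.
  @native def close(executorHandle: Long): Unit
}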
Gluten's target users include anyone who wants to fundamentally accelerate Spark SQL. As a plugin to Spark, Gluten requires no changes to the DataFrame API or SQL queries; users only need to configure it correctly.
An overview of the architecture is as follows. Substrait provides a well-defined, cross-language specification for data compute operations. Spark's physical plan is transformed into a Substrait plan, which is then passed to the native side through a JNI call. On the native side, a chain of native operators is constructed and offloaded to the native engine. Gluten returns the results as a ColumnarBatch, and Spark's Columnar API (introduced in Spark 3.0) is used during execution. Gluten adopts the Apache Arrow data format as its underlying representation.
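Spark's Columnar API is the hook that makes this possible: a physical operator can declare supportsColumnar and return an RDD[ColumnarBatch] from doExecuteColumnar(). The simplified operator below is only a sketch of that integration point, not Gluten's actual implementation; a real offloaded operator would drive a native pipeline instead of passing through its child's batches.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}
import org.apache.spark.sql.vectorized.ColumnarBatch

// Simplified, hypothetical operator using Spark's Columnar API (Spark 3.2+).
case class NativeOffloadExec(child: SparkPlan) extends UnaryExecNode {
  override def output: Seq[Attribute] = child.output

  // Tell Spark this operator produces ColumnarBatch instead of InternalRow.
  override def supportsColumnar: Boolean = true

  // A real Gluten operator would build a Substrait plan here and return batches
  // produced by the native engine; this placeholder just forwards the child's
  // columnar output.
  override protected def doExecuteColumnar(): RDD[ColumnarBatch] =
    child.executeColumnar()

  override protected def doExecute(): RDD[InternalRow] =
    throw new UnsupportedOperationException("Row-based execution is not implemented in this sketch")

  // Required for Spark 3.2+ tree node copying.
  override protected def withNewChildInternal(newChild: SparkPlan): SparkPlan =
    copy(child = newChild)
}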
Currently, Gluten supports only the ClickHouse and Velox backends. Velox is a C++ database acceleration library which provides reusable, extensible and high-performance data processing components. In addition, Gluten is designed to be extensible, allowing support for additional backends in the future.

Gluten's key components:
- Query Plan Conversion: Converts Spark's physical plan into a Substrait plan.
- Unified Memory Management: Manages native memory allocation.
- Columnar Shuffle: Handles shuffling of Gluten's columnar data. The shuffle service of Spark core is reused, while a columnar exchange operator is implemented to support Gluten's columnar data format.
- Fallback Mechanism: Provides fallback to vanilla Spark for unsupported operators. Gluten's ColumnarToRow (C2R) and RowToColumnar (R2C) operators convert data between Gluten's columnar format and Spark's internal row format to support fallback transitions (see the sketch after this list).
- Metrics: Collected from the native engine to help monitor execution, identify bugs, and diagnose performance bottlenecks. The metrics are displayed in the Spark UI.
- Shim Layer: Ensures compatibility with multiple Spark versions. Gluten supports the latest 3–4 Spark releases during its development cycle, and currently supports Spark 3.2, 3.3, 3.4, and 3.5.
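A quick way to see offloading and fallback in practice is to inspect a query's physical plan. The snippet below is a rough sketch for spark-shell against a hypothetical TPC-H-style lineitem table; the substring checks are an assumption, since exact operator names depend on the backend and Spark version.

// Rough sketch: inspect the physical plan to see which operators were offloaded
// and whether any ColumnarToRow/RowToColumnar transitions (fallback boundaries)
// appear. The lineitem table and the operator-name substrings are assumptions.
val df = spark.sql(
  "SELECT l_orderkey, sum(l_extendedprice) FROM lineitem GROUP BY l_orderkey")
df.explain() // offloaded operators typically carry Gluten/Transformer-style names

val planText = df.queryExecution.executedPlan.toString
val fallbackHints = Seq("ColumnarToRow", "RowToColumnar").filter(planText.contains)
println(s"Fallback transition operators found: ${fallbackHints.mkString(", ")}")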
Below is a basic configuration to enable Gluten in Spark.
export GLUTEN_JAR=/PATH/TO/GLUTEN_JAR
spark-shell \
--master yarn --deploy-mode client \
--conf spark.plugins=org.apache.gluten.GlutenPlugin \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=20g \
--conf spark.driver.extraClassPath=${GLUTEN_JAR} \
--conf spark.executor.extraClassPath=${GLUTEN_JAR} \
--conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
...
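For applications that create their own SparkSession, the runtime-settable options above can also be supplied programmatically, as in the sketch below. The classpath entries (spark.driver.extraClassPath / spark.executor.extraClassPath) typically still have to be provided when the job is launched, for example via spark-submit, because the driver JVM is already running when this code executes.

import org.apache.spark.sql.SparkSession

// Sketch: mirroring the spark-shell flags above when building a SparkSession
// in an application. The Gluten JAR must still be put on the driver/executor
// classpath at launch time (e.g. via spark-submit).
val spark = SparkSession.builder()
  .appName("gluten-example")
  .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "20g")
  .config("spark.shuffle.manager",
    "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
  .getOrCreate()

spark.sql("SELECT 1").show()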
There are two ways to acquire the Gluten JAR for the above configuration.
- Use a released package: please download the tar package here, then extract the Gluten JAR from it. Additionally, Gluten provides nightly builds based on the main branch for early testing. The nightly build JARs are available at Apache Gluten Nightlies and have been verified on CentOS 7/8/9 and Ubuntu 20.04/22.04.
- Build from source: for the Velox backend, please refer to Velox.md and build-guide.md; for the ClickHouse backend, please refer to ClickHouse.md. After the build, the Gluten JAR is generated under /PATH/TO/GLUTEN/package/target/.
Common configurations used by Gluten are listed in Configuration.md. Velox-specific configurations are listed in velox-configuration.md.
The Gluten Velox backend honors some Spark configurations, ignores others, and many are transparent to it. See velox-spark-configuration.md for details, and velox-parquet-write-configuration.md for Parquet write configurations.
- Gluten website
- Velox repository
- ClickHouse repository
- Gluten Intro Video at Data AI Summit 2022
- Gluten Intro Article on Medium
- Gluten Intro Article on Kyligence.io (Chinese)
- Velox Intro from Meta
Contributions to the Gluten project are welcome! See CONTRIBUTING.md for guidelines on how to contribute.
Gluten successfully became an Apache Incubator project in March 2024. Here are several ways to connect with the community.
Feel free to report issues or start discussions on GitHub. Please search the existing GitHub issues before creating a new one to avoid duplicates.
For any technical discussions, please email [email protected]. You can browse the archives to view past discussions, or subscribe to the mailing list to receive updates.
Request an invitation to the ASF Slack workspace via this page; once invited, you can join the incubator-gluten channel. The ASF Slack login page is https://the-asf.slack.com/.
Please contact weitingchen at apache.org or zhangzc at apache.org to request an invitation to the WeChat group. It is for Chinese-language communication.
TPC-H is used to evaluate Gluten's performance. Please note that the results below do not reflect the latest performance.
The Gluten Velox backend demonstrated an overall speedup of 2.71x, with up to a 14.53x speedup observed in a single query.
Tested in June 2023. Test environment: a single node with 2 TB of data, using Spark 3.3.2 as the baseline and Gluten integrated into the same Spark version.
The ClickHouse backend demonstrated an average speedup of 2.12x, with up to a 3.48x speedup observed in a single query.
Test environment: an 8-node AWS cluster with 1 TB of data, using Spark 3.1.1 as the baseline and Gluten integrated into the same Spark version.
The Qualification Tool is a utility to analyze Spark event log files and assess the compatibility and performance of SQL workloads with Gluten. This tool helps users understand how their workloads can benefit from Gluten.
Gluten is licensed under the Apache License 2.0.
Gluten was initiated by Intel and Kyligence in 2022. Several other companies are also actively contributing to its development, including BIGO, Meituan, Alibaba Cloud, NetEase, Baidu, Microsoft, IBM, Google, etc.
* LEGAL NOTICE: Your use of this software and any required dependent software (the "Software Package") is subject to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party or open source software included in or with the Software Package, and your use indicates your acceptance of all such terms. Please refer to the "TPP.txt" or other similarly-named text file included with the Software Package for additional details.