-->
# BLAZE
[![test](https://github.com/blaze-init/blaze-rs/actions/workflows/rust.yml/badge.svg)](https://github.com/blaze-init/blaze-rs/actions/workflows/rust.yml)
[![test](https://github.com/blaze-init/blaze-rs/actions/workflows/tpcds.yml/badge.svg)](https://github.com/blaze-init/blaze-rs/actions/workflows/tpcds.yml)

The Blaze accelerator for Apache Spark leverages native vectorized execution to accelerate query processing. It combines
the power of the [Apache Arrow-DataFusion](https://arrow.apache.org/datafusion/) library and the scale of the Spark distributed
computing framework. Blaze takes a fully optimized physical plan from Spark, maps it to DataFusion's execution plan, and performs native
plan computation in Spark executors.

Blaze is composed of the following high-level components:

- **Spark Extension**: hooks the whole accelerator into the Spark execution lifetime.
- **Spark Shims**: specialized code for different versions of Spark.
- **Native Engine**: implements the native engine in Rust, including:
  - ExecutionPlan protobuf specification
  - JNI gateway
  - Customized operators, expressions, and functions

Based on the inherent well-defined extensibility of DataFusion, Blaze can be easily extended to support:

## Build from source

To build Blaze, please follow the steps below:

1. Install Rust

   The native execution lib is written in Rust, so you need to install Rust (nightly) first for
   compilation. We recommend using [rustup](https://rustup.rs/).

2. Install JDK+Maven

   Blaze has been well tested on JDK 8 and Maven 3.5, and should work fine with higher versions.

3. Check out the source code.

   ```shell
   git clone git@github.com:blaze-init/blaze.git
   cd blaze
   ```

4. Build the project.

   _Specify the shims package matching the Spark version you would like to run on._
   _You can build Blaze either in dev mode for debugging or in release mode to unlock the full potential of
   Blaze._

   ```shell
   SHIM=spark333 # or spark303
   MODE=release # or dev
   mvn package -P"${SHIM}" -P"${MODE}"
   ```

After the build is finished, a fat JAR package that contains all the dependencies will be generated in the `target`
directory.

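As a convenience, the `SHIM`/`MODE` pair above can be wrapped in a small guard script. The `validate_shim` helper below is purely illustrative (only the `spark303` and `spark333` profiles are named in this README) and is not part of the build itself:

```shell
# Sketch: validate the shim profile before invoking Maven.
# The script only echoes the command, so it stays side-effect free.
validate_shim() {
  case "$1" in
    spark303|spark333) echo "ok" ;;
    *) echo "unknown shim: $1" ;;
  esac
}

SHIM="${SHIM:-spark333}"
MODE="${MODE:-release}"

if [ "$(validate_shim "$SHIM")" = "ok" ]; then
  echo mvn package -P"${SHIM}" -P"${MODE}"
else
  validate_shim "$SHIM" >&2
fi
```

Replace the final `echo` with a real `mvn` invocation once the shim name matches your Spark deployment.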
## Build with docker

_You can use the following command to build a centos-7 compatible release:_

```shell
SHIM=spark333 MODE=release ./release-docker.sh
```

## Run Spark Job with Blaze Accelerator

This section describes how to submit and configure a Spark Job with Blaze support.

1. Move the Blaze JAR package to the Spark client classpath (normally `spark-xx.xx.xx/jars/`).

2. Add the following confs to the Spark configuration in `spark-xx.xx.xx/conf/spark-defaults.conf`:

   ```properties
   spark.sql.extensions org.apache.spark.sql.blaze.BlazeSparkSessionExtension
   spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager

   # other Blaze confs are defined in spark-extension/src/main/java/org/apache/spark/sql/blaze/BlazeConf.java
   ```

3. Submit a query with spark-sql, or other tools like spark-thriftserver:

   ```shell
   spark-sql -f tpcds/q01.sql
   ```

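If editing `spark-defaults.conf` is not an option (for example on a shared client), the same two settings from step 2 can instead be passed per job on the command line. The invocation below is a sketch that assumes a standard `$SPARK_HOME` layout:

```shell
# Per-job alternative to editing spark-defaults.conf:
# pass the Blaze extension and shuffle manager as --conf flags.
$SPARK_HOME/bin/spark-sql \
  --conf spark.sql.extensions=org.apache.spark.sql.blaze.BlazeSparkSessionExtension \
  --conf spark.shuffle.manager=org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager \
  -f tpcds/q01.sql
```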
## Performance

Check [Benchmark Results](./benchmark-results/20230925.md) with the latest date for the performance
comparison with vanilla Spark on a TPC-DS 1TB dataset. The benchmark results show that Blaze saved
~40% query time and ~45% cluster resources on average, with ~5x performance achieved in the best case (q06).
Stay tuned and join us for more upcoming thrilling numbers.

Query time:

![20230925-query-time](./benchmark-results/blaze-query-time-comparison-20230925.png)

Cluster resources:

![20230925-resources](./benchmark-results/blaze-executor-time-comparison-20230925.png)

We also encourage you to benchmark Blaze and share the results with us. 🤗

## Community