Commit 1803fbd

Author: zhangli20
Message: update docs
1 parent 3b48eaa

11 files changed: +304 −1135 lines

README.md

Lines changed: 44 additions & 58 deletions
@@ -15,9 +15,6 @@
 -->

 # BLAZE
-[![test](https://github.com/blaze-init/blaze-rs/actions/workflows/rust.yml/badge.svg)](https://github.com/blaze-init/blaze-rs/actions/workflows/rust.yml)
-[![test](https://github.com/blaze-init/blaze-rs/actions/workflows/tpcds.yml/badge.svg)](https://github.com/blaze-init/blaze-rs/actions/workflows/tpcds.yml)
-

 The Blaze accelerator for Apache Spark leverages native vectorized execution to accelerate query processing. It combines
 the power of the [Apache Arrow-DataFusion](https://arrow.apache.org/datafusion/) library and the scale of the Spark distributed
@@ -28,11 +25,12 @@ plan computation in Spark executors.

 Blaze is composed of the following high-level components:

-- **Blaze Spark Extension**: hooks the whole accelerator into Spark execution lifetime.
-- **Native Operators**: defines how each SparkPlan maps to its native execution counterparts.
-- **JNI Gateways**: passing data and control through JNI boundaries.
-- **Plan SerDe**: serialization and deserialization of DataFusion plan with protobuf.
-- **Columnarized Shuffle**: shuffle data file organized in Arrow-IPC format.
+- **Spark Extension**: hooks the whole accelerator into the Spark execution lifetime.
+- **Spark Shims**: specialized code for different versions of Spark.
+- **Native Engine**: implements the native engine in Rust, including:
+  - ExecutionPlan protobuf specification
+  - JNI gateway
+  - Customized operators, expressions, functions

 Based on the inherent well-defined extensibility of DataFusion, Blaze can be easily extended to support:
@@ -50,87 +48,75 @@ To build Blaze, please follow the steps below:

 1. Install Rust

-The underlying native execution lib, DataFusion, is written in Rust Lang. So you're required to install Rust first for
-compilation. We recommend you to use `rustup`.
+The native execution lib is written in Rust, so you are required to install Rust (nightly) first for
+compilation. We recommend using [rustup](https://rustup.rs/).

-```shell
-curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
-```
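Note: the removed `curl` one-liner above is the standard rustup bootstrap. A minimal setup sketch for the nightly requirement added in this commit (assuming no project-pinned `rust-toolchain` file; if one exists, rustup selects the toolchain automatically):

```shell
# Install rustup via the official bootstrap script
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install a nightly toolchain and make it the default for building Blaze
rustup toolchain install nightly
rustup default nightly
```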
+2. Install JDK+Maven

-2. Check out the source code.
+Blaze is well tested on JDK 8 and Maven 3.5, and should work fine with higher versions.
+
+3. Check out the source code.

 ```shell
 git clone git@github.com:blaze-init/blaze.git
 cd blaze
 ```
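Before building, a quick sanity check of the toolchain requirements from steps 1 and 2 (output shapes vary by platform; the versions in the comments are the tested baseline):

```shell
java -version     # expect 1.8.x (JDK 8) or higher
mvn -version      # expect 3.5.x or higher
rustup show       # a nightly toolchain should be listed as active
```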

-3. Build the project.
+4. Build the project.

-_You could either build Blaze in debug mode for testing purposes or in release mode to unlock the full potential of
+_Specify the shims package for the Spark version you would like to run on._
+_You could either build Blaze in dev mode for debugging or in release mode to unlock the full potential of
 Blaze._

 ```shell
-./gradlew -Pmode=[dev|release-lto] build
+SHIM=spark333 # or spark303
+MODE=release # or dev
+mvn package -P"${SHIM}" -P"${MODE}"
 ```

 After the build is finished, a fat Jar package that contains all the dependencies will be generated in the `target`
 directory.
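For example, you can confirm the artifact after the build finishes (a sketch; the exact jar name depends on the chosen shim and version):

```shell
ls target/*.jar
```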

-## Run Spark Job with Blaze Accelerator
-
-This section describes how to submit and configure a Spark Job with Blaze support.
-
-You could enable Blaze accelerator through:
+## Build with Docker

+_You can use the following command to build a CentOS 7 compatible release:_
 ```shell
-$SPARK_HOME/bin/spark-[sql|submit] \
---jars "/path/to/blaze-engine-1.0-SNAPSHOT.jar" \
---conf spark.sql.extensions=org.apache.spark.sql.blaze.BlazeSparkSessionExtension \
---conf spark.executor.extraClassPath="./blaze-engine-1.0-SNAPSHOT.jar" \
-.... # your original arguments goes here
+SHIM=spark333 MODE=release ./release-docker.sh
 ```

-At the same time, there are a series of configurations that you can use to control Blaze with more granularity.
-
-| Parameter | Default value | Description |
-|-----------|---------------|-------------|
-| spark.executor.memoryOverhead | executor.memory * 0.1 | The amount of non-heap memory to be allocated per executor. Blaze would use this part of memory. |
-| spark.blaze.memoryFraction | 0.6 | A fraction of the off-heap that Blaze could use during execution. |
-| spark.blaze.batchSize | 10000 | Batch size for vectorized execution. |
-| spark.blaze.enable.shuffle | true | If enabled, use native, Arrow-IPC based Shuffle. |
-| spark.blaze.enable.[scan,project,filter,sort,union,sortmergejoin] | true | If enabled, offload the corresponding operator to native engine. |
-
+## Run Spark Job with Blaze Accelerator

-## Performance
+This section describes how to submit and configure a Spark job with Blaze support.

-We periodically benchmark Blaze locally with a 1 TB TPC-DS Dataset to show our latest results and prevent unnoticed
-performance regressions. Check [Benchmark Results](./benchmark-results/tpc-ds.md) with the latest date for the performance
-comparison with vanilla Spark.
+1. Move the Blaze jar package to the Spark client classpath (normally `spark-xx.xx.xx/jars/`).

-Currently, you can expect up to a 2x performance boost, cutting resource consumption to 1/5 within several keystrokes.
-Stay tuned and join us for more upcoming thrilling numbers.
+2. Add the following confs to the Spark configuration in `spark-xx.xx.xx/conf/spark-defaults.conf`:
+```properties
+spark.sql.extensions org.apache.spark.sql.blaze.BlazeSparkSessionExtension
+spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager

-![20220522-memcost](./benchmark-results/blaze-prev-20220522.png)
+# other Blaze confs are defined in spark-extension/src/main/java/org/apache/spark/sql/blaze/BlazeConf.java
+```

-We also encourage you to benchmark Blaze locally and share the results with us. 🤗
+3. Submit a query with spark-sql, or other tools like spark-thriftserver:
+```shell
+spark-sql -f tpcds/q01.sql
+```
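If you would rather not edit `spark-defaults.conf` globally, the same two confs can be passed per job instead; a sketch using the settings above (this per-job style is an assumption, not part of the commit):

```shell
spark-sql \
  --conf spark.sql.extensions=org.apache.spark.sql.blaze.BlazeSparkSessionExtension \
  --conf spark.shuffle.manager=org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager \
  -f tpcds/q01.sql
```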

-## Roadmap
-### 1. Operators
+## Performance

-Currently, there are still several operators that we cannot execute natively:
-- Aggregate. Relies on https://github.com/apache/arrow-datafusion/issues/1570.
-- Join with an optional filter condition. Relies on https://github.com/apache/arrow-datafusion/issues/2509.
-- Broadcast HashJoin.
-- Window.
+Check the [Benchmark Results](./benchmark-results/20230925.md) with the latest date for the performance
+comparison with vanilla Spark on a 1 TB TPC-DS dataset. The benchmark results show that Blaze saves
+~40% query time and ~45% cluster resources on average, with ~5x performance achieved in the best case (q06).
+Stay tuned and join us for more upcoming thrilling numbers.

-### 2. Compressed Shuffle
+Query time:
+![20230925-query-time](./benchmark-results/blaze-query-time-comparison-20230925.png)

-We use segmented Arrow-IPC files to express shuffle data. If we could apply IPC compression,
-we would benefit more from Shuffle since columnar data would have a better compression ratio. Tracked in [#4](https://github.com/blaze-init/blaze/issues/4).
+Cluster resources:
+![20230925-resources](./benchmark-results/blaze-executor-time-comparison-20230925.png)

-### 3. UDF support
-We would like to have a high-performance JVM-UDF invocation framework that could utilize a great variety
-of the existing UDFs written in Spark/Hive language. They are not supported natively in Blaze at the moment.
+We also encourage you to benchmark Blaze and share the results with us. 🤗

 ## Community
