Commit 1803fbd

Author: zhangli20
Message: update docs
1 parent 3b48eaa

11 files changed: +304 −1135 lines

README.md

Lines changed: 44 additions & 58 deletions
@@ -15,9 +15,6 @@
 -->

 # BLAZE
-[![test](https://github.com/blaze-init/blaze-rs/actions/workflows/rust.yml/badge.svg)](https://github.com/blaze-init/blaze-rs/actions/workflows/rust.yml)
-[![test](https://github.com/blaze-init/blaze-rs/actions/workflows/tpcds.yml/badge.svg)](https://github.com/blaze-init/blaze-rs/actions/workflows/tpcds.yml)
-

 The Blaze accelerator for Apache Spark leverages native vectorized execution to accelerate query processing. It combines
 the power of the [Apache Arrow-DataFusion](https://arrow.apache.org/datafusion/) library and the scale of the Spark distributed
@@ -28,11 +25,12 @@ plan computation in Spark executors.

 Blaze is composed of the following high-level components:

-- **Blaze Spark Extension**: hooks the whole accelerator into Spark execution lifetime.
-- **Native Operators**: defines how each SparkPlan maps to its native execution counterparts.
-- **JNI Gateways**: passing data and control through JNI boundaries.
-- **Plan SerDe**: serialization and deserialization of DataFusion plan with protobuf.
-- **Columnarized Shuffle**: shuffle data file organized in Arrow-IPC format.
+- **Spark Extension**: hooks the whole accelerator into the Spark execution lifetime.
+- **Spark Shims**: specialized code for different versions of Spark.
+- **Native Engine**: implements the native engine in Rust, including:
+  - ExecutionPlan protobuf specification
+  - JNI gateway
+  - Customized operators, expressions, functions

 Based on the inherent well-defined extensibility of DataFusion, Blaze can be easily extended to support:
@@ -50,87 +48,75 @@ To build Blaze, please follow the steps below:

 1. Install Rust

-The underlying native execution lib, DataFusion, is written in Rust Lang. So you're required to install Rust first for
-compilation. We recommend you to use `rustup`.
+The native execution lib is written in Rust, so you are required to install Rust (nightly) first for
+compilation. We recommend using [rustup](https://rustup.rs/).

-```shell
-curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
-```
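Note: the removed `curl` one-liner above is the standard rustup bootstrap. A minimal setup sketch for the nightly requirement added in this commit (assuming no project-pinned `rust-toolchain` file; if one exists, rustup selects the toolchain automatically):

```shell
# Install rustup via the official bootstrap script
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install a nightly toolchain and make it the default for building Blaze
rustup toolchain install nightly
rustup default nightly
```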
+2. Install JDK+Maven

-2. Check out the source code.
+Blaze is well tested on JDK 8 and Maven 3.5, and should work fine with higher versions.
+
+3. Check out the source code.

 ```shell
 git clone git@github.com:blaze-init/blaze.git
 cd blaze
 ```
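Before building, a quick sanity check of the toolchain requirements from steps 1 and 2 (output shapes vary by platform; the versions in the comments are the tested baseline):

```shell
java -version     # expect 1.8.x (JDK 8) or higher
mvn -version      # expect 3.5.x or higher
rustup show       # a nightly toolchain should be listed as active
```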

-3. Build the project.
+4. Build the project.

-_You could either build Blaze in debug mode for testing purposes or in release mode to unlock the full potential of
+_Specify the shims package for the Spark version you would like to run on._
+_You could either build Blaze in dev mode for debugging or in release mode to unlock the full potential of
 Blaze._

 ```shell
-./gradlew -Pmode=[dev|release-lto] build
+SHIM=spark333 # or spark303
+MODE=release # or dev
+mvn package -P"${SHIM}" -P"${MODE}"
 ```

 After the build is finished, a fat Jar package that contains all the dependencies will be generated in the `target`
 directory.
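For example, you can confirm the artifact after the build finishes (a sketch; the exact jar name depends on the chosen shim and version):

```shell
ls target/*.jar
```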

-## Run Spark Job with Blaze Accelerator
-
-This section describes how to submit and configure a Spark Job with Blaze support.
-
-You could enable Blaze accelerator through:
+## Build with Docker

+_You can use the following command to build a CentOS 7 compatible release:_
 ```shell
-$SPARK_HOME/bin/spark-[sql|submit] \
---jars "/path/to/blaze-engine-1.0-SNAPSHOT.jar" \
---conf spark.sql.extensions=org.apache.spark.sql.blaze.BlazeSparkSessionExtension \
---conf spark.executor.extraClassPath="./blaze-engine-1.0-SNAPSHOT.jar" \
-.... # your original arguments goes here
+SHIM=spark333 MODE=release ./release-docker.sh
 ```

-At the same time, there are a series of configurations that you can use to control Blaze with more granularity.
-
-| Parameter | Default value | Description |
-|-----------|---------------|-------------|
-| spark.executor.memoryOverhead | executor.memory * 0.1 | The amount of non-heap memory to be allocated per executor. Blaze would use this part of memory. |
-| spark.blaze.memoryFraction | 0.6 | A fraction of the off-heap that Blaze could use during execution. |
-| spark.blaze.batchSize | 10000 | Batch size for vectorized execution. |
-| spark.blaze.enable.shuffle | true | If enabled, use native, Arrow-IPC based Shuffle. |
-| spark.blaze.enable.[scan,project,filter,sort,union,sortmergejoin] | true | If enabled, offload the corresponding operator to native engine. |
-
+## Run Spark Job with Blaze Accelerator

-## Performance
+This section describes how to submit and configure a Spark job with Blaze support.

-We periodically benchmark Blaze locally with a 1 TB TPC-DS Dataset to show our latest results and prevent unnoticed
-performance regressions. Check [Benchmark Results](./benchmark-results/tpc-ds.md) with the latest date for the performance
-comparison with vanilla Spark.
+1. Move the Blaze jar package to the Spark client classpath (normally `spark-xx.xx.xx/jars/`).

-Currently, you can expect up to a 2x performance boost, cutting resource consumption to 1/5 within several keystrokes.
-Stay tuned and join us for more upcoming thrilling numbers.
+2. Add the following confs to the Spark configuration in `spark-xx.xx.xx/conf/spark-defaults.conf`:
+```properties
+spark.sql.extensions org.apache.spark.sql.blaze.BlazeSparkSessionExtension
+spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager

-![20220522-memcost](./benchmark-results/blaze-prev-20220522.png)
+# other Blaze confs are defined in spark-extension/src/main/java/org/apache/spark/sql/blaze/BlazeConf.java
+```

-We also encourage you to benchmark Blaze locally and share the results with us. 🤗
+3. Submit a query with spark-sql, or other tools like spark-thriftserver:
+```shell
+spark-sql -f tpcds/q01.sql
+```
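If you would rather not edit `spark-defaults.conf` globally, the same two confs can be passed per job instead; a sketch using the settings above (this per-job style is an assumption, not part of the commit):

```shell
spark-sql \
  --conf spark.sql.extensions=org.apache.spark.sql.blaze.BlazeSparkSessionExtension \
  --conf spark.shuffle.manager=org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager \
  -f tpcds/q01.sql
```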

-## Roadmap
-### 1. Operators
+## Performance

-Currently, there are still several operators that we cannot execute natively:
-- Aggregate. Relies on https://github.com/apache/arrow-datafusion/issues/1570.
-- Join with an optional filter condition. Relies on https://github.com/apache/arrow-datafusion/issues/2509.
-- Broadcast HashJoin.
-- Window.
+Check the [Benchmark Results](./benchmark-results/20230925.md) with the latest date for the performance
+comparison with vanilla Spark on a 1 TB TPC-DS dataset. The benchmark results show that Blaze saves
+~40% query time and ~45% cluster resources on average, with ~5x performance achieved in the best case (q06).
+Stay tuned and join us for more upcoming thrilling numbers.

-### 2. Compressed Shuffle
+Query time:
+![20230925-query-time](./benchmark-results/blaze-query-time-comparison-20230925.png)

-We use segmented Arrow-IPC files to express shuffle data. If we could apply IPC compression,
-we would benefit more from Shuffle since columnar data would have a better compression ratio. Tracked in [#4](https://github.com/blaze-init/blaze/issues/4).
+Cluster resources:
+![20230925-resources](./benchmark-results/blaze-executor-time-comparison-20230925.png)

-### 3. UDF support
-We would like to have a high-performance JVM-UDF invocation framework that could utilize a great variety
-of the existing UDFs written in Spark/Hive language. They are not supported natively in Blaze at the moment.
+We also encourage you to benchmark Blaze and share the results with us. 🤗

 ## Community
