TPC-H queries implemented in Spark using the DataFrames API.
The TPC-H is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions. — https://www.tpc.org/tpch
tpch-spark
requires that Spark is installed on your machine. You can download Spark from https://spark.apache.org/downloads.html. At a high level, to install Spark you have to:
# Step 1:
tar xvfz <the tgz file you downloaded>
# Step 2:
# [optionally move/rename the untarred directory wherever you want, say, $HOME/spark]
# Step 3:
export PATH=$PATH:$HOME/spark/bin
# or better yet, add the above to your bashrc (or equivalent) and source it.
git clone https://github.com/ssavvides/tpch-spark
cd tpch-spark
Navigate to the data generator directory dbgen
and build the data generator:
cd dbgen
make
This should generate an executable called dbgen
. Use the -h
flag to see the various options the tool offers.
./dbgen -h
The simplest case is running the dbgen
executable with no flags.
./dbgen
The above generates tables with extension .tbl
with scale 1 (default) for a total of roughly 1GB size across all tables.
$ ls -hl *.tbl
-rw-rw-r-- 1 savvas savvas 24M May 28 12:39 customer.tbl
-rw-rw-r-- 1 savvas savvas 725M May 28 12:39 lineitem.tbl
-rw-rw-r-- 1 savvas savvas 2.2K May 28 12:39 nation.tbl
-rw-rw-r-- 1 savvas savvas 164M May 28 12:39 orders.tbl
-rw-rw-r-- 1 savvas savvas 114M May 28 12:39 partsupp.tbl
-rw-rw-r-- 1 savvas savvas 24M May 28 12:39 part.tbl
-rw-rw-r-- 1 savvas savvas 389 May 28 12:39 region.tbl
-rw-rw-r-- 1 savvas savvas 1.4M May 28 12:39 supplier.tbl
For different size tables you can use the -s
(scale) option. For example,
./dbgen -s 10
will generate roughly 10GB of input data.
Note that by default, dbgen
uses a |
as a column separator, and includes a |
at the end of each entry.
$ cat region.tbl
0|AFRICA|lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to |
1|AMERICA|hs use ironic, even requests. s|
2|ASIA|ges. thinly even pinto beans ca|
3|EUROPE|ly final courts cajole furiously final excuse|
4|MIDDLE EAST|uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl|
You can find the schemas of the generated tables in the TPC-H specification
tpch-spark
is written in Scala as a self-contained Spark application.
Use the provided sbt
file to build tpch-spark
as a spark application.
cd tpch-spark
sbt package
The above command will package the application into a jar file, e.g., ./target/scala-2.12/spark-tpc-h-queries_2.12-1.0.jar
which you will be needing in the next step.
You can run all TPC-H queries from Q01 to Q22 by running:
spark-submit --class "main.scala.TpchQuery" target/scala-2.12/spark-tpc-h-queries_2.12-1.0.jar
If you want to run a specific query you can use
spark-submit --class "main.scala.TpchQuery" target/scala-2.12/spark-tpc-h-queries_2.12-1.0.jar <query number>
where <query number>
is the number of the query to run, i.e., 1, 2, ..., 22.
- By default,
tpch-spark
will look for the input data files (the*.tbl
files generated bydbgen
) in"<current working directory>/dbgen"
. You can point to another location by setting the environment variableTPCH_INPUT_DATA_DIR
. - By default, the query results will be stored in
"${TPCH_INPUT_DATA_DIR}/output/{Q01, Q02, ...}
, or to whatever locationTPCH_QUERY_OUTPUT_DIR
is set. - The execution times for each query run will be stored in a file with path
"<current working directory>/tpch_execution_times.txt"
or to whatever file pathTPCH_EXECUTION_TIMES
points to.
For example, to replace the default locations you can use:
export TPCH_INPUT_DATA_DIR="$HOME/tpch-data"
export TPCH_QUERY_OUTPUT_DIR="$HOME/tpch-results"
export TPCH_EXECUTION_TIMES="$HOME/tpch-times.txt"
- Data generator (http://www.tpc.org/tpch/)
- TPC-H for Hive (https://issues.apache.org/jira/browse/hive-600)
- TPC-H for PIG (https://github.com/ssavvides/tpch-pig)