Skip to content

809: Setup of TPC‐H benchmark set

KaiGai Kohei edited this page Jun 26, 2024 · 2 revisions

Summary

This note introduces the steps to setup TPC-H benchmark dataset.

Step by step

Checkout the latest version of PG-Strom from GitHub

% git clone https://github.com/heterodb/pg-strom.git

Move to the test/tpch subdirectory

% cd pg-strom/test/tpch

Run the script to create tables

% psql DBNAME -f ./tpch-ddl.sql

Build dbgen-tpch which is based on v3.0.1. Just run make.

% make
(echo "const char *static_dists_dss =";         \
 cat dists.dss | while read line;                                       \
 do                                                                                     \
   echo $line | sed -e 's/^/  "/g' -e 's/$/\\n"/g'; \
 done; echo ";") > dists.dss.h
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o build.o build.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o driver.o driver.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o bm_utils.o bm_utils.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o rnd.o rnd.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o print.o print.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o load_stub.o load_stub.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o bcd2.o bcd2.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o speed_seed.o speed_seed.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o text.o text.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o permute.o permute.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o rng64.o rng64.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64  -O -o dbgen-tpch build.o driver.o bm_utils.o rnd.o print.o load_stub.o bcd2.o speed_seed.o text.o permute.o rng64.o -lm

Load the dataset using dbgen-tpch.

% psql DBNAME

=# \copy customer from program './dbgen-tpch -s 1000 -X -Tc' delimiter '|'
=# \copy lineitem from program './dbgen-tpch -s 1000 -X -TL' delimiter '|'
=# \copy nation   from program './dbgen-tpch -s 1000 -X -Tn' delimiter '|'
=# \copy orders   from program './dbgen-tpch -s 1000 -X -TO' delimiter '|'
=# \copy part     from program './dbgen-tpch -s 1000 -X -Tp' delimiter '|'
=# \copy partsupp from program './dbgen-tpch -s 1000 -X -TS' delimiter '|'
=# \copy region   from program './dbgen-tpch -s 1000 -X -Tr' delimiter '|'
=# \copy supplier from program './dbgen-tpch -s 1000 -X -Ts' delimiter '|'

=# VACUUM ANALYZE;

After that, TPC-H benchmark database will be built. -s is scaling factor. SF=1 is almost 1GB, therefore, -s 1000 will generate 1TB dataset in total.

You can see the following database size if SF=1000. It is not just 1000GB, because PostgreSQL adds 24bytes header for each rows in front of the payload.

tpch=# \d+
                                      List of relations
 Schema |   Name   | Type  | Owner  | Persistence | Access method |    Size    | Description
--------+----------+-------+--------+-------------+---------------+------------+-------------
 public | customer | table | kaigai | permanent   | heap          | 28 MB      |
 public | lineitem | table | kaigai | permanent   | heap          | 882 GB     |
 public | nation   | table | kaigai | permanent   | heap          | 40 kB      |
 public | orders   | table | kaigai | permanent   | heap          | 205 GB     |
 public | part     | table | kaigai | permanent   | heap          | 31 GB      |
 public | partsupp | table | kaigai | permanent   | heap          | 134 GB     |
 public | region   | table | kaigai | permanent   | heap          | 8192 bytes |
 public | supplier | table | kaigai | permanent   | heap          | 1736 MB    |
(8 rows)

You can see the test queries in test/tpch/queries directory.

$ ls queries/
tpch-01.sql  tpch-03.sql  tpch-05.sql  tpch-07.sql  tpch-09.sql  tpch-11.sql  tpch-13.sql  tpch-15.sql  tpch-17.sql  tpch-19.sql  tpch-21.sql
tpch-02.sql  tpch-04.sql  tpch-06.sql  tpch-08.sql  tpch-10.sql  tpch-12.sql  tpch-14.sql  tpch-16.sql  tpch-18.sql  tpch-20.sql  tpch-22.sql

And, you can give the query as follows:

tpch=# \i queries/tpch-01.sql