801: How to set up & run SSBM
This note describes how to set up and run the Star Schema Benchmark (SSBM), which is commonly used to measure the performance of PG-Strom with SSD-to-GPU Direct SQL mode. It assumes that:
- The latest version of PG-Strom is successfully installed.
- The operating system recognizes two or more NVME-SSD drives.
If you have multiple NVME-SSD drives, build an md-raid0 configuration. All NVME drives in one volume must be connected to the same CPU or PCIe switch, if any.
For example, if NVME0-3 are connected to CPU1 and NVME4-7 are connected to CPU2, you should not mix NVME2 and NVME6 in a single md-raid0 volume. In this case, we recommend making two md-raid0 volumes, one per CPU.
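To check which CPU each drive is attached to, you can read its NUMA node from sysfs. The sketch below is a minimal helper of our own (the function name list_nvme_numa is not part of PG-Strom; the sysfs layout is standard on Linux):

```shell
# Print the NUMA node each NVME device is attached to.
# Drives reporting different node numbers hang off different CPUs
# and should not be mixed in one md-raid0 volume.
list_nvme_numa() {
    base="${1:-/sys/class/nvme}"   # sysfs base, overridable for testing
    for dev in "$base"/nvme*; do
        [ -e "$dev/device/numa_node" ] || continue
        printf '%s: numa_node=%s\n' "$(basename "$dev")" "$(cat "$dev/device/numa_node")"
    done
}
list_nvme_numa   # inspect the local system
```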
# mdadm -C /dev/md0 -c 128 -l 0 -n 3 /dev/nvme0n1p1 /dev/nvme1n1p1 /dev/nvme2n1p1
This example makes an md-raid0 volume that consists of 3 NVME drives.
Then create a partition on the md-raid0 volume and format it with the ext4 filesystem.
# fdisk /dev/md0
# mkfs.ext4 /dev/md0p1
Mount the volume somewhere, then change the directory ownership, and also the security context if SELinux is enabled.
# mount /dev/md0p1 /opt/nvme
# chown postgresql:postgresql -R /opt/nvme
(if SELinux is enabled)
# chcon -R system_u:object_r:postgresql_db_t:s0 /opt/nvme/
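If you want the volume mounted automatically at boot, an /etc/fstab entry along the following lines may be added (a sketch only; adjust the device path and mount options for your system):

```
/dev/md0p1   /opt/nvme   ext4   defaults,noatime   0 0
```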
At this point, the /opt/nvme directory is readable and writable by the PostgreSQL server process.
Run the CREATE TABLESPACE command to create a tablespace that points to the directory on the md-raid0 volume.
postgres=# CREATE TABLESPACE nvme LOCATION '/opt/nvme';
CREATE TABLESPACE
You can find the DDL script at test/ssbm/ssbm-ddl.sql.
$ git clone https://github.com/heterodb/pg-strom.git
$ psql -f pg-strom/test/ssbm/ssbm-ddl.sql
This script does not assign a tablespace, so run the command below to place the lineorder table, which is the largest portion of the SSBM, on the new tablespace. Do this prior to data loading, because changing the tablespace is an expensive operation for a large table.
ssbm=# ALTER TABLE lineorder SET TABLESPACE nvme;
ALTER TABLE
For example, if your system has NVME0-3 connected to CPU1 and NVME4-7 connected to CPU2, you might create two md-raid0 volumes and, correspondingly, two tablespaces. PostgreSQL v11 introduced the hash-partition feature, which distributes records into multiple tables that can each be located on their own tablespace. We recommend using hash partitioning if you want to use multiple tablespaces.
The ssbm-ddl.sql script needs a small adjustment: add PARTITION BY HASH (lo_orderkey) at the tail of the lineorder definition.
CREATE TABLE lineorder (
lo_orderkey numeric,
lo_linenumber integer,
:
lo_shipmode character(10)
)
PARTITION BY HASH (lo_orderkey);
Next, create the partition leaves, one for each tablespace.
CREATE TABLE lineorder_p0 PARTITION OF lineorder FOR VALUES WITH (MODULUS 2, REMAINDER 0) TABLESPACE nvme0;
CREATE TABLE lineorder_p1 PARTITION OF lineorder FOR VALUES WITH (MODULUS 2, REMAINDER 1) TABLESPACE nvme1;
You can also find the utils/dbgen-ssbm command in the PG-Strom source directory.
This version of the dbgen command is slightly extended to write the generated data to stdout, so you can load the SSBM data using the COPY ... PROGRAM command without any temporary files.
=# \copy customer FROM PROGRAM './utils/dbgen-ssbm -Tc -X -s 401' DELIMITER '|';
=# \copy date1 FROM PROGRAM './utils/dbgen-ssbm -Td -X -s 401' DELIMITER '|';
=# \copy part FROM PROGRAM './utils/dbgen-ssbm -Tp -X -s 401' DELIMITER '|';
=# \copy supplier FROM PROGRAM './utils/dbgen-ssbm -Ts -X -s 401' DELIMITER '|';
=# \copy lineorder FROM PROGRAM './utils/dbgen-ssbm -Tl -X -s 401' DELIMITER '|';
Note that this example uses the \copy command of psql, so the argument of the PROGRAM clause is interpreted from the client's standpoint; the command path is relative to the current working directory.
Also note that this example specifies scale factor 401, which makes the database size about 353GB.
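As a rough sizing guide, the SSBM lineorder table holds about SF * 6,000,000 rows (analogous to TPC-H lineitem). The helper below is an illustrative sketch of our own, not part of the benchmark tooling:

```shell
# Estimate the lineorder row count for a given SSBM scale factor (SF).
# Uses the rule of thumb: rows ~= SF * 6,000,000.
ssbm_lineorder_rows() {
    awk -v sf="$1" 'BEGIN { printf "%.0f\n", sf * 6000000 }'
}
ssbm_lineorder_rows 401    # prints 2406000000
```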
If you utilize SSD-to-GPU Direct SQL on SSBM, don't forget to run VACUUM ANALYZE on the tables.
Because the current version of PG-Strom cannot check the MVCC visibility of individual tuples on the GPU side, only "all-visible" blocks are loaded directly from NVME-SSD to GPU.
The VACUUM ANALYZE command forces generation of the visibility map of the tables, so it is highly recommended before running the benchmark.
=# vacuum analyze customer;
=# vacuum analyze date1;
=# vacuum analyze part;
=# vacuum analyze supplier;
=# vacuum analyze lineorder;
You can find the 13 SSBM queries at test/sql.
Run these scripts as follows.
$ psql ssbm
ssbm=# \timing on
Timing is on.
ssbm=# \i test/sql/ssbm-11.sql
revenue
--------------
446031203850
(1 row)
Time: 21514.983 ms (00:21.515)
You can run other test cases (test/sql/ssbm-*.sql
) in a similar way.
psql shows the query execution time (smaller is better). We usually calculate the query execution throughput (bigger is better) as total database size [MB] / query execution time [sec].
For example, the total database size is 353GB at scale factor 401. If the query execution time is 60.0 sec, the throughput is (353 * 1024) / 60.0 = 6024.5MB/s. This metric works because SSBM never scans the lineorder table more than once per query.
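The throughput formula above can be sketched as a small helper (the function name is our own, for illustration):

```shell
# Query execution throughput as defined in the text:
# total database size [MB] / query execution time [sec].
ssbm_throughput_mbps() {
    awk -v gb="$1" -v sec="$2" 'BEGIN { printf "%.1f\n", gb * 1024 / sec }'
}
ssbm_throughput_mbps 353 60.0    # prints 6024.5
```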
For test cases where the total database size is larger than the system main memory, we usually preset the following parameters for better results. Just for your information:
SET pg_strom.enabled = on;
SET max_parallel_workers_per_gather = 2;
SET parallel_setup_cost = 100000;
The lineorder table is the main part of the SSBM dataset. We can apply columnar storage to it using Arrow_Fdw and pg2arrow.
Due to the nature of columnar storage, the DBMS loads only the referenced columns, which reduces the amount of I/O.
pg2arrow is a utility command that runs a supplied SQL command and saves its results as an Apache Arrow file.
The example below makes a clone of the lineorder table in Apache Arrow format.
$ pg2arrow -h localhost -d ssbm -c 'SELECT lo_orderkey,lo_linenumber,lo_custkey,lo_partkey,lo_suppkey,lo_orderdate,lo_orderpriority,lo_shippriority,lo_quantity::int,lo_extendedprice::int,lo_ordertotalprice::int,lo_discount::int,lo_revenue::int,lo_supplycost::int,lo_tax::int,lo_commit_date,lo_shipmode FROM lineorder' -o flineorder.arrow
The -h option specifies the server address, the -d option specifies the source database, and the -c option specifies the SQL command to run; the results are written to the file specified by the -o option. If you want to dump a large table like lineorder, the --progress option may make sense.
Instead of SELECT *, the above example casts some numeric fields to int.
Please note that the numeric data type in PostgreSQL is mapped to the Decimal data type in Apache Arrow, which is a 128bit fixed-point number. This means that numbers with few digits may consume more storage as Arrow::Decimal than as numeric in PostgreSQL.
For example, 234 is a three-digit number that consumes 5 bytes as numeric (1 byte for the varlena header, 2 bytes for the numeric header, and 2 bytes for up to four digits). Because Arrow::Decimal takes 16 bytes for any number, it is not very efficient. If you know that int/real is sufficient for your data range, we recommend casting.
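To get a feel for the impact, the sketch below estimates the extra storage per column when a small numeric (~5 bytes) becomes a 16-byte Arrow Decimal128. The helper and the row count (scale factor 401, ~6M rows per SF) are our own illustrative assumptions:

```shell
# Extra storage [GB] from Decimal128 (16B) vs. a small numeric (~5B),
# per column over N rows.
decimal_overhead_gb() {
    awk -v rows="$1" 'BEGIN { printf "%.1f\n", rows * (16 - 5) / (1024 * 1024 * 1024) }'
}
decimal_overhead_gb 2406000000    # prints 24.6
```

At roughly 2.4 billion lineorder rows, each retained numeric column costs on the order of 25GB more in Arrow format, which is why casting to int/real is attractive.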
The SQL command below makes a foreign table that maps the Apache Arrow file above.
=# IMPORT FOREIGN SCHEMA flineorder FROM SERVER arrow_fdw INTO public OPTIONS (file '/path/to/flineorder.arrow');
It automatically constructs the schema definition of a foreign table using the schema information embedded in the Apache Arrow file.
The CREATE FOREIGN TABLE command requires a column definition list that strictly matches the field definitions in the Apache Arrow file, so the automatic way (IMPORT FOREIGN SCHEMA) will be more comfortable for most people.
To use the foreign table above instead of the PostgreSQL row store, replace lineorder with flineorder in the SSBM queries.
Example (Q1_1):
select sum(lo_extendedprice*lo_discount) as revenue
from flineorder,date1
where lo_orderdate = d_datekey
and d_year = 1993
and lo_discount between 1 and 3
and lo_quantity < 25;
First of all, make Apache Arrow files for each partition-leaf table using pg2arrow.
Second, create flineorder as a partitioned table, not a foreign table.
Don't forget to keep the same data types if you changed them for pg2arrow.
=# CREATE TABLE flineorder (
lo_orderkey numeric,
lo_linenumber integer,
lo_custkey numeric,
lo_partkey integer,
lo_suppkey numeric,
lo_orderdate integer,
lo_orderpriority character(15),
lo_shippriority character(1),
lo_quantity int,
lo_extendedprice int,
lo_ordertotalprice int,
lo_discount int,
lo_revenue int,
lo_supplycost int,
lo_tax int,
lo_commit_date character(8),
lo_shipmode character(10)
)
PARTITION BY HASH (lo_orderkey);
Then, add foreign tables as the partition leaves.
=# CREATE FOREIGN TABLE flineorder__p0 PARTITION OF flineorder
FOR VALUES WITH ( MODULUS 3, REMAINDER 0 )
SERVER arrow_fdw
OPTIONS (file '/path/to/flineorder__p0.arrow');
=# CREATE FOREIGN TABLE flineorder__p1 PARTITION OF flineorder
FOR VALUES WITH ( MODULUS 3, REMAINDER 1 )
SERVER arrow_fdw
OPTIONS (file '/path/to/flineorder__p1.arrow');
=# CREATE FOREIGN TABLE flineorder__p2 PARTITION OF flineorder
FOR VALUES WITH ( MODULUS 3, REMAINDER 2 )
SERVER arrow_fdw
OPTIONS (file '/path/to/flineorder__p2.arrow');
Please don't forget to run ANALYZE on these tables to build table statistics and obtain better query execution plans.
=# ANALYZE flineorder__p0;
=# ANALYZE flineorder__p1;
=# ANALYZE flineorder__p2;
- Author: KaiGai Kohei [email protected]
- Initial Date: 17-Jan-2020
- Software version: PG-Strom v2.2