801: How to set up & run SSBM

Summary

This note describes how to set up and run the Star Schema Benchmark (SSBM), which we usually use for performance measurement of PG-Strom with SSD-to-GPU Direct SQL mode.

Description

Prerequisites

  • The latest version of PG-Strom is successfully installed.
  • The operating system recognizes two or more NVME-SSD drives.

Set up an md-raid0 volume

If you have multiple NVME-SSD drives, build an md-raid0 configuration. All the NVME drives bundled into one volume must be connected to the same CPU, or to the same PCIe switch if any.

For example, if NVME0-3 are connected to CPU1 and NVME4-7 are connected to CPU2, you should not mix NVME2 and NVME6 in a single md-raid0 volume. In this case, we recommend making one md-raid0 volume for each group.
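
To confirm the topology, you can check which NUMA node each NVME controller is attached to, or inspect the PCIe tree. A minimal sketch, assuming the usual Linux sysfs layout:

$ cat /sys/class/nvme/nvme*/device/numa_node
$ lspci -tv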

# mdadm -C /dev/md0 -c 128 -l 0 -n 3 /dev/nvme0n1p1 /dev/nvme1n1p1 /dev/nvme2n1p1

This example makes an md-raid0 volume that consists of 3 NVME drives. Next, create a partition on the md-raid0 volume and format it with the ext4 filesystem.

# fdisk /dev/md0
# mkfs.ext4 /dev/md0p1
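
Before formatting, you can verify the array configuration; /proc/mdstat and mdadm --detail show the member drives and the chunk size:

# cat /proc/mdstat
# mdadm --detail /dev/md0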

Mount this volume somewhere, then change the directory ownership, and also the security context if SELinux is enabled.

# mount /dev/md0p1  /opt/nvme
# chown -R postgres:postgres /opt/nvme
(if SELinux is enabled)
# chcon -R system_u:object_r:postgresql_db_t:s0 /opt/nvme/

At this point, the /opt/nvme directory is readable and writable by the PostgreSQL server process.
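
As a quick sanity check, you can try to create and remove a file there as the PostgreSQL service user (assumed to be postgres here):

# sudo -u postgres touch /opt/nvme/write_test
# sudo -u postgres rm /opt/nvme/write_test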

Create tablespace

Run the CREATE TABLESPACE command to create a tablespace that points to the directory built on the md-raid0 volume.

postgres=# CREATE TABLESPACE nvme LOCATION '/opt/nvme';
CREATE TABLESPACE

Create SSBM tables

You can find the DDL script at test/ssbm/ssbm-ddl.sql.

$ git clone https://github.com/heterodb/pg-strom.git
$ psql -f pg-strom/test/ssbm/ssbm-ddl.sql

This script does not assign any tablespace, so run the command below to relocate the lineorder table, which is the largest portion of the SSBM dataset. It should be done prior to data loading, because changing the tablespace of a large table is an expensive operation.

ssbm=# ALTER TABLE lineorder SET TABLESPACE nvme;
ALTER TABLE

Use of multiple NVME-SSD clusters

For example, if your system has NVME0-3 connected to CPU1 and NVME4-7 connected to CPU2, you might create two md-raid0 volumes, one for each group, and two tablespaces on top of them. PostgreSQL v11 introduced hash partitioning, which distributes records over multiple tables, each of which can be located on its own tablespace. We recommend using hash partitioning if you want to use multiple tablespaces.
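
Assuming the two volumes are mounted on /opt/nvme0 and /opt/nvme1 (hypothetical paths; prepare and mount them the same way as above), the two tablespaces referenced below would be created like this:

postgres=# CREATE TABLESPACE nvme0 LOCATION '/opt/nvme0';
postgres=# CREATE TABLESPACE nvme1 LOCATION '/opt/nvme1';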

The ssbm-ddl.sql script needs to be adjusted a bit. Add PARTITION BY HASH (lo_orderkey) at the tail of the lineorder definition.

CREATE TABLE lineorder (
    lo_orderkey numeric,
    lo_linenumber integer,
      :
    lo_shipmode character(10)
)
PARTITION BY HASH (lo_orderkey);

Next, create the leaf partitions, one for each tablespace.

CREATE TABLE lineorder_p0 PARTITION OF lineorder FOR VALUES WITH (MODULUS 2, REMAINDER 0) TABLESPACE nvme0;
CREATE TABLE lineorder_p1 PARTITION OF lineorder FOR VALUES WITH (MODULUS 2, REMAINDER 1) TABLESPACE nvme1;

Load SSBM data

You can also find the utils/dbgen-ssbm command in the PG-Strom source directory. This version of the dbgen command is slightly extended to write the generated data to stdout, so you can load the SSBM data using the COPY ... FROM PROGRAM command without any temporary files.

=# \copy customer FROM PROGRAM './utils/dbgen-ssbm -Tc -X -s 401' DELIMITER '|';
=# \copy date1 FROM PROGRAM './utils/dbgen-ssbm -Td -X -s 401' DELIMITER '|';
=# \copy part FROM PROGRAM './utils/dbgen-ssbm -Tp -X -s 401' DELIMITER '|';
=# \copy supplier FROM PROGRAM './utils/dbgen-ssbm -Ts -X -s 401' DELIMITER '|';
=# \copy lineorder FROM PROGRAM './utils/dbgen-ssbm -Tl -X -s 401' DELIMITER '|';

Note that this example uses the \copy command of psql, so the argument of the PROGRAM clause is executed on the client side, and the command path is relative to the current working directory. Also note that this example specifies scale factor 401, which makes the database about 353GB in size.

vacuum analyze

If you utilize SSD-to-GPU Direct SQL on SSBM, don't forget to run VACUUM ANALYZE on the tables. Because the current version of PG-Strom cannot check the MVCC visibility of individual tuples on the GPU side, only "all-visible" blocks are loaded directly from NVME-SSD to the GPU. The VACUUM ANALYZE command forces generation of the visibility map of the tables, so it is highly recommended before running the benchmark.

=# vacuum analyze customer;
=# vacuum analyze date1;
=# vacuum analyze part;
=# vacuum analyze supplier;
=# vacuum analyze lineorder;
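
Optionally, the pg_visibility extension (shipped with PostgreSQL 9.6 or later) lets you confirm how many blocks of a table are all-visible:

=# CREATE EXTENSION pg_visibility;
=# SELECT * FROM pg_visibility_map_summary('lineorder');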

Run SSBM query

You can find the 13 queries of SSBM at test/sql. Run these scripts as follows.

$ psql ssbm
ssbm=# \timing on
Timing is on.

ssbm=# \i test/sql/ssbm-11.sql
   revenue
--------------
 446031203850
(1 row)

Time: 21514.983 ms (00:21.515)

You can run the other test cases (test/sql/ssbm-*.sql) in a similar way.

psql shows the query execution time (smaller is better). We usually calculate the query execution throughput (bigger is better) as: total database size [MB] / query execution time [sec]. For example, the total database size is about 353GB at scaling factor 401. If the query execution time is 60.0 sec, the throughput is (353 * 1024) / 60.0 = 6024.5MB/s, because SSBM never scans the lineorder table multiple times.
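
To get the total database size figure for this calculation, you can ask PostgreSQL directly:

ssbm=# SELECT pg_size_pretty(pg_database_size('ssbm'));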

For test cases where the total database size is larger than the system main memory, we usually preset the following parameters for better results. Just for your information.

SET pg_strom.enabled = on;
SET max_parallel_workers_per_gather = 2;
SET parallel_setup_cost = 100000;

Using arrow_fdw for lineorder

The lineorder table is the main part of the SSBM dataset. We can apply columnar storage to it using Arrow_Fdw and pg2arrow. Due to the nature of columnar storage, the DBMS loads only the referenced columns, which reduces the amount of I/O.

Making Apache Arrow file

pg2arrow is a utility command that runs a supplied SQL command and saves its result set as an Apache Arrow file. The example below makes a clone of the lineorder table in Apache Arrow format.

$ pg2arrow -h localhost -d ssbm -c 'SELECT lo_orderkey,lo_linenumber,lo_custkey,lo_partkey,lo_suppkey,lo_orderdate,lo_orderpriority,lo_shippriority,lo_quantity::int,lo_extendedprice::int,lo_ordertotalprice::int,lo_discount::int,lo_revenue::int,lo_supplycost::int,lo_tax::int,lo_commit_date,lo_shipmode FROM lineorder' -o flineorder.arrow

The -h option specifies the server address, the -d option specifies the source database, and the -c option specifies the SQL command to run; the result is written to the file specified by the -o option. If you dump a large table like lineorder, the --progress option may make sense.

Why some numeric values should be cast to int/real

Instead of SELECT *, the example above casts some numeric fields to int. Note that the numeric data type in PostgreSQL is mapped to the Decimal data type in Apache Arrow, which is a 128-bit fixed-point number. This means that numbers with few digits may consume more storage as Arrow::Decimal than as numeric in PostgreSQL. For example, 234 is a three-digit number that consumes 5 bytes as numeric (1 byte for the varlena header, 2 bytes for the numeric header, and 2 bytes for up to four digits). Because Arrow::Decimal takes 16 bytes for any number, it is not so efficient. If you know the data range fits in int/real, we recommend casting.
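
You can compare the per-value footprint yourself with pg_column_size(); note that an in-memory value carries a 4-byte varlena header, so the reported figure is slightly larger than the 5 bytes stored on disk with the 1-byte short header:

=# SELECT pg_column_size(234::numeric) AS as_numeric, pg_column_size(234::int) AS as_int;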

Create a foreign table

The SQL command below makes a foreign table that maps the Apache Arrow file above.

=# IMPORT FOREIGN SCHEMA flineorder FROM SERVER arrow_fdw INTO public OPTIONS (file '/path/to/flineorder.arrow');

It automatically constructs the foreign table's schema definition from the schema information embedded in the Apache Arrow file. The CREATE FOREIGN TABLE command requires a column definition list that strictly matches the field definitions in the Apache Arrow file, so the automatic way (IMPORT FOREIGN SCHEMA) will be more comfortable for most people.
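
For reference, a hand-written equivalent would look like the sketch below; the column list is abbreviated and must follow the exact field names and types recorded in the Arrow file:

=# CREATE FOREIGN TABLE flineorder (
       lo_orderkey numeric,
       lo_linenumber integer,
         :
       lo_shipmode character(10)
   )
   SERVER arrow_fdw
   OPTIONS (file '/path/to/flineorder.arrow');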

Run SSBM queries

Replace lineorder with flineorder in the SSBM queries to use the foreign table above instead of PostgreSQL's row store.

Example (Q1_1):
select sum(lo_extendedprice*lo_discount) as revenue
from flineorder,date1
where lo_orderdate = d_datekey
and d_year = 1993
and lo_discount between 1 and 3
and lo_quantity < 25;

In case of a partitioned table

First, make an Apache Arrow file for each leaf partition using pg2arrow, as in the example below.
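
Assuming leaf tables named lineorder_p0, lineorder_p1, ... as in the hash-partitioning section above, run one pg2arrow per leaf with the same column list and casts as the earlier example:

$ pg2arrow -h localhost -d ssbm -c 'SELECT lo_orderkey,lo_linenumber,lo_custkey,lo_partkey,lo_suppkey,lo_orderdate,lo_orderpriority,lo_shippriority,lo_quantity::int,lo_extendedprice::int,lo_ordertotalprice::int,lo_discount::int,lo_revenue::int,lo_supplycost::int,lo_tax::int,lo_commit_date,lo_shipmode FROM lineorder_p0' -o /path/to/flineorder__p0.arrow

Repeat this for the other leaf tables.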

Second, create the flineorder table as a partitioned table, not a foreign table. Don't forget to follow the data types here, if you changed them in the pg2arrow command.

=# CREATE TABLE flineorder (
       lo_orderkey numeric,
       lo_linenumber integer,
       lo_custkey numeric,
       lo_partkey integer,
       lo_suppkey numeric,
       lo_orderdate integer,
       lo_orderpriority character(15),
       lo_shippriority character(1),
       lo_quantity int,
       lo_extendedprice int,
       lo_ordertotalprice int,
       lo_discount int,
       lo_revenue int,
       lo_supplycost int,
       lo_tax int,
       lo_commit_date character(8),
       lo_shipmode character(10)
)
PARTITION BY HASH (lo_orderkey);

Then, add foreign tables as the leaf partitions.

=# CREATE FOREIGN TABLE flineorder__p0 PARTITION OF flineorder
   FOR VALUES WITH ( MODULUS 3, REMAINDER 0 )
   SERVER arrow_fdw
   OPTIONS (file '/path/to/flineorder__p0.arrow');
=# CREATE FOREIGN TABLE flineorder__p1 PARTITION OF flineorder
   FOR VALUES WITH ( MODULUS 3, REMAINDER 1 )
   SERVER arrow_fdw
   OPTIONS (file '/path/to/flineorder__p1.arrow');
=# CREATE FOREIGN TABLE flineorder__p2 PARTITION OF flineorder
   FOR VALUES WITH ( MODULUS 3, REMAINDER 2 )
   SERVER arrow_fdw
   OPTIONS (file '/path/to/flineorder__p2.arrow');

Don't forget to run ANALYZE on these tables to build table statistics and get better query execution plans.

=# ANALYZE flineorder__p0;
=# ANALYZE flineorder__p1;
=# ANALYZE flineorder__p2;
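
To confirm the setup, EXPLAIN on a simple query should show Foreign Scan nodes on the flineorder__p* leaf partitions (the exact plan shape depends on your configuration and PG-Strom settings):

=# EXPLAIN SELECT count(*) FROM flineorder;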

Metadata

  • Author: KaiGai Kohei [email protected]
  • Initial Date: 17-Jan-2020
  • Software version: PG-Strom v2.2