TPC-DS Generation, Execution and Analyzer for Postgres
Making TPC-DS Tools:

```shell
git clone https://github.com/celuk/tpcds-postgres
cd tpcds-postgres/DSGen-software-code-3.2.0rc1/tools
make
```

Database Generation:

```shell
cd ../..
nano pgtpcds_defaults ## change required variables
sudo ./tpcds_generator.sh
```

Splitting and Fixing SQLs:

```shell
cd your_query0_generated_directory ## either go to this directory or place query_0.sql and the scripts in the same directory
python3 split_sqls.py
python3 split_analyzing_sqls.py
```

Getting Analyzed Texts:

```shell
nano get_analyzed_txts.sh ## change required variables
sudo ./get_analyzed_txts.sh
```

Installing Python dependencies for graph_analyzed_txts.py:

```shell
pip3 install matplotlib
pip3 install BeautifulSoup4
pip3 install lxml
pip3 install numpy
pip3 install pprint
```

Plotting Stacked Bar Chart Graphs from Analyzed Texts:

```shell
nano graph_analyzed_txts.py ## change required variables
python3 graph_analyzed_txts.py --depesz ## after that you will get the stacked bar chart as .png and .pdf files
```
1-) The TPC-DS v3.2.0 benchmark kit (DSGen-software-code-3.2.0rc1) was downloaded from the TPC site. Because query generation was failing with an error, I added one line to the netezza.tpl file (following https://dba.stackexchange.com/questions/36938/how-to-generate-tpc-ds-query-for-sql-server-from-templates/97926#97926):

```diff
--- a/DSGen-software-code-3.2.0rc1/query_templates/netezza.tpl
+++ b/DSGen-software-code-3.2.0rc1/query_templates/netezza.tpl
@@ -32,6 +32,9 @@
 --
 -- Contributors:
 --
+
+define _END = "";
+
 define __LIMITA = "";
 define __LIMITB = "";
 define __LIMITC = "limit %d";
```
2-) The generator failed while reading customer.dat because of broken UTF-8 encoding. To fix this, I added the Python script fix_encoding.py; it runs automatically during database generation via the tpcds_generator.sh bash script and repairs the encoding of customer.dat.
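A minimal sketch of such an encoding fixer (the real fix_encoding.py may work differently; the latin-1 fallback here is an assumption):

```python
def fix_encoding(path_in, path_out):
    # Read raw bytes; if they are not valid UTF-8, re-decode them with
    # latin-1 (which never fails) and write the result back as UTF-8.
    with open(path_in, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin-1")  # permissive fallback
    with open(path_out, "w", encoding="utf-8") as f:
        f.write(text)
```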
3-) The remaining errors were syntax errors in 19 of the 99 queries when running them against the generated Postgres database. The query templates are written in ANSI SQL, which is not totally compatible with PostgreSQL. The syntax errors occurred in queries 5, 12, 16, 20, 21, 30, 32, 36, 37, 40, 70, 77, 80, 82, 86, 92, 94, 95 and 98.
I could have used an automatic converter such as jOOQ, but that would not resolve the column name errors. I could also have modified the .tpl files, but I did not want to change the original source. Instead, the modifications are applied after the SQL queries are generated, while they are being split by the split_analyzing_sqls.py or split_sqls.py scripts.
Note: the SQL queries must be fixed after every database generation, because TPC-DS generates the queries together with the database according to the SCALE factor (the size of the database in gigabytes).
Errors and fixes can be seen below:
Syntax Error in query30.sql
Column name error: `c_last_review_date_sk`
Fix of query30.sql Syntax Error
`c_last_review_date_sk` should be changed to `c_last_review_date`.
In the Python scripts I did:

```python
each_text = each_text.replace('c_last_review_date_sk', 'c_last_review_date')
```
Syntax Errors in queries 5, 12, 16, 20, 21, 32, 37, 40, 77, 80, 82, 92, 94, 95, 98
The `days` syntax is not valid in PostgreSQL.
Fix of the Syntax Errors in queries 5, 12, 16, 20, 21, 32, 37, 40, 77, 80, 82, 92, 94, 95, 98
Either all `days` keywords should be removed, or the expression should be rewritten like `'30 days'::interval`.
In the Python scripts I did:

```python
each_text = each_text.replace('days', '')
```
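The blanket replace above also strips `days` out of any identifier that happens to contain it; a more targeted variant (a sketch, not what the repository scripts do) rewrites only the `<number> days` pattern into PostgreSQL interval syntax:

```python
import re

def fix_days(sql):
    # Rewrite e.g. "+ 30 days" into "+ '30 days'::interval"; identifiers
    # that merely contain the word "days" are left untouched.
    return re.sub(r"\b(\d+)\s+days\b", r"'\1 days'::interval", sql)
```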
Syntax Errors in queries 36, 70, 86
Column name alias error. A quote from https://github.com/RunningJon/TPC-DS that explains the problem:

Query templates were modified to exclude columns not found in the query. In these cases, the common table expression used aliased columns but the dynamic filters included both the alias name as well as the original name. Referencing the original column name instead of the alias causes the query parser to not find the column.

Fix of the Syntax Errors in queries 36, 70, 86
Wrapping the query in a subquery fixes the problem: `select * from (` is added at the head of the query and `) as sub` is added before the final `order by` clause, which turns the query into a subquery like this:

```sql
select * from ( -- this is added to the head of the query
-- ...
-- encapsulated code
-- ...
) as sub -- this is added before the last order by
order by
-- ...
```

In the Python scripts I did:

```python
each_text = each_text.replace('select', 'select * from (select ', 1)
each_text = ') as sub\n order by'.join(each_text.rsplit('order by', 1))
```
The generation phase is adapted from https://github.com/AXLEproject/pg-tpch and https://ankane.org/tpc-ds, and the process is then automated.
For generation, first clone this repository, go to the tpcds-postgres/DSGen-software-code-3.2.0rc1/tools directory and type `make` in the terminal to compile the TPC generation tools:

```shell
git clone https://github.com/celuk/tpcds-postgres
cd tpcds-postgres/DSGen-software-code-3.2.0rc1/tools
make
```

Secondly, change the required variables in the pgtpcds_defaults file.
Thirdly, run tpcds_generator.sh with sudo:

```shell
sudo ./tpcds_generator.sh
```

Note: there are a lot of possibly unnecessary `sudo`s in tpcds_generator.sh because some servers may need them; you can try removing them to see if it still works in your case.
After the database has been generated, all 99 queries will be in a single file, query_0.sql. For separating the queries there are two Python scripts:

* split_sqls.py --> Splits `query_0.sql` into `query1.sql`, `query2.sql`, ..., `query99.sql`
* split_analyzing_sqls.py --> Splits `query_0.sql` into `query1.sql`, `query2.sql`, ..., `query99.sql`, prepending the `explain analyze` keyword so that running each query produces analyzed output.

First place query_0.sql and the Python scripts in the same place, or run the Python scripts in the directory of query_0.sql.
For getting normal SQLs run:

```shell
python3 split_sqls.py
```

For getting analyzing SQLs run:

```shell
python3 split_analyzing_sqls.py
```
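The splitting itself can be sketched as follows, assuming query_0.sql wraps each query in dsqgen's `-- start query N ...` / `-- end query N ...` comment markers (check your generated file; the exact marker format is an assumption here, and the repository scripts also apply the syntax fixes described above):

```python
import re

def split_queries(text):
    # Return (query_number, sql) pairs found between the
    # "-- start query N" and "-- end query N" comment markers.
    pattern = re.compile(
        r"-- start query (\d+)[^\n]*\n(.*?)-- end query \1", re.DOTALL)
    return [(int(num), body.strip()) for num, body in pattern.findall(text)]

def write_split(text, analyze=False):
    # split_analyzing_sqls.py additionally prepends "explain analyze".
    for num, sql in split_queries(text):
        with open(f"query{num}.sql", "w") as f:
            prefix = "explain analyze\n" if analyze else ""
            f.write(prefix + sql + "\n")
```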
First you need to start the database (note that `-D` takes the data directory, not a database name):

```shell
sudo -u <username> pg_ctl -D <data-directory> start
```

An example of starting the database:

```shell
sudo -u guest /home/guest/postgres/postgres-compiled/bin/pg_ctl -D /home/guest/bsc/databases/pgdata1GB start
```
If you cannot start because of an old process, please look at this answer to solve the problem and try to start database again: https://stackoverflow.com/questions/52963079/i-cant-start-server-postgresql-11-pg-ctl-could-not-start-server/73868082#73868082
For running a query, you can use this command:

```shell
sudo -u <username> psql -d <database-name> -f <query-name>
```

or this command:

```shell
sudo -u <username> psql <database-name> < <query-name>
```

An example of running a query:

```shell
sudo -u guest /home/guest/postgres/postgres-compiled/bin/psql tpcds1gb < query1.sql
```
For running all queries you can use get_analyzed_txts.sh. First edit the required variables in the file, then run it:

```shell
sudo ./get_analyzed_txts.sh
```
After that:

- if you run the normal SQLs, you will get the outputs of the SQL commands in a folder as separate .txt files;
- if you run the analyzing SQLs, you will get the analyzed outputs of the SQL commands in a folder as separate .txt files: q1a.txt, q2a.txt, ..., q99a.txt.
An example output of an analyzing SQL looks like this:

```
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1.25..1.45 rows=20 width=12) (actual time=0.134..0.263 rows=20 loops=1)
   Group Key: ib_income_band_sk
   Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on income_band  (cost=0.00..1.20 rows=20 width=12) (actual time=0.027..0.032 rows=20 loops=1)
 Planning Time: 1.741 ms
 Execution Time: 0.518 ms
(6 rows)
```
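Each analyzed .txt can then be mined for its timing summary; a small sketch matching the output format shown above:

```python
import re

def parse_times(plan_text):
    # Pull "Planning Time" and "Execution Time" (milliseconds) out of
    # the textual EXPLAIN ANALYZE output.
    times = {}
    for label in ("Planning Time", "Execution Time"):
        m = re.search(label + r": ([\d.]+) ms", plan_text)
        if m:
            times[label] = float(m.group(1))
    return times
```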
After running the desired queries, don't forget to stop the database:

```shell
sudo -u <username> pg_ctl -D <data-directory> stop
```
To get stacked bar chart graphs from the analyzed txts, you can use graph_analyzed_txts.py. If the `--depesz` flag is given, it uploads the analyzed txts from the given path to https://explain.depesz.com, fetches the source of each uploaded URL, parses it, extracts the table from it, and stores the results in a folder. If the texts have already been uploaded, you can omit the `--depesz` flag to skip these steps and avoid the wait. It then creates customizable graphs according to the given flags.
First install the Python dependencies:

```shell
pip3 install matplotlib
pip3 install BeautifulSoup4
pip3 install lxml
pip3 install numpy
pip3 install pprint
```

Change the required variables in the script, then run it:

```shell
python3 graph_analyzed_txts.py --depesz
```
This will give an output graph like this:
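Under the hood, a stacked bar chart like this can be produced with matplotlib and numpy roughly as follows (the per-query timings below are made up for illustration; the real script derives them from the depesz tables):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render to files without a display
import matplotlib.pyplot as plt

# Hypothetical time (ms) spent in each plan node type, per query.
queries = ["q1", "q2", "q3"]
timings = {
    "Seq Scan":   np.array([120.0, 40.0, 310.0]),
    "Sort":       np.array([30.0, 90.0, 15.0]),
    "Index Scan": np.array([10.0, 5.0, 60.0]),
}

bottom = np.zeros(len(queries))
for func, values in timings.items():
    plt.bar(queries, values, bottom=bottom, label=func)
    bottom += values  # stack the next node type on top

plt.ylabel("time (ms)")
plt.legend()
plt.savefig("stacked_bars.png")
plt.savefig("stacked_bars.pdf")
```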
There are several flags to use while running graph_analyzed_txts.py:

| flag | short version of the flag | description | example usage |
|---|---|---|---|
| --depesz | -dz | Uploads the given analyzed texts to https://explain.depesz.com. You need this flag the first time for your analyzed texts; afterwards you can omit it. | -dz |
| --hlfunc | -hf | Takes one string argument to highlight the desired function in the graph. | -hf "Sort" |
| --querylist | -ql | Takes the desired query list as comma-separated numbers in a string, to plot special graphs. | -ql "1, 32,3,76 , 55" |
| --part | -p | Like the querylist option, but plots the graph in parts of ten queries, e.g. the first ten or the fourth ten. | -p 5 |
| --bottomed | -bt | If you are highlighting a function and it is floating in the graph, anchors the bars to the bottom. | -bt |
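The flag handling can be wired up with argparse along these lines (a sketch; the real script's option parsing may differ):

```python
import argparse

def build_parser():
    # Mirrors the flags in the table above; help texts are paraphrased.
    p = argparse.ArgumentParser(
        description="Plot stacked bar charts from analyzed TPC-DS texts")
    p.add_argument("--depesz", "-dz", action="store_true",
                   help="upload analyzed texts to explain.depesz.com first")
    p.add_argument("--hlfunc", "-hf",
                   help='plan node to highlight, e.g. "Sort"')
    p.add_argument("--querylist", "-ql",
                   help='comma-separated query numbers, e.g. "1,32,3"')
    p.add_argument("--part", "-p", type=int,
                   help="plot only the N-th group of ten queries")
    p.add_argument("--bottomed", "-bt", action="store_true",
                   help="anchor the highlighted bars to the x axis")
    return p
```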
This highlights the Index Scan function in the graph:

```shell
python3 graph_analyzed_txts.py -hf "Index Scan"
```

This makes it bottomed:

```shell
python3 graph_analyzed_txts.py -dz -hf "Index Scan" -bt
```

This gives a special graph according to the given query list and highlights the Index Scan function at the same time:

```shell
python3 graph_analyzed_txts.py -hf "Index Scan" --querylist "1, 32,3,76 , 55"
```

This gives the seventh part of the main graph:

```shell
python3 graph_analyzed_txts.py -p 7
```
First install hypopg (note the `cd hypopg` step, which is needed before building):

```shell
export PATH=/home/guest/bsc/postgres-compiled/bin:$PATH
git clone https://github.com/HypoPG/hypopg
cd hypopg
USE_PGXS=1 make install
```

Then install dexter:

```shell
wget -qO- https://dl.packager.io/srv/pghero/dexter/key | sudo apt-key add -
sudo wget -O /etc/apt/sources.list.d/dexter.list \
  https://dl.packager.io/srv/pghero/dexter/master/installer/ubuntu/$(. /etc/os-release && echo $VERSION_ID).repo
sudo apt-get update
sudo apt-get -y install dexter
```
Change the required variables in create_indexes.py and then run the script to create indexes in the desired database:

```shell
python3 create_indexes.py
```