| order | title |
|---|---|
| 1 | Method |
This document provides a detailed description of the QA process. It is intended to be used by engineers reproducing the experimental setup for future tests of Tendermint.
The (first iteration of the) QA process as described in the RELEASES.md document was applied to version v0.34.x in order to have a set of results acting as a benchmarking baseline. This baseline is then compared with results obtained in later versions.
Out of the testnet-based test cases described in the releases document, we focused on two: the 200 Node Test and the Rotating Nodes Test.
Running the tests requires:

- An account at Digital Ocean (DO), with a high droplet limit (>202)
- A machine to orchestrate the tests, with the following installed:
  - A clone of the testnet repository
    - This repository contains all the scripts mentioned in the remainder of this section
  - Digital Ocean CLI (`doctl`)
  - Terraform CLI
  - Ansible CLI

Extracting the results requires:

- Matlab or Octave
- Prometheus server installed
- blockstore DB of one of the full nodes in the testnet
- Prometheus DB
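Before provisioning anything, it can help to confirm that the command-line tools listed above are available on the orchestrating machine. The following is a minimal sketch: `missing_tools` is a hypothetical helper (not part of the testnet repository), and the binary names `terraform`, `ansible`, `octave` and `prometheus` are assumptions about how the tools are installed; only `doctl` is named in this document.

```shell
# missing_tools: hypothetical helper. Prints each named command that is
# not found on PATH, one per line. Prints nothing if all are present.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Example call (binary names are assumptions, adjust to your installation):
#   missing_tools doctl terraform ansible octave prometheus
```

An empty output means every listed command was found; any name printed needs to be installed or added to `PATH` first.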
This section explains how the 200 node test was carried out, for reproducibility purposes.
1. [If you haven't done it before] Follow steps 1-4 of the `README.md` at the top of the testnet repository to configure Terraform and `doctl`.
2. Copy file `testnets/testnet200.toml` onto `testnet.toml` (do NOT commit this change).
3. Set the variable `VERSION_TAG` in the `Makefile` to the git hash that is to be tested.
4. Follow steps 5-10 of the `README.md` to configure and start the 200 node testnet.
   - WARNING: Do NOT forget to run `make terraform-destroy` as soon as you are done with the tests (see step 9).
5. As a sanity check, connect to the Prometheus node's web interface and check the graph for the `tendermint_consensus_height` metric. All nodes should be increasing their heights.
6. `ssh` into the `testnet-load-runner`, then copy script `script/200-node-loadscript.sh` and run it from the load runner node.
   - Before running it, you need to edit the script to provide the IP address of a full node. This node will receive all transactions from the load runner node.
   - This script takes about 40 minutes to run.
   - It runs 90-second experiments in a loop with different loads.
7. Run `make retrieve-data` to gather all relevant data from the testnet into the orchestrating machine.
8. Verify that the data was collected without errors:
   - at least one blockstore DB for a Tendermint validator
   - the Prometheus database from the Prometheus node
   - for extra care, you can run `zip -T` on the `prometheus.zip` file and (one of) the `blockstore.db.zip` file(s)
9. Run `make terraform-destroy`.
   - Don't forget to type `yes` when prompted! Otherwise the testnet is not destroyed.
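The archive check in the verification step can be scripted so that it is not skipped before the testnet is destroyed. This is a minimal sketch: `verify_archives` is a hypothetical helper, and where the archives end up depends on how `make retrieve-data` stores them.

```shell
# verify_archives: hypothetical helper. Runs `zip -T` (archive integrity
# test) on each file given as argument. Returns non-zero if any file is
# missing or fails the test; returns zero otherwise.
verify_archives() {
  failed=0
  for archive in "$@"; do
    if [ ! -f "$archive" ]; then
      echo "MISSING: $archive" >&2
      failed=1
    elif ! zip -T "$archive" >/dev/null 2>&1; then
      echo "CORRUPT: $archive" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Example call (paths are assumptions):
#   verify_archives prometheus.zip blockstore.db.zip && echo "archives OK"
```

Running the check before `make terraform-destroy` matters because the data can no longer be retrieved once the nodes are gone.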
The method for extracting the results described here is highly manual (and exploratory) at this stage. The Core team should improve it at every iteration to increase the amount of automation.
1. Unzip the blockstore into a directory.

2. Extract the latency report and the raw latencies for all the experiments. Run these commands from the directory containing the blockstore:

   ```bash
   # Create the output directory first; the redirection below fails if it is missing.
   mkdir -p results
   go run github.com/tendermint/tendermint/test/loadtime/cmd/report@3ec6e424d --database-type goleveldb --data-dir ./ > results/report.txt
   go run github.com/tendermint/tendermint/test/loadtime/cmd/report@3ec6e424d --database-type goleveldb --data-dir ./ --csv results/raw.csv
   ```

3. File `report.txt` contains an unordered list of experiments with varying concurrent connections and transaction rate.
   - Create files `report01.txt`, `report02.txt`, `report04.txt` and, for each experiment in file `report.txt`, copy its related lines to the file whose name matches the number of connections.
   - Sort the experiments in `report01.txt` in ascending tx rate order. Likewise for `report02.txt` and `report04.txt`.

4. Generate file `report_tabbed.txt` by showing the contents of `report01.txt`, `report02.txt`, `report04.txt` side by side.
   - This effectively creates a table where each row is a particular tx rate and each column is a particular number of websocket connections.

5. Extract the raw latencies from file `raw.csv` using the following bash loop. This creates a `.csv` file and a `.dat` file per experiment. The format of the `.dat` files is amenable to loading them as matrices in Octave.

   ```bash
   # Collect the experiment UUIDs in the order of the sorted report files.
   uuids=($(cat report01.txt report02.txt report04.txt | grep '^Experiment ID: ' | awk '{ print $3 }'))
   # bash arrays are zero-indexed, so the first UUID is at index 0.
   c=0
   for i in 01 02 04; do
     for j in 0025 0050 0100 0200; do
       echo "$i" "$j" "$c" "${uuids[$c]}"
       filename="c${i}_r${j}"
       # Keep only the rows of this experiment.
       grep "${uuids[$c]}" raw.csv > "${filename}.csv"
       # Columns 2 and 3, space-separated, for Octave.
       tr , ' ' < "${filename}.csv" | awk '{ print $2, $3 }' > "${filename}.dat"
       c=$((c + 1))
     done
   done
   ```

6. Enter Octave.

7. Load all `.dat` files generated in step 5 into matrices using this Octave code snippet:

   ```octave
   conns = { "01"; "02"; "04" };
   rates = { "0025"; "0050"; "0100"; "0200" };
   for i = 1:length(conns)
     for j = 1:length(rates)
       filename = strcat("c", conns{i}, "_r", rates{j}, ".dat");
       % Creates a matrix named after the file, e.g. c01_r0025.
       load("-ascii", filename);
     endfor
   endfor
   ```

8. Set variable `release` to the current release undergoing QA:

   ```octave
   release = "v0.34.x";
   ```

9. Generate a plot with all (or some) experiments, where the X axis is the experiment time and the Y axis is the latency of transactions. The following snippet plots all experiments.

   ```octave
   legends = {};
   hold off;
   for i = 1:length(conns)
     for j = 1:length(rates)
       data_name = strcat("c", conns{i}, "_r", rates{j});
       l = strcat("c=", conns{i}, " r=", rates{j});
       m = eval(data_name);
       % Column 1: time (ns), shifted to start at 0. Column 2: latency (ns).
       plot((m(:,1) - min(m(:,1))) / 1e+9, m(:,2) / 1e+9, ".");
       hold on;
       legends(1, end+1) = l;
     endfor
   endfor
   legend(legends, "location", "northeastoutside");
   xlabel("experiment time (s)");
   ylabel("latency (s)");
   t = sprintf("200-node testnet - %s", release);
   title(t);
   ```

10. Consider adjusting the axis in case you want to compare your results to the baseline, for instance:

    ```octave
    axis([0, 100, 0, 30], "tic");
    ```

11. Use Octave's GUI menu to save the plot (e.g. as `.png`).

12. Repeat steps 9 and 10 to obtain as many plots as deemed necessary.

13. To generate a latency vs throughput plot, using the raw CSV file generated in step 2, follow the instructions for the `latency_throughput.py` script.
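
The manual splitting of `report.txt` by number of connections could be partly automated. The sketch below is a hypothetical helper, not part of the testnet repository. It relies on each experiment starting at an `Experiment ID: ` line (as the `grep` in the extraction loop implies) and, as an assumption about the report format, on each experiment containing a line of the form `Connections: N`. It does not sort by tx rate; that part remains manual.

```shell
# split_report: hypothetical helper. Splits the report file given as $1
# into report01.txt, report02.txt, ... in the current directory, one
# file per connection count. Existing files with those names are
# overwritten.
# ASSUMPTION: every experiment block starts with "Experiment ID: " and
# contains a "Connections: N" line.
split_report() {
  awk '
    # Write the buffered experiment block to the file for its connection count.
    function flush(   out) {
      if (block != "" && conns != "") {
        out = sprintf("report%02d.txt", conns)
        printf "%s", block > out
      }
      block = ""; conns = ""
    }
    /^Experiment ID: / { flush() }
    { block = block $0 "\n" }
    /Connections:/ { conns = $NF }
    END { flush() }
  ' "$1"
}
```

After sorting each file by rate, a side-by-side view can be produced with the standard `paste` utility, for example `paste report01.txt report02.txt report04.txt > report_tabbed.txt`.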
- Stop the Prometheus server if it is running as a service (e.g. a `systemd` unit).
- Unzip the Prometheus database retrieved from the testnet, and move it to replace the local Prometheus database.
- Start the Prometheus server and make sure no error logs appear at start up.
- Enter the metrics you want to gather or plot.
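
Since the restored database only holds samples from the time of the test, queries generally need an explicit time range rather than the current time. The sketch below uses the Prometheus HTTP API's `/api/v1/query_range` endpoint through `curl`. `prom_query_range` is a hypothetical helper, and the default address `http://localhost:9090` is an assumption about where the local server listens.

```shell
# prom_query_range: hypothetical helper around the Prometheus range-query API.
#   $1 = PromQL expression
#   $2 = start time, $3 = end time (Unix timestamps or RFC 3339)
#   $4 = resolution step (optional, defaults to 15s)
# ASSUMPTION: the server is at localhost:9090 unless PROM_URL is set.
prom_query_range() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "start=$2" \
    --data-urlencode "end=$3" \
    --data-urlencode "step=${4:-15s}"
}

# Example call (START and END are placeholders for the test's time window):
#   prom_query_range tendermint_consensus_height "$START" "$END"
```

The response is JSON, so it can be saved to a file or piped into any JSON tool for plotting.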
This section explains how the rotating node test was carried out, for reproducibility purposes.
1. [If you haven't done it before] Follow steps 1-4 of the `README.md` at the top of the testnet repository to configure Terraform and `doctl`.
2. Copy file `testnet_rotating.toml` onto `testnet.toml` (do NOT commit this change).
3. Set variable `VERSION_TAG` to the git hash that is to be tested.
4. Run `make terraform-apply EPHEMERAL_SIZE=25`.
   - WARNING: Do NOT forget to run `make terraform-destroy` as soon as you are done with the tests.
5. Follow steps 6-10 of the `README.md` to configure and start the "stable" part of the rotating node testnet.
6. As a sanity check, connect to the Prometheus node's web interface and check the graph for the `tendermint_consensus_height` metric. All nodes should be increasing their heights.
7. On a different shell, run `make runload ROTATE_CONNECTIONS=X ROTATE_TX_RATE=Y`.
   - `X` and `Y` should reflect a load below the saturation point (see, e.g., this paragraph for further info).
8. Run `make rotate` to start the script that creates the ephemeral nodes, and kills them when they are caught up.
   - WARNING: If you run this command from your laptop, the laptop needs to be up and connected for the full length of the experiment.
9. When the height of the chain reaches 3000, stop the `make runload` script.
10. When the rotate script has made two iterations (i.e., all ephemeral nodes have caught up twice) after height 3000 was reached, stop `make rotate`.
11. Run `make retrieve-data` to gather all relevant data from the testnet into the orchestrating machine.
12. Verify that the data was collected without errors:
    - at least one blockstore DB for a Tendermint validator
    - the Prometheus database from the Prometheus node
    - for extra care, you can run `zip -T` on the `prometheus.zip` file and (one of) the `blockstore.db.zip` file(s)
13. Run `make terraform-destroy`.
Steps 8 to 10 are highly manual at the moment and will be improved in the next iterations.
In order to obtain a latency plot, follow the instructions above for the 200 node experiment, but:
- The `report.txt` file contains only one experiment
- Therefore, no need for any `for` loops
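
With a single experiment, the body of the extraction loop from the 200 node instructions reduces to one pipeline. This is a minimal sketch that reuses the same `grep`, `tr` and `awk` conversion; `extract_experiment` and the output name `rotating.dat` are hypothetical.

```shell
# extract_experiment: hypothetical helper.
#   $1 = experiment UUID, $2 = raw latency CSV file
# Prints columns 2 and 3 of the rows matching the UUID, space-separated,
# in the form that Octave's load("-ascii", ...) accepts.
extract_experiment() {
  grep -- "$1" "$2" | tr , ' ' | awk '{ print $2, $3 }'
}

# Example call (file names are assumptions):
#   uuid=$(grep '^Experiment ID: ' report.txt | awk '{ print $3 }')
#   extract_experiment "$uuid" raw.csv > rotating.dat
```

Filtering by UUID is kept even though there is only one experiment, so that any line in the CSV that is not a data row for that experiment is left out of the `.dat` file.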
As for Prometheus, the same method as for the 200 node experiment can be applied.