LPN stands for Latency Petri Net, a variant of Petri nets used to model the performance of hardware accelerators.
This repo contains the LPN definitions, tools that operate on LPNs, and LPN models of various hardware accelerators.
We provide a Docker image that contains the LPN repo and a set of experiments. You can jump to the experiments section to start running the experiments we have done.
We constructed example LPNs for the following hardware:
- JPEG Decoder (an image decoding accelerator)
- Versatile Tensor Accelerator (an ML accelerator)
- Menshen (an RMT pipeline)
- Darwin (a genomic sequence alignment accelerator)
- Protoacc (a protobuf accelerator)
- PCIe Topology (a reconfigurable PCIe topology)
lpnlang (short for "LPN language") is a Python package that contains the definitions of all LPN constructs and a set of useful tools built on LPN.
To install lpnlang, follow these steps:
- `bash env_setup.sh` (installs the lpnlang package locally)
- `bash setup_klee.sh` (installs KLEE; you can skip this step if you only want to work with LPN in Python)
- Token (usage: `from lpnlang import Token`): A token is similar to a dictionary. It has properties, which play the role of dictionary keys, and an integer value for each property. A token also carries an integer timestamp that denotes the time at which the token was produced.
- Place (usage: `from lpnlang import Place`): A buffer that holds tokens. The type of a place is the property set of the tokens it contains, so all tokens in one place need to have the same property set.
- Transition (usage: `from lpnlang import Transition`): An actor that changes the state of the LPN by consuming and producing tokens. A transition is enabled when its input places hold the required tokens. After a delay, the transition commits (fires), which consumes tokens from the input places and produces tokens in the output places.
- DelayFunc (usage: `from lpnlang import DelayFunc`): Used to construct delay functions for transitions. A delay function returns an integer that defines how long the corresponding transition waits between being enabled and committing.
- InWeightFunc (usage: `from lpnlang import InWeightFunc`): Used to construct input edge functions that return the weight of an input edge. InWeightFunc returns a function f, and f returns an integer denoting the weight. One InWeightFunc can be parametrized to create different edge functions with slight variations.
- OutWeightFunc (usage: `from lpnlang import OutWeightFunc`): Similar to InWeightFunc; used to construct output edge functions that return the tokens produced on output edges.
- GuardFunc (usage: `from lpnlang import GuardFunc`): Used to create guard (boolean) conditions on input edges. Even if the input place has enough tokens as indicated by the input weights, the transition is not enabled if the guard function returns false.
- ThresholdFunc (usage: `from lpnlang import ThresholdFunc`): Similar to InWeightFunc; used to construct input edge functions that return the threshold of an input edge. If a threshold is defined for an input edge, the enabling condition of the transition is checked against the threshold value; however, the number of tokens consumed when the transition commits still equals the input weight.
Check examples under lpn_family/accel_lib/*/lpn_def to see how those constructs are used.
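To make the enabling and commit rules above concrete, here is a minimal plain-Python sketch of a toy timed Petri net. It deliberately does not use the lpnlang API: `ToyToken`, `ToyPlace`, `ToyTransition`, and `run` are hypothetical names invented for this illustration, each transition has a single input and a single output place, and the `busy_until` field encodes an assumption that a transition handles one commit at a time.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ToyToken:
    """Dictionary-like token: integer properties plus a production timestamp."""
    props: Dict[str, int]
    ts: int = 0


@dataclass
class ToyPlace:
    """A FIFO buffer of tokens that all share the same property set."""
    name: str
    tokens: List[ToyToken] = field(default_factory=list)


@dataclass
class ToyTransition:
    """Moves tokens from one input place to one output place."""
    name: str
    src: ToyPlace
    dst: ToyPlace
    weight: int                               # tokens consumed per commit
    delay: Callable[[List[ToyToken]], int]    # wait between enable and commit
    threshold: Optional[int] = None           # tokens needed to enable (default: weight)
    guard: Callable[[List[ToyToken]], bool] = lambda toks: True
    busy_until: int = 0                       # assumption: one commit at a time

    def try_fire(self) -> bool:
        need = self.weight if self.threshold is None else self.threshold
        if len(self.src.tokens) < need:
            return False                      # not enough tokens to enable
        inputs = self.src.tokens[: self.weight]
        if not self.guard(inputs):
            return False                      # guard vetoes the transition
        # Enabled once every required token has arrived and the previous commit is done.
        enabled_at = max([t.ts for t in self.src.tokens[:need]] + [self.busy_until])
        commit_at = enabled_at + self.delay(inputs)
        del self.src.tokens[: self.weight]    # consume `weight` tokens, not `threshold`
        total = sum(t.props["size"] for t in inputs)
        self.dst.tokens.append(ToyToken({"size": total}, commit_at))
        self.busy_until = commit_at
        return True


def run(transitions: List[ToyTransition]) -> None:
    """Keep firing until no transition can commit."""
    while any(t.try_fire() for t in transitions):
        pass


inp = ToyPlace("in", [ToyToken({"size": 8}) for _ in range(4)])
out = ToyPlace("out")
done = ToyPlace("done")

# `pack` merges two tokens per commit; its delay depends on the consumed tokens.
pack = ToyTransition("pack", inp, out, weight=2,
                     delay=lambda toks: 2 + sum(t.props["size"] for t in toks) // 4,
                     guard=lambda toks: all(t.props["size"] > 0 for t in toks))
# `drain` consumes one token per commit but needs three tokens present to be enabled.
drain = ToyTransition("drain", out, done, weight=1, delay=lambda toks: 1, threshold=3)

run([pack, drain])
print([t.ts for t in out.tokens])   # commit timestamps of the packed tokens
print(len(done.tokens))             # tokens that made it past `drain`
```

In this run `drain` never commits, because only two of the three tokens required by its threshold ever reach `out`. The real constructs are richer (for example, weights and thresholds are themselves functions), so treat this only as a mental model and refer to the `lpn_def` examples for actual usage.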
- simulate LPN (usage: `from lpnlang import lpn_sim`): simulates an LPN in Python.
- lpn2visual (usage: `from lpnlang.lpn2visual import lpn_visualize`) (example): converts an LPN into an interactive HTML file. Open the HTML file in any browser to view the LPN graphically.
- lpn2sim (usage: `from lpnlang.lpn2sim import pylpn2cpp`) (example): converts a Python LPN into a C++ LPN, which can then be compiled into a fast simulator. The generated simulator cannot parse command-line inputs; to support that, separate C++ files that parse the inputs have to be written manually.
- lpn2pi (usage: `from lpnlang.lpn2pi import lpn_pi`) (example1, example2): converts an LPN and a user-defined symbolic input space into a readable Python program that serves as a performance interface.
- lpn2smt (usage: `from lpnlang.lpn2smt import lpn_smt`) (example): converts an LPN and a user-defined symbolic input space into verification conditions that can then be checked against user-defined queries using solvers.
- lpn2symlpn (usage: `from lpnlang.symbex import lpn2symlpn`) (example): converts a Python LPN and a user-defined symbolic input space into a C++ LPN that can be symbolically executed by KLEE. Symbolic execution generates input classes, which split the user-defined input space into subspaces that `lpn2smt` or `lpn2pi` can process one at a time.
- log_one_class (usage: `from lpnlang.symbex import log_out_as_one_class`) (example): converts the user-defined input space into a single input class. It skips symbolic execution but relies on the user to define the input space properly as one class. Users who are confident that the input space is one class can use this method to skip symbolic execution.
lpn_examples contains the LPNs we have built for a list of hardware accelerators and a PCIe topology. Each example has a Makefile that supports the following commands:
- `make run_example`: simulates the LPN in Python.
- `make run_pi`: generates `perf_interface.py`.
- `make run_smt`: runs the example SMT solving.
- `make run_translate`: translates the Python LPN into a C++ LPN.
- `make run_cpp`: runs `run_translate`, compiles the result into a simulator, and runs it.
- Run `docker pull mjccjm/lpn_ae0`
- Run `docker image list` and find the image you pulled.
- Run `docker run -it mjccjm/lpn_ae0`
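For scripted setups, the same Docker steps can be assembled in Python. This is only a sketch: the helper names are hypothetical, and `docker ps` is included as one way to look up the container ID needed by the `docker cp` step that the experiment instructions use to copy the generated PDFs out of the container.

```python
from typing import List

IMAGE = "mjccjm/lpn_ae0"  # image name from the steps above


def pull_cmd(image: str = IMAGE) -> List[str]:
    return ["docker", "pull", image]


def run_cmd(image: str = IMAGE) -> List[str]:
    # -it keeps an interactive terminal attached to the container.
    return ["docker", "run", "-it", image]


def list_containers_cmd() -> List[str]:
    # Lists running containers, including their IDs.
    return ["docker", "ps"]


def copy_out_cmd(container_id: str, src_in_container: str, dst_on_host: str) -> List[str]:
    return ["docker", "cp", f"{container_id}:{src_in_container}", dst_on_host]


# Print the commands so they can be pasted into a shell or passed to subprocess.run.
for cmd in (pull_cmd(), run_cmd(), list_containers_cmd(),
            copy_out_cmd("<container-id>", "/home/experiments/pdfs", "./pdfs")):
    print(" ".join(cmd))
```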
- Accuracy and speedup of LPN against cycle-accurate simulation.
- Automatically generating a fast simulator through `lpn2sim`.
- Automatically generating performance interfaces (`lpn2pi`).
- Accuracy of generated performance interfaces (accuracy of `lpn2pi`).
- Automatically generating performance verification conditions (`lpn2smt`).
Below you can find detailed instructions on how to evaluate each experiment. We use cycle-accurate simulators (Verilator or Vivado XSIM) to measure the ground truth, so obtaining the baseline results can take a while; we mark these experiments with a 🕒. Experiments that obtain performance metrics using LPNs are generally much faster (see the paper).
- Accuracy and Speedup of LPN:
  - LPN is compared with a cycle-accurate simulator (Verilator) on accuracy and speedup. We provide accuracy results for:
    - JPEG decoder
    - Menshen: the experiment includes 2 testbenches.
      - 🕒: Running the baseline simulation for Menshen requires Vivado XSIM; please consider setting it up on your machine. For your convenience, we report the measured results in the corresponding file.
    - Protoacc: tested with the Google HyperProtoBench.
      - 🕒: The baseline benchmark runs for a few days using Verilator, so we provide screenshots of the experiments we reran (under `/home/experiments/protoacc_exp/protoacc_verilator/screenshots/`). To run the simulation, go to `/home/experiments/protoacc_exp/protoacc_verilator` and run `bash command.sh sim_binary/verilated_simulator bench<0-5>` (choose 0 to 5, corresponding to the 6 benchmarks).
    - VTA: the experiment is done through the `tvm` autotuning example on VTA.
    - Darwin: the experiment includes 10 test examples.
  - We provide speedup results for:
    - JPEG decoder: the experiment includes 50 images.
    - Protoacc: the experiment includes the Google HyperProtoBench.
      - 🕒: The baseline could take days to run and requires simulating a RISC-V core, which makes simulation intractably slow. To estimate the runtime, we provide a testbed that feeds the accelerator with random data and measures simulation performance. Verilator's performance is known to be independent of the input stimulus, so the measurements are quite accurate.
    - VTA: the experiment is from the tvm autotuning example with VTA.
      - Please use Verilator version 4.022 (included in the docker).
    - Darwin: the experiment includes 10 test examples.
  - Instructions (inside the docker):
    - Run tvm vta autotuning
      - Open a tmux session (run `tmux`)
      - Open 3 windows (to open one more window: ctrl+p, then press % or "). In each window, go to the following directory:
        - `/home/lpn/lpn_with_x/lpn_in_tvm/tvm/vta/tutorials` (window A)
        - `/home/lpn/lpn_with_x/lpn_in_tvm/` (window B)
        - `/home/lpn/lpn_with_x/lpn_in_tvm/` (window C)
      - To run the autotuning experiment:
        - Verilator:
          - In window C, run `bash compile_verilator.sh`
          - In window B, run `./rpc_tracker.sh`
          - In window C, run `./rpc_server.sh > log_verilator_time 2>&1`
          - In window A, run `bash run_tsim.sh`
          - Wait for `bash run_tsim.sh` to finish (~2h)
        - LPN:
          - In window C, run `bash compile_lpn.sh`
          - In window B, run `./rpc_tracker.sh`
          - In window C, run `./rpc_server.sh > log_lpn_time 2>&1`
          - In window A, run `bash run_tsim.sh`
          - Wait for `bash run_tsim.sh` to finish
    - Go to `/home/experiments`
    - Run `bash run_all.sh`
  - You will see a list of plots (in PDF) generated in `/home/experiments/pdfs`
  - To copy the PDFs out of the container, use `docker cp <container-id>:<src_in_container> <dst_in_your_machine>`.
- Automatically generating fast simulator through lpn2sim:
- Automatically generating performance interfaces:
  - In the examples, run `make run_pi` or `make run_pi_oneclass`. A `perf_interface.py` will be generated. Supported examples:
  - VTA experiments are not provided, because the interface for instruction sequences gets very large and is not readable. If you still want to try extracting one, we recommend specifying a concrete input (rather than a symbolic input space). The same procedure applies to performance interface extraction, but without the burden of handling a large number of symbols that cannot be summarized.
- Accuracy of generated performance interfaces:
  - Instructions (inside the docker):
    - Go to `/home/lpn/lpn_examples/jpeg_decoder/` and run `make run_pi`
    - Go to `/home/lpn/lpn_examples/protoacc/` and run `make run_pi_oneclass`
    - Go to `/home/lpn/lpn_examples/darwin/` and run `make run_pi`
    - Go to `/home/experiments/perf_interface_exp` and run `bash run_all.sh`
  - You will see a PDF generated under `pdfs/` for the accuracy comparison.
- Automatically generating performance verification conditions:
  - In the examples, run `make run_smt`. It runs an example query on an example input space. Supported examples:
    - JPEG decoder
    - Protoacc
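When reading the accuracy and speedup plots from the first experiment, it can help to keep in mind how such numbers are typically computed. The sketch below assumes that accuracy is reported as the relative error of LPN-predicted latency against the cycle-accurate baseline and that speedup is the ratio of wall-clock simulation times; the paper's exact definitions may differ, the function names are hypothetical, and the numbers are placeholders rather than experimental results.

```python
from statistics import mean
from typing import List, Sequence


def relative_errors(predicted: Sequence[float], measured: Sequence[float]) -> List[float]:
    """Per-benchmark |predicted - measured| / measured."""
    return [abs(p - m) / m for p, m in zip(predicted, measured)]


def speedup(baseline_seconds: float, lpn_seconds: float) -> float:
    """How many times faster the LPN simulation ran than the baseline."""
    return baseline_seconds / lpn_seconds


# Placeholder numbers for illustration only; these are NOT results from the experiments.
measured_cycles = [1000, 2000, 4000]   # cycle-accurate baseline (hypothetical)
lpn_cycles = [980, 2040, 4000]         # LPN predictions (hypothetical)

errors = relative_errors(lpn_cycles, measured_cycles)
print(f"mean relative error: {mean(errors):.2%}")
print(f"speedup: {speedup(120.0, 0.5):.0f}x")
```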
If you have any questions or suggestions, feel free to reach out to us at ([email protected]).