A library and collection of scripts to capture sparse memory allocation and access traces of a workload.
Heimdallr PMU relies on perf and several other features specific to the GNU/Linux platform.
However, successful experiments on Intel x86 and POWER9 indicate compatibility with a variety of hardware platforms.
To run the Python scripts, create a fresh venv and install the requirements from vis/requirements.txt.
$ python3 -m venv ./venv
$ source ./venv/bin/activate
$ python3 -m pip install -r vis/requirements.txt
To prepare for tracing a workload, build the tracealloc library, which interposes the relevant library calls to capture allocation events.
It requires libmemkind to customize the memory resources used to back workload allocations.
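On common distributions the dependency is available from the package repositories; the package names below are typical examples and may differ on your system.
$ sudo apt install libmemkind-dev     # Debian/Ubuntu (example package name)
$ sudo dnf install memkind-devel      # Fedora/RHEL (example package name)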
Create a tracealloc/build/ directory, then initialize the build environment and build the library.
$ mkdir tracealloc/build/ && cd tracealloc/build/
$ cmake ..
$ make -j <ncpu>
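As an illustration only, the interposer could in principle be applied manually by preloading the built shared object; the file name libtracealloc.so and the use of LD_PRELOAD are assumptions here, and runs.sh normally takes care of wrapping the workload.
$ LD_PRELOAD=$PWD/libtracealloc.so ./my_workload    # hypothetical manual invocation; runs.sh does this for you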
The runs.sh script wraps workloads with the necessary infrastructure to collect sparse traces.
$ ./runs.sh $HOME/traces_npb 8 $HOME/npb_bin/*.A
The example above will run all NPB benchmarks of size class A found in the $HOME/npb_bin/ directory separately with 8 repetitions.
Results will be written to subdirectories under $HOME/traces_npb/ for each individual run.
Each argument from the third onward is taken as the commandline of a separate workload to be executed.
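For example, two unrelated workloads could be traced in one batch; whether multi-word commandlines need to be quoted like this is an assumption about how runs.sh splits its arguments.
$ ./runs.sh $HOME/traces_mixed 4 "$HOME/bin/app1 --small" "$HOME/bin/app2"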
At the top of the runs.sh script are default definitions for several variables that should be customized for the target machine:
HEIMDALLR_READEVENT: the name of the perf event used to sample read (load) accesses
HEIMDALLR_WRITEEVENT: the name of the perf event used to sample write (store) accesses
HEIMDALLR_PMEMDIR: if PMEM should be used to back allocations, this is the mountpoint of the DAX filesystem
HEIMDALLR_PMEMSIZE: if PMEM should be used to back allocations, this is the byte size of the reserved space
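As an illustration, on an Intel x86 machine these definitions might look as follows; the event names and PMEM parameters are examples only and have to be matched against the output of perf list and the local DAX setup.
HEIMDALLR_READEVENT="cpu/mem-loads/P"       # example perf event for load sampling
HEIMDALLR_WRITEEVENT="cpu/mem-stores/P"     # example perf event for store sampling
HEIMDALLR_PMEMDIR="/mnt/pmem0"              # example DAX filesystem mountpoint
HEIMDALLR_PMEMSIZE=$((32 * 1024 * 1024 * 1024))   # example: 32 GiB reserved for PMEM-backed allocations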
Further down in runs.sh, execution parameters for multiple series of runs are defined:
# $mode $hms $mem $cpu $dry $freq
echo "1loc_DR 0 1 1 1 0" >> $modefile
echo "1loc_50 0 1 1 0 5000" >> $modefile
This defines two configurations: one dry run without any instrumentation apart from timing, to capture the natural runtime of the workload, and one with both the allocation interposer library and perf record to capture actual traces; both allocate NUMA-local memory.
$mode: identifier string appended to the result directory name to distinguish different configurations
$hms: when set to 1, causes the tracealloc library to use PMEM backing memory
$mem: selects the NUMA node that serves both workload (unless hms=1) and instrumentation allocations
$cpu: selects the NUMA node to pin workload execution to
$dry: when set to 1, disables the tracealloc library, i.e., hms=1 becomes irrelevant and no allocations are captured
$freq: selects the target sampling frequency for perf record in Hz; when set to 0, disables perf instrumentation completely and no accesses are captured
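Following the column semantics above, a PMEM-backed traced run at 5000 Hz could, for example, be added with a line such as the following (the identifier string 1hms_50 is arbitrary):
echo "1hms_50 1 1 1 0 5000" >> $modefile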
Runs for different configurations are executed in sequence; they do not take part in the shuffling scheme applied across workloads and repetitions.
The choice of $freq must be supported by the system configuration, as by default the kernel imposes an upper limit on the sampling frequency, which is lowered dynamically if PMU interrupts take too much time on average.
This behaviour can be changed with the following sysctl parameters:
/proc/sys/kernel/perf_event_max_sample_rate: the upper frequency limit in Hz, which can be lowered dynamically even if it was set explicitly
/proc/sys/kernel/perf_cpu_time_max_percent: modulates the aggressiveness of the dynamic limit lowering; if set to 100, perf_event_max_sample_rate will not be modified dynamically, but curiously, no user changes to the value seem to be accepted either
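To permit, for example, a 5000 Hz target frequency, the limits can be raised before a run; the values below are examples.
$ echo 100000 | sudo tee /proc/sys/kernel/perf_event_max_sample_rate
$ echo 100 | sudo tee /proc/sys/kernel/perf_cpu_time_max_percent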
Visualizing trace results is a two-step process: first, raw trace data is parsed from the result directory and reorganized into an SQLite database file, which can hold traces for several runs.
Run identifiers are based on result subdirectory names.
The actual visualization script then queries data from the trace database, which improves performance compared to parsing the raw trace files repeatedly on the fly.
$ vis/analyze.py -i $HOME/traces_npb -o ./traces_npb.sqlite
This command performs the first step: it parses the raw result files containing access and allocation traces and combines them into the database.
In case several iterations of the same run exist in the result directory, only the first repetition is parsed completely, while for subsequent instances only execution statistics are recorded to the database to save space and time.
The --all commandline argument overrides this behaviour and parses all repetitions of a run.
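For example, to parse every repetition of every run:
$ vis/analyze.py -i $HOME/traces_npb -o ./traces_npb.sqlite --all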
The vis/visualize.py script works with the resulting trace database:
$ vis/visualize.py ./traces_npb.sqlite --list # (1)
$ vis/visualize.py ./traces_npb.sqlite --stat # (2)
$ vis/visualize.py ./traces_npb.sqlite --run bt.A.hms # (3)
Commandline (1) lists all run identifiers that are present in the trace database, whereas (2) aggregates execution and overhead statistics for each run and its repetitions.
Commandline (3) spawns an interactive matplotlib window visualizing the trace of a given run identifier.
The last form of the command takes several optional parameters to control visual appearance and select trace subsets on the time and address axes for quicker rendering of interesting regions.
Refer to visualize.py --help for details.