This document provides a step-by-step guide on how to use CuCAMASim for decision tree inference simulation. CuCAMASim is a CUDA-based implementation of CAMASim, a decision tree inference simulator.
Please refer to README.
Assume the built executable is at build/CuCAMASim_runner
, then simply run the simulation by ./build/CuCAMASim_runner
.
You can choose from a list of implemented tasks for simulation. Currently, we have implemented the following tasks:
- CAM-based decision tree inference: In this task, we use CuCAMASim to simulate the decision tree inference with CAM. You can change the setting in the config file to simulate CAM inference accuracy under different noise setups and mapping strategies. A noise-free software-based inference is also performed at the begining of the task for comparing the inference accuracy with CAM-based inference. You can use
--task CAM_inference
to perform this task. This is the default task. - Software-based decision tree inference: In this task, we perform software-based decision tree inference without CAM. The inference process can be noisy, according to the content in the config file. You can use
--task software_inference
to perform this task. Please also explicitly select an corresponding software inference config file for this task. - Statistics for error distribution: In this task, we do the statistics for the error distribution of the noisy inference process. We calculate the proportion of each error type, e.g. no-match error, one-match error, etc. You can use
--task errDistr
to perform this task. Since randomness is involved, you can also specify a--sample_time <number of times>
argument to repeat the simulation for a given number of times and give a final statistical result. - Print information: In this task, we print basic informations of the dataset, trained tree, etc. You can use
--task print_info
to perform this task.
You can specify a config file for either CAM-based or software-based inference as an argument when running the executable. For example, to run the simulation with a specific config file, you would run:
./build/CuCAMASim_runner --config <config_file_path>
The config file type (CAM config/software inference config) should match the task type.
You can choose from a list of implemented datasets for simulation. Currently, we have implemented the following datasets:
- Belgium TSC
- Breast Cancer Wisconsin (Diagnostic)
- Eye-movements
- gas_concentrations
- Gesture Phase Segmentation
- iris
- MNIST
- Haberman's Survival
The dataset details including how they are loaded and stored can be found in
data/datasets
folder.
To run the simulation with a specific dataset, you need to specify the dataset name as an argument when running the executable. For example, to run the simulation with the MNIST dataset, you would run:
./build/CuCAMASim_runner --dataset MNIST
The dataset name passed as an argument might be different with its original name. Generally, the dataset name passed as argument should be the same with the dataset folder name in
data/datasets
. You can check thedatasetPath
variable inloadDataset()
function inCuCAMASim/util/data.cpp
to check all avaliable datasets.
For most datasets, we provide two versions: original and normalized. Both versions use a 70%-30% train-test split. The original version is not pre-processed. For the normalized version, we pre-process by normalizing all features to the range [0, 1]. This means the minimum and maximum values of each feature are scaled to 0 and 1, respectively. This normalization aims to enhance robustness against ACAM device variation. To use the normalized version, add suffix _normalized
to the dataset name for most datasets. For example, you would run the following command for the normalized version of MNIST dataset:
./build/CuCAMASim_runner --dataset MNIST_normalized
By default, the adapted version of Belgium TSC
dataset is used (equivalently --datset BTSC_adapted
).
You can choose a different decision tree than the default one for simulation. The usage is as following:
./build/CuCAMASim_runner --use_trained_tree <tree_text_file_path>
If this argument is not specified, the default tree text for each dataset will be used. The default tree text for each dataset can be found in
getTreeTextPath()
function inapplications/DecisionTree/dt2cam.cpp
When simulating a decision tree inference task, two elements are required: dataset and decision tree text. Therefore, adding new dataset is a matter of preparing these two elements and correponding configurations for CuCAMASim simulation.
The dataset should be in a format that CuCAMASim can read. This format is a .mat
file which was originally used by MATLAB to store matrix data. We use matio library for reading the .mat
file. The .mat
file should contain 4 matrices, namely trainInputs
, testInputs
, trainLabels
, and testLabels
.
For the implemented datasets, we use a seperate python script to generate the .mat
file. The following content uses the preparation of "Breast Cancer Wisconsin (Diagnostic)" dataset as an example.
The first step is to load dataset from the UCI ML repo, making a train/test split and formatting the data. In the end, 4 numpy arrays should be given, which contains trainInputs
, testInputs
, trainLabels
, and testLabels
. The trainInputs
and testInputs
are 2-dimensional arrays, where each row is a sample and each column is a feature. The trainLabels
and testLabels
are 1-dimensional arrays, where each element is the label of the corresponding sample. This step is done in load_original_dataset()
in data/datasets/breast_cancer/breast_cancer.py
.
The second step is to store the datasets in a .mat
file. We recommand using scipy.io.savemat()
function with the following code. Please note that we use the transpose of trainInputs
and testInputs
to make the format consistent with the original .mat
file.
scipy.io.savemat(
outputPath,
{
"trainInputs": trainInputs.T.astype(np.float64),
"testInputs": testInputs.T.astype(np.float64),
"trainLabels": trainLabels.astype(np.uint64),
"testLabels": testLabels.astype(np.uint64),
},
)
For the second step, we provide a wrapper function, i.e. convert.datasetName2matFile()
in run_script/util.py
. Before using this function, you also need to add support for the new dataset in util.loadDataset()
function in run_script/util.py
. This step may seems redundant but recommanded, since the util.loadDataset()
function will be used for training a new decision tree on the new dataset. Also, a similar function provides dataset loading support for CAMASim
(EVACAMx
) so it may save some time when you want to run the new datset on both CAMASim
and CuCAMASim
. After adding support in util.loadDataset()
, you can simply provide the name for the dataset and the output path for the .mat
file in the wrapper function and let the script do the rest. You may also normalize the values of each feature in the dataset by specifying normalize=True
argument when calling convert.datasetName2matFile()
. This step is done in data/datasets/breast_cancer/save2mat.py
In the end, you should get a .mat
file which contains trainInputs
, testInputs
, trainLabels
, and testLabels
.
Another thing is to train a new decision tree and prepare the tree text for inference. We recommand using scikit-learn
library for training the decision tree and save the tree text. For "Breast Cancer Wisconsin (Diagnostic)" dataset, this is done by data/treeText/breast_cancer/saveTreeText.py
script.
After preparing the .mat
dataset file and the tree text, you need to add relative information in the code. To be specific, you need to add information about the dataset name to the .mat
file path/the tree text path to the code.
The dataset name to the .mat
file path mapping is done in loadDataset()
function in CuCAMASim/util/data.cpp
, and the dataset name to the tree text path mapping is done in getTreeTextPath()
function in applications/DecisionTree/dt2cam.cpp
. You need to add a new key-value pair in the correspinding variable in both functions.
For the exemplar "Breast Cancer Wisconsin (Diagnostic)" dataset, you need to add something like
{"breast_cancer", "/workspaces/CuCAMASim/data/datasets/breast_cancer/breast_cancer.mat"}
in datasetPath
variable in loadDataset()
and add something like
{"breast_cancer", "/workspaces/CuCAMASim/data/treeText/breast_cancer/breast_cancer.txt"}
in treeTextPath
variable in getTreeTextPath()
.
After adding support for the new dataset in the code, you need to re-compile the code. Then, you can simply run the simulation on the new dataset by providing the dataset name.
./build/CuCAMASim_runner --dataset breast_cancer