Half-Space Tree, known as HSTree, is an efficient data structure for detecting anomalies in streaming data. It is robust when incoming data is endless and contains a large amount of normal data. HSTree has a constant amortised time complexity and a constant memory requirement when processing data. It also has a robust detection accuracy when parameters are different.
This repository contains my implementation of HSTree data structure and tests with HTML and SMTP data. The implementation is in Python3 and the idea of HSTree is based on the Tan, Ting and Liu's research[1].
Python (>=3.3)
This script is written in Python3 style. Please make sure the latest Python3 is installed on your computer before the next steps. If you already have Python installed, please make sure its version should be at least 3.3.
NumPy (>= 1.8.2)
andSciPy (>= 0.13.3)
NumPy
is the fundamental package for scientific computing with Python and SciPy
has a collection of numerical algorithms. If you don't have NumPy
and SciPy
installed, please use the following command to install them.[2]
python3 -m pip install --user numpy scipy
scikit-learn
This script used the MinMaxScaler
module of Scikit-learn
. Please also install this package using the following command.[3]
pip install scikit-learn
- Other packages
If you miss the prequisites of the packages listed in 1 - 3, please follow the instructions to install them.
HS tree.py
: The implementation of HSTree with initialization of test cases.http.csv
andlabels.csv
: The sample HTTP data and its corresponding labels.smtp.csv
andsmtp_labels.csv
: The sample SMTP data and its corresponding labels.
The corresponding label files are only used for test the accuracy of anomaly detection.
The method BuildSingleHSTree
is used to build one single HSTree
. It returns a value of Node
class and requires the following five input arguments:
max_arr
: An array containing the maximum values of each dimension.
min_arr
: An array containing the minimum values of each dimension.
k
: The depth of the current node.
h
: The maximum depth value.
dimensions
: The number of dimensions.
This method updates the mass profile of normal data in each node. It takes the following three input arguments:
x
: An instance of data.
node
: The node in an HSTree.
ref_window
: Boolean value. If the instance x
is in current reference window, then it is true
. Otherwise ref_window
is set to false
.
This method is the major sequence of evaluating anomalies in streaing data. It takes four arguments, X
, psi
, t
and h
, as inputs and returns a list
of scores of the instances. This method can be called directly for test data. The input arguments are specified as the following:
X
: The input data. This implementation only simulates the data streaming process, so all data is provided at the beginning.
psi
: Window Size. The HSTrees are initialized with the first psi
instances. In this implementation, the first psi
elements in X
will be the initial data of HSTrees.
t
: Number of HSTrees.
h
: The maximum depth of each HSTree.
The method accuracy_value
is a helper method to evaluate the test results. It prints four values: True Positive
, False Positive
, True Negative
and False Negative
. The inputs are:
scores
: A List
of scores. It should be the return value of StreamingHSTrees()
method.
y
: A List
of reference labels (Ground truth).
num
: The total number of anomalies in the dataset. This implementation is not elegant. I will refactor it in the future.
To test this data structure, you need to do the following steps.
The input sequence can be hard-coded or loaded from files. Please initialize it with Numpy. Here are two examples.
# Initialize with hard-coded data
X = [[0.5], [0.45], [0.43], [0.44], [0.445], [0.45], [0.0]]
X = np.array(X)
# Load from file
X = np.genfromtxt('http.csv',delimiter=',')
If the data is not distributed in 0-1, we need to preprocess the data to satisfy this property.
# Use MinMaxScaler in scikit-learn to preprocess the data
scaler = MinMaxScaler()
X_new = scaler.fit_transform(X)
Call the method StreamingHSTrees
to simulate data streaming process and get a list of scores of the evaluated instances. You may need to adjust the parameters to get a better accuracy, but HSTree is not sensitive to parameters.
# Parameters need to adjust
final_scores = StreamingHSTrees(X_new, psi, t, h)
Call accuracy_value
to evaluate the anomaly detection in step 3.
# "y" is the reference labels and "13" is the number of the anomalies.
accuracy_value(final_scores, y, 13)
This implementation of HSTree only simulates its procedures. It can be modified and extended to receive real streaming data. The script will be able to read data instances dynamically from text files or real Internet streams.
-
Fast Anomaly Detection for Streaming Data (Tan, Ting, Liu) https://pdfs.semanticscholar.org/73b6/b7d9e7e225719ad86234927a3b60a4a873c0.pdf
-
SciPy Official Site https://www.scipy.org/about.html
-
Installing scikit-learn http://scikit-learn.org/stable/install.html