The dataset was compiled by Fatih Ertam at Firat University, Turkey. It contains 65532 instances with 12 features collected from the logs of the university firewall system. The features are summarised below:
The action, which is our response variable, is categorised into four classes.
We employed various EDA techniques to analyse the data and then performed data preprocessing. We used feature engineering to add some extra features and then built models on the transformed data. All these steps are explained in detail in the source code.
Feature relevance according to the random forest (RF) classifier is shown below.
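As a rough illustration of how a feature ranking like this can be produced, the sketch below fits a scikit-learn random forest and reads its impurity-based importances. It is not the project's actual code: the data is a synthetic four-class stand-in generated with make_classification, and the feature names and hyperparameters are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the firewall features (NOT the real dataset).
X, y = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)
# Hypothetical names; the real dataset has its own column names.
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Assumed hyperparameters, chosen only for the illustration.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances: non-negative and normalised to sum to 1.
importances = model.feature_importances_
ranking = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can favour features with many distinct values, so a ranking like this is best read as a guide rather than a definitive ordering.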
We used log loss as the metric for comparison.
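For reference, multiclass log loss is the mean negative log of the probability a model assigns to the true class. The sketch below is a minimal pure-Python version, assuming integer class labels and one probability list per sample; the labels and probabilities are made up for illustration and are not taken from the dataset. It shows that a uniform guess over four classes scores ln 4 (about 1.386), while confident correct predictions score close to 0.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0) when a model assigns zero probability.
        p = min(max(probs[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)


n_classes = 4
y_true = [0, 1, 2, 3]  # illustrative labels, one per class

# Uniform guess: every class gets probability 1/4.
uniform = [[1.0 / n_classes] * n_classes for _ in y_true]
uniform_loss = log_loss(y_true, uniform)

# Confident and correct: 0.97 on the true class, 0.01 on each other class.
confident = [
    [0.97 if k == label else 0.01 for k in range(n_classes)]
    for label in y_true
]
confident_loss = log_loss(y_true, confident)

print(f"uniform: {uniform_loss:.4f}, confident: {confident_loss:.4f}")
```

Because the metric penalises confident wrong predictions heavily, it rewards well-calibrated probabilities and not only correct labels.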
The baseline model has a log loss of 1.45. For a model to be considered accurate, its log loss has to be well below 1.45. Keeping that in mind, we built 7 different models. From the results shown in Figure 4, it is clear that the random forest classifier is the best model in terms of test log loss, with a value of 0.0095. In terms of runtime, however, the best model is the decision tree, with a value of 0.0001. Since we want a model that satisfies the low-latency requirement while still producing reliable predictions on firewall traffic logs, the support vector model offers the best of both worlds.
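The runtime comparison above can be reproduced in spirit with a small timing helper. The sketch below is an assumption about how per-prediction latency might be measured, not the procedure used in the source code: mean_latency and toy_predict are hypothetical names, and the toy rule stands in for a fitted model's predict call.

```python
import time


def mean_latency(predict, rows, repeats=5):
    """Best-of-`repeats` average wall-clock seconds per row."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for row in rows:
            predict(row)
        elapsed = time.perf_counter() - start
        # Keep the fastest pass to reduce noise from other processes.
        best = min(best, elapsed)
    return best / len(rows)


def toy_predict(row):
    """Hypothetical stand-in for a trained model's single-row prediction."""
    return 0 if row[0] < 10 else 1


rows = [(i % 20, i) for i in range(1000)]  # made-up input rows
latency = mean_latency(toy_predict, rows)
print(f"per-row latency: {latency:.2e} s")
```

Measuring each candidate model with the same helper and the same rows keeps the latency figures comparable alongside the log loss values.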