Network Anomaly Detector for Netflow Traces

By: Sarthak Grover

Date: 10/3/2019

Aim: Given a set of netflow records, detect anomalous behavior (e.g., port scanning).

Source:

analyzer_clean.py: batches flows every 10s and sends each batch for outlier detection. Checks outlier IP addresses (src and dst combined) to issue alerts.
pandas_analysis.py: extracts the relevant continuous feature set and implements IQR outlier detection for a batch of flows. Returns the src and dst addresses that were outliers based on the number of connections and destination ports in the batch.
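
As a rough illustration of the IQR step described for pandas_analysis.py, the sketch below flags rows whose feature values fall outside the usual Tukey fences. The function name `iqr_outliers`, the column names, the sample values, and the 1.5 multiplier are assumptions made for illustration; they are not taken from the repository code.

```python
import pandas as pd

def iqr_outliers(batch: pd.DataFrame, features, k: float = 1.5) -> pd.DataFrame:
    """Return rows of `batch` outside [Q1 - k*IQR, Q3 + k*IQR] on any listed feature."""
    mask = pd.Series(False, index=batch.index)
    for col in features:
        q1 = batch[col].quantile(0.25)
        q3 = batch[col].quantile(0.75)
        iqr = q3 - q1
        # A row is an outlier if it falls below the lower or above the upper fence.
        mask |= (batch[col] < q1 - k * iqr) | (batch[col] > q3 + k * iqr)
    return batch[mask]

# Hypothetical per-address aggregates for one 10s batch.
batch = pd.DataFrame({
    "ip": ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"],
    "num_conns": [3, 4, 5, 4, 250],
    "num_dst_ports": [1, 2, 1, 2, 180],
})
alerts = iqr_outliers(batch, ["num_conns", "num_dst_ports"])
print(alerts["ip"].tolist())
```

Here only 10.0.0.5 is far outside the fences on both features, so it is the only address that would raise an alert.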

Additional Code:

analyzer.py: Original (unclean) code for outlier detection in multiple steps. Partially implemented and untested.
0. Check basics: packet lengths, local IPs, connection state, protocols, etc.
1. Check src ip and dst ip against a blacklist set in memory
2. Check dst_port and index first use for IP address + aggregate bytes
3. Check IP address and number of ports, protocols, bytes
4. Aggregate flows for (src_ip, dst_ip) pair every T sec and detect outliers statistically
5. TODO: cluster input in batches using 'rbf'
6. TODO: train RNN based LSTM with good data, predict output
7. TODO: aggregate output of all detectors for each flow and produce trustworthy probability
blacklist_update.py: code to update a static blacklist of malicious IP addresses
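
Step 1 above (checking the src and dst IP of a flow against a blacklist set held in memory) might look roughly like the following. This is a minimal sketch: the helper names `load_blacklist` and `blacklisted_endpoints` are hypothetical, and the assumption that blacklist_ips.csv holds one address per row in its first column has not been checked against the real file.

```python
import csv
import io

def load_blacklist(fileobj) -> set:
    """Read one IP address per row (first column) into a set for O(1) lookups."""
    return {row[0].strip() for row in csv.reader(fileobj) if row and row[0].strip()}

def blacklisted_endpoints(flow: dict, blacklist: set) -> list:
    """Return the endpoints of `flow` (src_ip, dst_ip) that appear in the blacklist."""
    return [flow[key] for key in ("src_ip", "dst_ip") if flow[key] in blacklist]

# Stand-in for blacklist_ips.csv; the real file layout is an assumption here.
blacklist = load_blacklist(io.StringIO("203.0.113.7\n198.51.100.23\n"))
flow = {"src_ip": "192.168.1.10", "dst_ip": "203.0.113.7"}
hits = blacklisted_endpoints(flow, blacklist)
print(hits)
```

Keeping the blacklist in a set means each flow costs two constant-time membership checks, which matters when every record in a batch is tested.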

Data Files:

data.csv: test file netflow data
blacklist_ips.csv: offline csv of bad IP addresses generated by blacklist_update.py
df_flow_features.pkl: pandas object with features extracted from the whole dataset grouped by flow tuple for ML
df_src_dst_sampled_10s.pkl: pandas object with complete dataset sampled to 10s intervals for testing ML

Jupyter-Notebooks

netflow-data-scratch-file.ipynb

explore netflow data
group by given features and explore distribution and statistics
loaded full chunk into memory for exploration
histograms of destination port usage
grouping by srcip-dstip
grouping by flow (srcip, sport, dstip, dport, proto)

anomaly-detectors.ipynb

feature extraction after grouping by flow
input raw features: ['ts','ip_protocol','state','src_ip','src_port','dst_ip','dst_port','src_tx','dst_tx']
output features, with (srcip, sport, dstip, dport, proto) as groupby key:
number of entries/connections: count(entries) grouped
time_first_seen = ts1
time_last_seen = ts2
total time for flow (first_seen - current/end)
bytes_up (sum src_tx)
bytes_dw
throughput_up (total bytes/total time)
throughput_dw
first_state
last_state = (state)
state_(state): number of connections per state
cidr_src_ip: get supernet string using ipaddr
cidr_dst_ip
pvt_srcip: private IPs should be declared as private (True/False)
pvt_dstip
dport_80, dport_8080, dport_443, dport_22 (orthogonal)
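
A subset of the per-flow features above could be computed with a pandas groupby along these lines. This is a sketch under assumptions: the /24 prefix used for the supernet, the use of the standard-library `ipaddress` module, the helper names, and the sample records are all illustrative and not taken from the notebook.

```python
import ipaddress
import pandas as pd

FLOW_KEY = ["src_ip", "src_port", "dst_ip", "dst_port", "ip_protocol"]

def supernet(ip: str, prefix: int = 24) -> str:
    """Return the enclosing network of `ip` as a string (prefix length is an assumption)."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

def flow_features(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw records into one row per flow tuple."""
    out = df.groupby(FLOW_KEY).agg(
        num_conns=("ts", "count"),
        time_first_seen=("ts", "min"),
        time_last_seen=("ts", "max"),
        bytes_up=("src_tx", "sum"),
        bytes_dw=("dst_tx", "sum"),
    ).reset_index()
    out["duration"] = out["time_last_seen"] - out["time_first_seen"]
    out["cidr_src_ip"] = out["src_ip"].map(supernet)
    out["pvt_src_ip"] = out["src_ip"].map(lambda ip: ipaddress.ip_address(ip).is_private)
    # One indicator column per well-known destination port.
    for port in (80, 8080, 443, 22):
        out[f"dport_{port}"] = out["dst_port"] == port
    return out

# Hypothetical records using a subset of the raw input columns.
records = pd.DataFrame({
    "ts": [0.0, 2.0, 5.0],
    "ip_protocol": ["tcp", "tcp", "tcp"],
    "src_ip": ["192.168.1.5", "192.168.1.5", "8.8.8.8"],
    "src_port": [5000, 5000, 53],
    "dst_ip": ["93.184.216.34", "93.184.216.34", "192.168.1.5"],
    "dst_port": [443, 443, 6000],
    "src_tx": [100, 50, 10],
    "dst_tx": [1000, 500, 20],
})
features = flow_features(records)
print(features[["src_ip", "num_conns", "bytes_up", "cidr_src_ip", "pvt_src_ip"]])
```

Throughput and per-state counts are left out to keep the sketch short; throughput in particular needs a guard for flows whose duration is zero.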

Clustering

The first 2 PCA components covered nearly all of the variance.
k-means found clusters for k=5, but the features are not well suited to k-means.
DBSCAN performed best and separated outliers clearly when plotted. Without true labels, this could not be confirmed for every point.
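
The PCA-then-DBSCAN approach summarized above can be sketched with scikit-learn as follows. The scaling step, the `eps` and `min_samples` values, the function name `dbscan_outliers`, and the toy data are assumptions chosen for illustration; the notebook's actual parameters are not stated here.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def dbscan_outliers(X: np.ndarray, eps: float = 0.5, min_samples: int = 3) -> np.ndarray:
    """Scale, project onto 2 principal components, and return a mask of DBSCAN noise points."""
    scaled = StandardScaler().fit_transform(X)
    projected = PCA(n_components=2).fit_transform(scaled)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(projected)
    # DBSCAN labels points that belong to no cluster as -1.
    return labels == -1

# Toy data: nine tightly packed points plus one point far away from them.
grid = np.array([[i * 0.1, j * 0.1, 0.0] for i in range(3) for j in range(3)])
X = np.vstack([grid, [[10.0, 10.0, 10.0]]])
mask = dbscan_outliers(X)
print(np.flatnonzero(mask))
```

DBSCAN needs no preset number of clusters and marks isolated points as noise, which is why it lends itself to outlier detection on unlabeled flow features.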

anomalous-ip-detector.ipynb

group by src_ip, dst_ip on whole data set
sample the data every 10s to get 420 * (number of ip pairs) samples ~ 342093 samples
extract continuous and categorical features for ML: ['bytes_up', 'bytes_dw', 'num_conns', 'num_flows', 'num_dst_port', 'num_src_port', 'tcp_conns', 'udp_conns', 'cidr_src_ip', 'cidr_dst_ip', 'pvt_src_ip', 'pvt_dst_ip']
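
The 10s sampling per (src_ip, dst_ip) pair described above could be approximated as follows. This sketch assumes `ts` is a timestamp in seconds and uses a hypothetical helper `sample_ip_pairs`; only some of the listed features are computed, and the sample records are made up.

```python
import pandas as pd

def sample_ip_pairs(df: pd.DataFrame, window_s: int = 10) -> pd.DataFrame:
    """Aggregate records per (src_ip, dst_ip) pair within fixed windows of `window_s` seconds."""
    # Assign each record to a window index based on its timestamp (assumed to be in seconds).
    df = df.assign(window=(df["ts"] // window_s).astype(int))
    return df.groupby(["window", "src_ip", "dst_ip"]).agg(
        bytes_up=("src_tx", "sum"),
        bytes_dw=("dst_tx", "sum"),
        num_conns=("ts", "count"),
        num_dst_port=("dst_port", "nunique"),
        num_src_port=("src_port", "nunique"),
    ).reset_index()

# Hypothetical records: three connections in the first window, one in the second.
records = pd.DataFrame({
    "ts": [1, 4, 9, 12],
    "src_ip": ["10.0.0.1"] * 4,
    "dst_ip": ["10.0.0.2"] * 4,
    "src_port": [40000, 40001, 40002, 40003],
    "dst_port": [22, 23, 24, 80],
    "src_tx": [10, 10, 10, 500],
    "dst_tx": [0, 0, 0, 9000],
})
samples = sample_ip_pairs(records)
print(samples)
```

A pair that touches many distinct destination ports within a single window, as in the first window here, is the kind of sample a port-scan detector would want to see.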

Clustering

PCA again showed very good results, but required 4 components to cover 94% variance.
k-means with multiple values of k showed low error for k=4 and the lowest for k=8. Based on the data shape, rbf was applied.
DBSCAN and spectral clustering could not be run properly due to memory issues.

flow-analyzer.ipynb

for implementing and testing pandas_analysis.py
grouped batches of flows every 10s and extracted multiple numerical features
4 main features were used in testing for the initial design: [num_dst_ports, num_conns, bytes_up, bytes_dw]
analyzed each flow as a list and tested IQR against histograms of extracted features
group by flow (srcip, sport, dstip, dport, proto)

Results:

Almost certain that 192.168.100.96 had the worst behavior.
Multiple IPs had too much activity on non-mainstream ports
15 dst_ips used more than 100 unique ports each.
Final simple implementation based on statistical outlier detection:
only flow agg IQR with 2 features [num_dst_ports, num_conns]: Total Number of Alerts: 65752
only flow agg IQR with num_dst_ports: Total Number of Alerts: 14366
only flow agg IQR with num_conns: Total Number of Alerts: 60617
only flow agg IQR with all features (including total bytes_up and total bytes_dw size): Total Number of Alerts: 6492
All detectors in analyzer.py (basic, blacklist, port agg, ip_addr agg, flow agg): Total Number of Alerts: 92656
PCA showed good results in covering the variance of extracted features
Clustering worked better for features aggregated by flow tuples, not those aggregated by dst_ip, src_ip
Not yet tried machine learning while batch processing
Not yet tried unsupervised neural networks for this dataset due to lack of labels and guaranteed good data
Last step should be ensemble based trust - instead of adding alerts at each detector, calculate the probability from all filters
The IQR test is statistically sound, but there are much better algorithms (e.g., hosp) that could replace it for histogram-based outlier detection. Additionally, batch processing of flows was essentially stateless, but the number of new_ports is an important feature for detecting port scans, and it requires previous state. This is currently missing and should be added.
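
One possible way to add the missing new_ports state is a small tracker that persists across batches. This is a sketch only: the class name `NewPortTracker`, the (src_ip, dst_port) batch format, and the choice to key state by source IP are assumptions, and nothing like it exists in the current code.

```python
from collections import defaultdict

class NewPortTracker:
    """Remember which destination ports each source IP has used in earlier batches."""

    def __init__(self):
        self._seen = defaultdict(set)

    def update(self, batch):
        """Return {src_ip: number of ports first seen in this batch} and record them."""
        new_ports = defaultdict(set)
        for src_ip, dst_port in batch:
            if dst_port not in self._seen[src_ip]:
                new_ports[src_ip].add(dst_port)
        # Merge only after the scan so repeats within a batch are counted once.
        for src_ip, ports in new_ports.items():
            self._seen[src_ip] |= ports
        return {ip: len(ports) for ip, ports in new_ports.items()}

tracker = NewPortTracker()
first = tracker.update([("10.0.0.1", 80), ("10.0.0.1", 443)])
second = tracker.update([("10.0.0.1", 80), ("10.0.0.1", 22), ("10.0.0.1", 23)])
third = tracker.update([("10.0.0.1", 80)])
print(first, second, third)
```

The per-batch count could then be fed to the same IQR test as the other features. In a long-running process the per-IP sets would need some form of expiry to bound memory.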

shahifaqeer/netflow-anomaly-detector
