By: Sarthak Grover
Date: 10/3/2019
Aim: Given a set of netflow network records, detect anomalous behavior (e.g., port scanning)
- analyzer_clean.py: batches flows every 10s, and sends the batch for outlier detection. Checks outlier ip_addresses (src and dst combined) to issue alerts.
- pandas_analysis.py: extract relevant continuous feature set and implement IQR outlier detection for a batch of flows. Returns the src and dst addresses that were outliers based on number of connections and destination ports in the batch.
- analyzer.py: Original (unclean) code for outlier detection in multiple steps. Partially implemented and untested.
0. Check basics: packet lengths, local IPs, connection state, protocols, etc.
- Check src ip and dst ip against a blacklist set in memory
- Check dst_port and index first use for IP address + aggregate bytes
- Check IP address and number of ports, protocols, bytes
- Aggregate flows for (src_ip, dst_ip) pair every T sec and detect outliers statistically
- TODO: cluster input in batches using 'rbf'
- TODO: train RNN based LSTM with good data, predict output
- TODO: aggregate output of all detectors for each flow and produce trustworthy probability
- blacklist_update.py: code to update a static blacklist of malicious IP addresses
- data.csv: test file netflow data
- blacklist_ips.csv: offline csv of bad IP addresses generated by blacklist_update.py
- df_flow_features.pkl: pandas object with features extracted from the whole dataset grouped by flow tuple for ML
- df_src_dst_sampled_10s.pkl: pandas object with complete dataset sampled to 10s intervals for testing ML
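A minimal sketch of the batch IQR check described for pandas_analysis.py above, not the repository's actual code. It assumes a pandas DataFrame with src_ip, dst_ip and dst_port columns. The function name iqr_outliers, the 1.5 multiplier and the choice to flag only the upper fence (Q3 + 1.5 * IQR) are assumptions made for illustration.

```python
import pandas as pd


def iqr_outliers(batch, features=("num_conns", "num_dst_ports"), k=1.5):
    """Return the set of src and dst addresses that are upper-fence IQR outliers.

    Hypothetical helper: column names and the k=1.5 fence are assumptions.
    """
    flagged = set()
    for ip_col in ("src_ip", "dst_ip"):
        # Per-address features for this batch: record count and distinct dst ports.
        per_ip = batch.groupby(ip_col).agg(
            num_conns=("dst_port", "count"),
            num_dst_ports=("dst_port", "nunique"),
        )
        for feature in features:
            q1 = per_ip[feature].quantile(0.25)
            q3 = per_ip[feature].quantile(0.75)
            upper = q3 + k * (q3 - q1)
            # Keep only addresses strictly above the upper fence.
            flagged.update(per_ip[per_ip[feature] > upper].index)
    return flagged
```

Passing features=("num_dst_ports",) restricts the check to a single feature, which is one way to reproduce the kind of per-feature comparison listed in the results.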
- explore netflow data
- group by given features and explore distribution and statistics
- loaded full chunk into memory for exploration
- histograms of destination port usage
- grouping by srcip-dstip
- grouping by flow (srcip, sport, dstip, dport, proto)
- feature extraction after grouping by flow
- input raw features: ['ts','ip_protocol','state','src_ip','src_port','dst_ip','dst_port','src_tx','dst_tx']
- output features, with (srcip, sport, dstip, dport, proto) as groupby key:
  - number of entries/connections: count(entries) grouped
  - time_first_seen = ts1
  - time_last_seen = ts2
  - total time for flow (first_seen - current/end)
  - bytes_up (sum src_tx)
  - bytes_dw
  - throughput_up (total bytes/total time)
  - throughput_dw
  - first_state
  - last_state = (state)
  - state_(state): number of connections per state
  - cidr_src_ip: get supernet string using ipaddr
  - cidr_dst_ip
  - pvt_srcip: private IPs should be declared as private (True/False)
  - pvt_dstip
  - dport_80, dport_8080, dport_443, dport_22 (orthogonal)
- 2 PCA components covered essentially all of the variance
- k-means successfully found clusters for k=5, but the features are not well suited to k-means
- DBSCAN performed best and separated outliers clearly when plotted. Without true labels, this could not be confirmed for all points.
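The per-flow feature extraction listed above can be sketched with a pandas groupby. This is an illustrative sketch under assumptions, not the code that produced df_flow_features.pkl: it assumes ts is a numeric timestamp in seconds, covers only a subset of the listed features, and uses the standard-library ipaddress module for the private-address flags instead of the ipaddr package mentioned above. The function name flow_features and the duration column are hypothetical.

```python
import ipaddress

import pandas as pd

# Groupby key: the flow 5-tuple, using the raw input column names.
FLOW_KEY = ["src_ip", "src_port", "dst_ip", "dst_port", "ip_protocol"]


def flow_features(flows):
    """Aggregate raw flow records into one feature row per flow tuple."""
    # Sort by time so that "first"/"last" pick the earliest/latest state.
    grouped = flows.sort_values("ts").groupby(FLOW_KEY)
    feats = grouped.agg(
        num_conns=("ts", "count"),
        time_first_seen=("ts", "min"),
        time_last_seen=("ts", "max"),
        bytes_up=("src_tx", "sum"),
        bytes_dw=("dst_tx", "sum"),
        first_state=("state", "first"),
        last_state=("state", "last"),
    ).reset_index()
    feats["duration"] = feats["time_last_seen"] - feats["time_first_seen"]
    # Single-record flows have zero duration; leave their throughput as NaN.
    duration = feats["duration"].where(feats["duration"] > 0)
    feats["throughput_up"] = feats["bytes_up"] / duration
    feats["throughput_dw"] = feats["bytes_dw"] / duration
    # Private-address flags (True/False) for both endpoints.
    for col in ("src_ip", "dst_ip"):
        feats["pvt_" + col] = feats[col].map(
            lambda ip: ipaddress.ip_address(ip).is_private
        )
    return feats
```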
- group by src_ip, dst_ip on whole data set
- sample the data every 10s to get 420 * (number of ip pairs) samples ~ 342093 samples
- extract continuous and categorical features for ML: ['bytes_up', 'bytes_dw', 'num_conns', 'num_flows', 'num_dst_port', 'num_src_port', 'tcp_conns', 'udp_conns', 'cidr_src_ip', 'cidr_dst_ip', 'pvt_src_ip', 'pvt_dst_ip']
- PCA again showed very good results, but required 4 components to cover 94% variance.
- k-means with multiple values showed low error for k=4 and lowest for k=8. Based on the data shape, rbf was applied.
- DBSCAN and spectral clustering couldn't work properly due to memory issues.
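One way to sketch the 10s sampling per (src_ip, dst_ip) pair is a groupby with pd.Grouper. This is an assumption-laden illustration rather than the code behind df_src_dst_sampled_10s.pkl: it assumes ts holds epoch seconds, computes only some of the listed features, and emits only the intervals in which a pair was actually seen, so it does not pad every pair out to a fixed number of samples. The function name sample_pairs is hypothetical.

```python
import pandas as pd


def sample_pairs(flows, interval="10s"):
    """Aggregate flow records per (src_ip, dst_ip) pair in fixed time bins."""
    # Convert epoch seconds to datetimes so pd.Grouper can bin them.
    flows = flows.assign(ts=pd.to_datetime(flows["ts"], unit="s"))
    keys = ["src_ip", "dst_ip", pd.Grouper(key="ts", freq=interval)]
    return (
        flows.groupby(keys)
        .agg(
            num_conns=("dst_port", "count"),
            num_dst_port=("dst_port", "nunique"),
            num_src_port=("src_port", "nunique"),
            bytes_up=("src_tx", "sum"),
            bytes_dw=("dst_tx", "sum"),
        )
        .reset_index()
    )
```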
- for implementing and testing pandas_analysis.py
- grouped batches of flows every 10s and extracted multiple numerical features
- 4 main features were used in testing for initial design: [num_dst_ports, num_conns, bytes_up, bytes_dw]
- analyzed each flow as a list and tested IQR against histograms of extracted features
- group by flow (srcip, sport, dstip, dport, proto)
- Almost certain that 192.168.100.96 had the worst behavior.
- Multiple IPs had too much activity on non-mainstream ports
- 15 dst_ips used more than 100 unique ports each.
- Final simple implementation based on statistical outlier detection:
- only flow agg IQR with 2 features [num_dst_ports, num_conns]: Total Number of Alerts: 65752
- only flow agg IQR with num_dst_ports: Total Number of Alerts: 14366
- only flow agg IQR with num_conns: Total Number of Alerts: 60617
- only flow agg IQR with all features (including total bytes_up and total bytes_dw size): Total Number of Alerts: 6492
- All detectors in analyzer.py (basic, blacklist, port agg, ip_addr agg, flow agg): Total Number of Alerts: 92656
- PCA showed good results in covering the variance of extracted features
- Clustering worked better for features aggregated by flow tuples, not those aggregated by dst_ip, src_ip
- Not yet tried machine learning while batch processing
- Not yet tried unsupervised neural networks for this dataset due to lack of labels and guaranteed good data
- Last step should be ensemble based trust - instead of adding alerts at each detector, calculate the probability from all filters
- IQR test is statistically sound, but there are much better algorithms (e.g., hosp) that can replace it for histogram-based outlier detection. Additionally, batch processing of flows was essentially stateless, but the number of new_ports, an important feature for detecting port scans, requires previous state. This is currently missing and should be added.
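The missing new_ports state could be kept in a small in-memory tracker that persists across batches. The sketch below is hypothetical and is not part of the existing analyzers: the class name NewPortTracker, its update method and the (src_ip, dst_port) input shape are all assumptions. The seen-port sets also grow without bound, so a real deployment would likely need some form of expiry.

```python
from collections import defaultdict


class NewPortTracker:
    """Remember which destination ports each source address has already used."""

    def __init__(self):
        # src_ip -> set of dst_ports seen in earlier batches.
        self._seen = defaultdict(set)

    def update(self, batch):
        """Count ports per src_ip that were never seen before this batch.

        batch is an iterable of (src_ip, dst_port) pairs. Returns a dict
        mapping src_ip to its number of new ports; addresses with no new
        ports are omitted.
        """
        new_ports = defaultdict(set)
        for src_ip, dst_port in batch:
            if dst_port not in self._seen[src_ip]:
                new_ports[src_ip].add(dst_port)
        # Fold this batch into the state only after counting.
        for src_ip, ports in new_ports.items():
            self._seen[src_ip] |= ports
        return {ip: len(ports) for ip, ports in new_ports.items()}
```

The per-batch counts returned by update could then be fed to the same outlier test as the other features.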