
shahifaqeer/netflow-anomaly-detector


Network Anomaly Detector for Netflow Traces

By: Sarthak Grover

Date: 10/3/2019

Aim: Given a set of netflow network records, detect anomalous behavior (e.g., port scanning)

Source:

  • analyzer_clean.py: batches flows every 10s, and sends the batch for outlier detection. Checks outlier ip_addresses (src and dst combined) to issue alerts.
  • pandas_analysis.py: extract relevant continuous feature set and implement IQR outlier detection for a batch of flows. Returns the src and dst addresses that were outliers based on number of connections and destination ports in the batch.
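The batching and IQR steps described above can be sketched as follows. This is a minimal illustration in plain standard-library Python, not the code in analyzer_clean.py or pandas_analysis.py. The function names, the dict-based flow records, the 1.5 multiplier, and the use of only the upper fence are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import quantiles

BATCH_SECONDS = 10  # batch length stated in this README


def batch_flows(flows, batch_seconds=BATCH_SECONDS):
    """Group flow dicts into consecutive time windows keyed by window index."""
    batches = defaultdict(list)
    for flow in flows:
        batches[int(flow["ts"] // batch_seconds)].append(flow)
    return dict(batches)


def per_ip_features(batch):
    """Count connections and distinct destination ports per IP (src and dst combined)."""
    conns = defaultdict(int)
    ports = defaultdict(set)
    for flow in batch:
        for ip in (flow["src_ip"], flow["dst_ip"]):
            conns[ip] += 1
            ports[ip].add(flow["dst_port"])
    return {ip: {"num_conns": conns[ip], "num_dst_ports": len(ports[ip])}
            for ip in conns}


def iqr_outliers(values, k=1.5):
    """Return keys whose value lies above Q3 + k * IQR (upper fence only, an assumption)."""
    if len(values) < 4:  # too few points for meaningful quartiles
        return set()
    q1, _, q3 = quantiles(list(values.values()), n=4)
    upper = q3 + k * (q3 - q1)
    return {key for key, value in values.items() if value > upper}


def detect(batch):
    """Flag IPs that are outliers on either feature within one batch."""
    feats = per_ip_features(batch)
    flagged = set()
    for name in ("num_conns", "num_dst_ports"):
        flagged |= iqr_outliers({ip: f[name] for ip, f in feats.items()})
    return flagged
```

Because each address is counted whether it appears as source or destination, both the scanning host and its target would be flagged in a batch dominated by a port scan.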

Additional Code:

  • analyzer.py: Original (unclean) code for outlier detection in multiple steps. Partially implemented and untested.
    0. Check basics: packet lengths, local IPs, connection state, protocols, etc.
    1. Check src ip and dst ip against a blacklist set in memory
    2. Check dst_port and index first use for IP address + aggregate bytes
    3. Check IP address and number of ports, protocols, bytes
    4. Aggregate flows for (src_ip, dst_ip) pair every T sec and detect outliers statistically
    5. TODO: cluster input in batches using 'rbf'
    6. TODO: train RNN based LSTM with good data, predict output
    7. TODO: aggregate output of all detectors for each flow and produce trustworthy probability
  • blacklist_update.py: code to update a static blacklist of malicious IP addresses
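Steps 1 and 7 of the analyzer.py outline (an in-memory blacklist check, and combining detector outputs into a single probability) might look like the sketch below. It is only an illustration under stated assumptions. The function names are hypothetical, and the noisy-OR combination rule is one possible choice that treats detectors as independent. It is not something analyzer.py is known to implement.

```python
def blacklist_score(flow, blacklist):
    """Return 1.0 if either endpoint is in the in-memory blacklist set, else 0.0."""
    hit = flow["src_ip"] in blacklist or flow["dst_ip"] in blacklist
    return 1.0 if hit else 0.0


def combine_scores(scores):
    """Noisy-OR: probability that at least one detector is right.

    Assumes each score is in [0, 1] and that detectors are independent;
    both are simplifying assumptions for this sketch.
    """
    p_clean = 1.0
    for score in scores:
        p_clean *= 1.0 - score
    return 1.0 - p_clean
```

For example, two detectors that each report 0.5 combine to 0.75, while a single blacklist hit drives the combined score to 1.0 regardless of the other detectors.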

Data Files:

  • data.csv: test file netflow data
  • blacklist_ips.csv: offline csv of bad IP addresses generated by blacklist_update.py
  • df_flow_features.pkl: pandas object with features extracted from the whole dataset grouped by flow tuple for ML
  • df_src_dst_sampled_10s.pkl: pandas object with complete dataset sampled to 10s intervals for testing ML

Jupyter-Notebooks

netflow-data-scratch-file.ipynb

  • explore netflow data
  • group by given features and explore distribution and statistics
  • loaded full chunk into memory for exploration
  • histograms of destination port usage
  • grouping by srcip-dstip
  • grouping by flow (srcip, sport, dstip, dport, proto)

anomaly-detectors.ipynb

  • feature extraction after grouping by flow
  • input raw features: ['ts','ip_protocol','state','src_ip','src_port','dst_ip','dst_port','src_tx','dst_tx']
  • output features, with (srcip, sport, dstip, dport, proto) as groupby key:
    • number of entries/connections: count(entries) grouped
    • time_first_seen = ts1
    • time_last_seen = ts2
    • total time for flow (first_seen - current/end)
    • bytes_up (sum src_tx)
    • bytes_dw
    • throughput_up (total bytes/total time)
    • throughput_dw
    • first_state, last_state = (state)
    • state_(state): number of connections per state
    • cidr_src_ip: get supernet string using ipaddr
    • cidr_dst_ip
    • pvt_srcip: private IPs should be declared as private (True/False)
    • pvt_dstip
    • dport_80, dport_8080, dport_443, dport_22 (orthogonal)
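A subset of these per-flow features can be sketched with the standard library alone. This is an illustrative sketch, not the notebook's code. It assumes records are dicts using the raw feature names listed above, that bytes_dw is the sum of dst_tx, and that the supernet is a /24 (the `prefix_len` parameter is hypothetical). It also uses Python's built-in `ipaddress` module rather than the `ipaddr` package mentioned above.

```python
import ipaddress
from collections import defaultdict

FLOW_KEY = ("src_ip", "src_port", "dst_ip", "dst_port", "ip_protocol")


def flow_features(records, prefix_len=24):
    """Aggregate raw records into one feature dict per flow tuple."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in FLOW_KEY)].append(rec)

    features = {}
    for key, rows in groups.items():
        rows.sort(key=lambda r: r["ts"])
        first, last = rows[0], rows[-1]
        duration = last["ts"] - first["ts"]
        bytes_up = sum(r["src_tx"] for r in rows)
        bytes_dw = sum(r["dst_tx"] for r in rows)  # assumed mapping
        src = ipaddress.ip_address(first["src_ip"])
        dst = ipaddress.ip_address(first["dst_ip"])
        features[key] = {
            "num_conns": len(rows),
            "time_first_seen": first["ts"],
            "time_last_seen": last["ts"],
            "duration": duration,
            "bytes_up": bytes_up,
            "bytes_dw": bytes_dw,
            # Avoid dividing by zero for single-record flows.
            "throughput_up": bytes_up / duration if duration > 0 else 0.0,
            "throughput_dw": bytes_dw / duration if duration > 0 else 0.0,
            "first_state": first["state"],
            "last_state": last["state"],
            "cidr_src_ip": str(ipaddress.ip_network(f"{src}/{prefix_len}", strict=False)),
            "cidr_dst_ip": str(ipaddress.ip_network(f"{dst}/{prefix_len}", strict=False)),
            "pvt_src_ip": src.is_private,
            "pvt_dst_ip": dst.is_private,
            "dport_443": key[3] == 443,
        }
    return features
```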

Clustering

  • 2 components of PCA basically covered all the variance completely
  • k-means successfully found clusters for k=5, but features are not good for kmeans
  • DBSCAN performed best and separated outliers clearly on plotting. Without true labels, this could not be confirmed for all points.

anomalous-ip-detector.ipynb

  • group by src_ip, dst_ip on whole data set
  • sample the data every 10s to get 420 * (number of ip pairs) samples ~ 342093 samples
  • extract continuous and categorical features for ML: ['bytes_up', 'bytes_dw', 'num_conns', 'num_flows', 'num_dst_port', 'num_src_port', 'tcp_conns', 'udp_conns', 'cidr_src_ip', 'cidr_dst_ip', 'pvt_src_ip', 'pvt_dst_ip']

Clustering

  • PCA again showed very good results, but required 4 components to cover 94% variance.
  • k-means with multiple values showed low error for k=4 and lowest for k=8. Based on the data shape, rbf was applied.
  • DBSCAN and spectral clustering couldn't work properly due to memory issues.

flow-analyzer.ipynb

  • for implementing and testing pandas_analysis.py
  • grouped batches of flows every 10s and extracted multiple numerical features
  • 4 main features were used in testing for the initial design: [num_dst_ports, num_conns, bytes_up, bytes_dw]
  • analyzed each flow as a list and tested IQR against histograms of extracted features
  • group by flow (srcip, sport, dstip, dport, proto)

Results:

  • Almost certain that 192.168.100.96 had the worst behavior.
  • Multiple IPs had too much activity on non-mainstream ports
  • 15 dst_ips used more than 100 unique ports each.
  • Final simple implementation based on statistical outlier detection:
    • only flow agg IQR with 2 features [num_dst_ports, num_conns]: Total Number of Alerts: 65752
    • only flow agg IQR with num_dst_ports: Total Number of Alerts: 14366
    • only flow agg IQR with num_conns: Total Number of Alerts: 60617
    • only flow agg IQR with all features (including total bytes_up and total bytes_dw size): Total Number of Alerts: 6492
    • All detectors in analyzer.py (basic, blacklist, port agg, ip_addr agg, flow agg): Total Number of Alerts: 92656
  • PCA showed good results in covering the variance of extracted features
  • Clustering worked better for features aggregated by flow tuples, not those aggregated by dst_ip, src_ip
  • Not yet tried machine learning while batch processing
  • Not yet tried unsupervised neural networks for this dataset due to lack of labels and guaranteed good data
  • Last step should be ensemble based trust - instead of adding alerts at each detector, calculate the probability from all filters
  • The IQR test is statistically sound, but there are much better algorithms (e.g., hosp) that can replace it for histogram-based outlier detection. Additionally, batch processing of flows was essentially stateless, but the number of new_ports is an important feature for detecting port scans, and it requires state from previous batches. This is currently missing and should be added.
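The missing new_ports state could be carried across batches along the lines of the sketch below. This is a hypothetical illustration, not code from the repository. The `NewPortTracker` class and its keying by src_ip are assumptions. The seen-port sets also grow without bound, so a real implementation would likely need expiry.

```python
from collections import defaultdict


class NewPortTracker:
    """Remember destination ports already seen per source IP across batches."""

    def __init__(self):
        self._seen = defaultdict(set)

    def update(self, batch):
        """Return {src_ip: count of destination ports first seen in this batch}."""
        new_counts = defaultdict(int)
        for flow in batch:
            src, port = flow["src_ip"], flow["dst_port"]
            if port not in self._seen[src]:
                self._seen[src].add(port)
                new_counts[src] += 1
        return dict(new_counts)
```

The per-batch counts returned by `update` could then be fed to the same outlier test as the other features, so a host that keeps touching previously unseen ports stands out even when its per-batch connection count looks normal.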

About

Simple anomaly detection for flow records
