Skip to content

Latest commit

 

History

History
83 lines (62 loc) · 5.8 KB

File metadata and controls

83 lines (62 loc) · 5.8 KB

BigQuery Audit Log Anomany Detection

BQ Audit log anomanly detection is a tool which uses BigQuery Audit Logs, specifically in the AuditData format, for automated analysis of Big Data Cloud environments with a focus on BigQuery.

The tool summarized and aggregated BigQuery Audit Logs into metric values that provide insights into BigQuery jobs. Anomalous datapoints are determined by how similar they are to other datapoints in the audit logs.

To identify outliers, this tool showcases two methods:

  1. Outliers in groups: This method looks for a datapoint that differs signicantly from others within various groups (can be one of the following: 'principalEmail', 'eventName', 'projectId', 'dayOfWeek', 'hourOfDay'). This means that it identifies entities within the group which use BQ differently (more or less) from others.
  2. Time Series Analysis: Looking for outliers in periodic trends by looking at audit logs chronologically. This method has an underlying assumption that BigQuery usage has trends.

A sample of the outputs can be found in audit_log_anomaly_detection.ipynb.

Requirements

This demo showcases the outputs of the tool when identifying anomalous BQ usage. To run the tool locally, you would require the following requirements:

  • Access to Cloud Data Access Logs in BigQuery: Usually labelled cloudaudit_googleapis_com_data_access_* where * is a wildcard usually for the date. The Cloud Data Access Audit logs has to be in bigquery. To have access to BigQuery Audit Logs in BigQuery, you would have to create a Log Sink.
  • Same location of source and destination datasets: As per BigQuery, the created view's destination dataset has to be in the same GCP location as the source dataset.
  • Logs in AuditData format: This tool is build using the AuditData log format (refered to in the logs as V1) and requires that protopayload_auditlog.servicedata_v1_bigquery to be one of the fields in the Audit Logs.
  • IAM Access: Access to the the project, destination project, audit log dataset and destination audit log dataset.
  • Jupyter Notebook: Can be run using local jupyter notebook or an ai platform notebook of GCP.

Set Up

var.env file

Configure the location where the tool will extract the required data by changing the following variables in the var.env file.

The Source Project Dataset variables determine where the audit log data will be drawn from. The Destination View variables determine where the generated intermediate view needed for further analysis is created.

# Source Project Dataset
project_id       = "your-project-id"
data_project_id  = "your-project-id" (if different)
dataset_id       = "your-log-dataset"
audit_log_table  = "your_audit_log_data_access_*"
location         = "data_location"
audit_log_partitioned = True or False 

## Destination View 
destination_project_id = "your-project-id"
destination_dataset_id = "your-dataset-id"
summary_table_name     = "your-view-name"

How it works

Creating a log sink: Getting the BigQuery log Data

A detailed description of steps to collect BigQuery logs for your GCP Projects can be found here.

Summarizing Audit Logs

The first step is carrying out aggregations from BigQuery Data Access Audit logs. This is executed in viewsFactory.py. The aggregation involves using Job statistics to extract useful query and resource metric information.

Outliers in groups

The algorithm to identify outliers within a group is as follows:

  1. Aggregate the average for each individual in a selected group (eg: principalEmail).
  2. Calculate the mean and standard deviation across all the groups.
  3. User selects a sigma value which determines how stringent to look for outliers. For example, if you expect all emails in your bq environment to carrying about roughly equal jobs, you will set a lower sigma value.
  4. If the individual score more than sigma from the standard deviation, the point will be flagged as an outlier.

Time Series Analysis

The algorithm to identify time series outliers is as follows:

  1. Calculated the average totalTablesProcessed per query per hour.
  2. Carry out STL composition to decompose the time series into Seasonlity and Trend which is used to calculate an estimate:
    estimation = trend + seasonal
  3. Outliers are determined by finding the difference between the estimated reconstructed time series and the actual value (residual). Any residual above a sigma threshold (user defined) is flagged as an outlier in the plot below.

No code view

Voila allows notebooks to be run with a no-code view for increased usability. To view a fully functional version of this sample without the code blocks, append the end of the Jupyter URL with /voila.

For example:

Interactive Plots

The plots rendered are made with Plotly as interactive plot. However, the jupyter notebook rendered in github is static. To view the interactive plot in a live kernel notebook, follow the instruction to comment out the line for pio.renderers.default = github and uncomment the pio.renderers.default = jupyter

jupyter = 'plotly_mimetype+notebook_connected'
github = 'svg'

# Comment out the Github line and Uncomment the jupyter line when running in jupyter to see interactive graphs
pio.renderers.default = github
# pio.renderers.default = jupyter