LSTM-AD SageMaker Algorithm

The Time Series Anomaly Detection (LSTM-AD) Algorithm from AWS Marketplace detects anomalies in time series with the Long Short-Term Memory Network for Anomaly Detection (LSTM-AD). It implements both training and inference from CSV data and supports both CPU and GPU instances. The training and inference Docker images were built by extending the PyTorch 2.1.0 Python 3.10 SageMaker containers.

Model Description

The LSTM-AD model predicts the (possibly multivariate) time series with a stacked LSTM network. The model parameters are learned on a training set containing only normal data (i.e. without anomalies) by minimizing the mean squared error (MSE) between the actual and predicted values of the time series. After the model has been trained, a multivariate Gaussian distribution is fitted to the model's prediction errors on an independent validation set (also without anomalies) using Maximum Likelihood Estimation (MLE).

At inference time, the model predicts the values of all the time series (which can now include anomalies) at each time step and calculates the likelihood of the model's prediction errors under the fitted multivariate Gaussian distribution. The computed Gaussian likelihood is then used as a normality score: the lower the Gaussian likelihood at a given time step, the more likely the time step is to be an anomaly.

LSTM-AD architecture (source: ISBN 978-287587014-8)
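
The sketch below illustrates the kind of stacked LSTM predictor described above. It is a minimal PyTorch example, not the code used inside the algorithm's containers; the class name and constructor arguments are hypothetical and only mirror the hyperparameters listed further down.

```python
import torch
from torch import nn

class LSTMPredictor(nn.Module):
    """Minimal stacked-LSTM predictor: maps a window of the multivariate
    time series to the next prediction-length values of every series."""

    def __init__(self, num_series, hidden_size, num_layers, prediction_length, dropout=0.0):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=num_series,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout,
            batch_first=True,
        )
        # project the last hidden state onto the future values of all series
        self.head = nn.Linear(hidden_size, prediction_length * num_series)
        self.prediction_length = prediction_length
        self.num_series = num_series

    def forward(self, x):
        # x: (batch, context_length, num_series)
        output, _ = self.lstm(x)
        last = output[:, -1, :]  # (batch, hidden_size)
        pred = self.head(last)   # (batch, prediction_length * num_series)
        return pred.view(-1, self.prediction_length, self.num_series)

# training minimizes the MSE between actual and predicted values
model = LSTMPredictor(num_series=2, hidden_size=32, num_layers=2, prediction_length=100)
loss_fn = nn.MSELoss()
```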

Model Resources: [Paper]

SageMaker Algorithm Description

The algorithm implements the model as described above, with one difference: the normality scores are defined using the Gaussian log-likelihood instead of the likelihood.
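
As a rough illustration of this scoring step, the following sketch fits a multivariate Gaussian to prediction errors and uses its log-likelihood as the normality score. It relies on SciPy only for convenience; the function names and the allow_singular choice are assumptions, not the algorithm's actual implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_error_distribution(errors_valid):
    """Fit a multivariate Gaussian to prediction errors from anomaly-free
    data by maximum likelihood (sample mean and covariance).
    errors_valid: array of shape (num_steps, num_features)."""
    mu = errors_valid.mean(axis=0)
    cov = np.cov(errors_valid, rowvar=False)
    return multivariate_normal(mean=mu, cov=cov, allow_singular=True)

def normality_scores(errors_test, distribution):
    """Gaussian log-likelihood of each time step's prediction error:
    lower scores indicate more anomalous time steps."""
    return distribution.logpdf(errors_test)
```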

Notes:

  • The algorithm splits the training data into two independent subsets: one subset is used for training the LSTM model, while the other subset is used for calculating the prediction errors to which the parameters of the multivariate Gaussian distribution are fitted. The (optional) validation data accepted by the algorithm is only used for scoring the model, i.e. for calculating the mean squared error (MSE) and mean absolute error (MAE) between the actual values of the time series in the validation dataset and their predicted values generated by the previously trained LSTM model.

  • The algorithm views the multivariate time series as different measurements on the same system. An anomaly is defined as abnormal behavior of the entire system, not of a single individual measurement. As a result, the algorithm outputs only one normality score for each time step, representing the likelihood that the overall system is in a normal state at that time step. The algorithm can also be applied to a univariate time series (i.e. to a single time series). Consider fitting a separate model to each individual time series if the time series are not similar or related to each other, or if you need to identify the anomalies in each time series separately.

Training

The training algorithm has two input data channels: training and validation. The training channel is mandatory, while the validation channel is optional.

The training and validation datasets should be provided as CSV files and should only contain normal data (i.e. without anomalies). Each column of the CSV file represents a time series, while each row represents a time step. All the time series should have the same length and should not contain missing values. The CSV file should not contain any index column or column headers. See the sample input files train.csv and valid.csv.
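
As a hedged example, the snippet below builds a pair of anomaly-free CSV files in the expected layout (one column per time series, one row per time step, no headers, no index). The synthetic sine/cosine data and the file names are illustrative only.

```python
import numpy as np
import pandas as pd

# two synthetic normal (anomaly-free) time series of equal length
data = pd.DataFrame({
    "series_1": np.sin(np.linspace(0, 50, 5000)),
    "series_2": np.cos(np.linspace(0, 50, 5000)),
})

# the CSV files must contain neither column headers nor an index column
data.iloc[:4000].to_csv("train.csv", index=False, header=False)
data.iloc[4000:].to_csv("valid.csv", index=False, header=False)
```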

See notebook.ipynb for an example of how to launch a training job.
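
A minimal sketch of such a training job with the SageMaker Python SDK is shown below. The algorithm ARN, S3 paths, instance type and hyperparameter values are placeholders; notebook.ipynb remains the authoritative example.

```python
import sagemaker
from sagemaker.algorithm import AlgorithmEstimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()

estimator = AlgorithmEstimator(
    algorithm_arn="<algorithm-arn>",   # ARN of the Marketplace algorithm
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
    hyperparameters={
        "context-length": 200,
        "prediction-length": 100,
        "sequence-stride": 1,
        "num-layers": 2,
        "hidden-size": 64,
        "dropout": 0.1,
        "lr": 0.001,
        "batch-size": 32,
        "epochs": 100,
    },
)

# the training channel is mandatory, the validation channel is optional
estimator.fit({
    "training": "s3://<bucket>/lstm-ad/train.csv",
    "validation": "s3://<bucket>/lstm-ad/valid.csv",
})
```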

Distributed Training

The algorithm supports multi-GPU training on a single instance, which is implemented through torch.nn.DataParallel. The algorithm does not support multi-node (or distributed) training across multiple instances.
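
For reference, torch.nn.DataParallel wraps a model so that each batch is split across the GPUs of a single instance, which is the mechanism named above; the toy model below is only there to make the snippet self-contained.

```python
import torch
from torch import nn

model = nn.LSTM(input_size=2, hidden_size=64, num_layers=2, batch_first=True)

# replicate the model across all visible GPUs on the instance and split each
# batch among them; single-node only, unlike DistributedDataParallel
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```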

Hyperparameters

The training algorithm takes as input the following hyperparameters:

  • context-length: int. The length of the input sequences.
  • prediction-length: int. The length of the output sequences.
  • sequence-stride: int. The period between consecutive output sequences.
  • num-layers: int. The number of LSTM layers.
  • hidden-size: int. The number of hidden units of each LSTM layer.
  • dropout: float. The dropout rate applied after each LSTM layer.
  • lr: float. The learning rate used for training.
  • batch-size: int. The batch size used for training.
  • epochs: int. The number of training epochs.
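
The sketch below shows, under simplified assumptions, how context-length, prediction-length and sequence-stride could interact when slicing a series into input/output sequence pairs; the helper function is hypothetical and only meant to make the definitions concrete.

```python
def make_windows(series, context_length, prediction_length, sequence_stride):
    """Slice a time series (list or array of time steps) into
    (input sequence, output sequence) pairs."""
    inputs, outputs = [], []
    start = 0
    while start + context_length + prediction_length <= len(series):
        inputs.append(series[start:start + context_length])
        outputs.append(series[start + context_length:start + context_length + prediction_length])
        start += sequence_stride  # period between consecutive output sequences
    return inputs, outputs

# with context-length 200 and prediction-length 100, the first pair uses
# steps 1-200 as input and steps 201-300 as output
inputs, outputs = make_windows(list(range(1, 1001)), 200, 100, 100)
```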

Metrics

The training algorithm logs the following metrics:

  • train_mse: float. Training mean squared error.
  • train_mae: float. Training mean absolute error.

If the validation channel is provided, the training algorithm also logs the following additional metrics:

  • valid_mse: float. Validation mean squared error.
  • valid_mae: float. Validation mean absolute error.

See notebook.ipynb for an example of how to launch a hyperparameter tuning job.
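
A minimal sketch of a tuning job with the SageMaker Python SDK is given below; it assumes the estimator from the training example, treats valid_mse as the objective (which requires the validation channel), and uses placeholder ranges and S3 paths.

```python
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

tuner = HyperparameterTuner(
    estimator=estimator,                 # the AlgorithmEstimator from the training example
    objective_metric_name="valid_mse",   # logged only when the validation channel is provided
    objective_type="Minimize",
    hyperparameter_ranges={
        "hidden-size": IntegerParameter(32, 128),
        "lr": ContinuousParameter(1e-4, 1e-2),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)

tuner.fit({
    "training": "s3://<bucket>/lstm-ad/train.csv",
    "validation": "s3://<bucket>/lstm-ad/valid.csv",
})
```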

Inference

The inference algorithm takes as input a CSV file containing the time series. Each column of the CSV file represents a time series, while each row represents a time step. The CSV file should not contain any index column or column headers. All the time series should have the same length and should not contain missing values. See the sample input file test.csv.

The inference algorithm outputs the normality scores and the predicted values of the time series. The normality scores are included in the first column, while the predicted values of the time series are included in the subsequent columns. See the sample output files batch_predictions.csv and real_time_predictions.csv.

Note: The model predicts the time series sequence by sequence. For instance, if the context-length is set equal to 200, and the prediction-length is set equal to 100, then the first 200 data points (from 1 to 200) are used as input to predict the next 100 data points (from 201 to 300). As a result, the algorithm does not return the normality scores and predicted values of the first 200 data points, which are set to missing in the output CSV file.
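
For instance, the output could be post-processed along these lines (a hedged sketch; the column names assigned here are arbitrary, and the leading rows without predictions are simply dropped):

```python
import pandas as pd

# no headers: the first column is the normality score, the remaining columns
# are the predicted values of each time series
predictions = pd.read_csv("batch_predictions.csv", header=None)
predictions.columns = ["normality_score"] + [f"prediction_{i}" for i in range(1, predictions.shape[1])]

# the first context-length rows have no scores or predictions (see note above)
predictions = predictions.dropna()

# the lowest scores correspond to the most anomalous time steps
anomalies = predictions["normality_score"].nsmallest(10)
```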

See notebook.ipynb for an example of how to launch a batch transform job.
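
A minimal batch transform sketch with the SageMaker Python SDK, assuming the fitted estimator from the training example and placeholder S3 paths:

```python
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<bucket>/lstm-ad/output/",
)

transformer.transform(
    data="s3://<bucket>/lstm-ad/test.csv",
    content_type="text/csv",
)
transformer.wait()
```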

Endpoints

The algorithm supports only real-time inference endpoints. The inference image is too large to be uploaded to a serverless inference endpoint.

See notebook.ipynb for an example of how to deploy the model to an endpoint, invoke the endpoint and process the response.
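
A minimal sketch of deployment and invocation with the SageMaker Python SDK is shown below; the instance type and file paths are placeholders, and the CSV serializer/deserializer pair is one reasonable choice rather than a documented requirement.

```python
import pandas as pd
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

# deploy the trained model behind a real-time inference endpoint
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    serializer=CSVSerializer(),
    deserializer=CSVDeserializer(),
)

# send the time series as CSV rows; the response contains the normality scores
# in the first column and the predicted values in the remaining columns
payload = pd.read_csv("test.csv", header=None).values
response = predictor.predict(payload)

# clean up the endpoint when done
predictor.delete_endpoint()
```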

Additional Resources: [Sample Notebook] [Blog Post]

References

  • P. Malhotra, L. Vig, G. Shroff and P. Agarwal, "Long Short Term Memory Networks for Anomaly Detection in Time Series," in ESANN, vol. 2015, p. 89, 2015, ISBN 978-287587014-8.