
Commit 80bac4a

Annual update, a lot of information added and updated, resolves #39, resolves #51

YKatser committed Aug 9, 2023
1 parent 6a3e1e8 commit 80bac4a
Showing 18 changed files with 1,620 additions and 1,565 deletions.
67 changes: 8 additions & 59 deletions .gitignore
@@ -1,16 +1,18 @@
.DS_Store
data/.DS_Store
src/.DS_Store
docs/.DS_Store
notebooks/.DS_Store
notebooks/venv/
notebooks/*.h5

# Byte-compiled / optimized / DLL files
__pycache__/
src/__pycache__/
notebooks/__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
@@ -56,31 +58,10 @@ coverage.xml
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints
notebooks/.ipynb_checkpoints
src/.ipynb_checkpoints

# IPython
profile_default/
@@ -89,46 +70,14 @@ ipython_config.py
# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
venv.bak/
84 changes: 41 additions & 43 deletions README.md
@@ -1,28 +1,28 @@
![skab](docs/pictures/skab.png)

❗️❗️❗️**The testbed is under repair right now. Unfortunately, we can't tell exactly when it will be ready or when we will be able to resume data collection. Updates will be posted in this repository. Sorry for the delay.**
🛠🛠🛠**The testbed is under repair right now. Unfortunately, we can't tell exactly when it will be ready or when we will be able to resume data collection. Updates will be posted in this repository. Sorry for the delay.**

❗️❗️❗️The current version of SKAB (v0.9) contains 34 datasets with collective anomalies. But the update to v1.0 will contain 300+ additional files with point and collective anomalies. It will make SKAB one of the largest changepoint-containing benchmarks, especially in the technical field.

# About SKAB [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/waico/SKAB/graphs/commit-activity) [![DOI](https://img.shields.io/badge/DOI-10.34740/kaggle/dsv/1693952-blue.svg)](https://doi.org/10.34740/KAGGLE/DSV/1693952) [![License: GPL v3.0](https://img.shields.io/badge/License-GPL%20v3.0-green.svg)](https://www.gnu.org/licenses/gpl-3.0.html)
We propose the [Skoltech](https://www.skoltech.ru/en) Anomaly Benchmark (SKAB), designed for evaluating anomaly detection algorithms. SKAB covers two main problems (each with its own anomaly markup):
1. Outlier detection (anomalies considered and marked up as single-point anomalies);
2. Changepoint detection (anomalies considered and marked up as collective anomalies).
1. Outlier detection (anomalies considered and marked up as single-point anomalies)
2. Changepoint detection (anomalies considered and marked up as collective anomalies)

SKAB consists of the following artifacts:
1. [Datasets](#datasets);
2. [Leaderboards](#leaderboards) for outlier detection and changepoint detection problems;
3. Python [modules](https://github.com/waico/SKAB/blob/master/utils/evaluating.py) for algorithms’ evaluation;
4. Python [notebooks](#notebooks) with anomaly detection algorithms.
1. [Datasets](#datasets)
2. [Leaderboards](#leaderboards) for outlier detection and changepoint detection problems
3. Python modules for algorithms’ evaluation (now evaluation modules are being imported from [TSAD](https://github.com/waico/tsad) framework, while the details regarding the evaluation process are presented [here](https://github.com/waico/tsad/blob/main/examples/Evaluating.ipynb))
4. Python [modules](#src) with algorithms’ implementation
5. Python [notebooks](#notebooks) with anomaly detection pipeline implementation for various algorithms

The IIoT testbed system is located in the Skolkovo Institute of Science and Technology (Skoltech).
All the details regarding the testbed and the experimenting process are presented in the following artifacts:
- Position paper (*currently submitted for publication*);
- Slides about the project: [in English](https://drive.google.com/open?id=1dHUevwPp6ftQCEKnRgB4KMp9oLBMSiDM), [in Russian](https://drive.google.com/file/d/1gThPCNbEaIxhENLm-WTFGO_9PU1Wdwjq/view?usp=share_link).
All the details about SKAB are presented in the following artifacts:
- Position paper (*currently submitted for publication*)
- Talk about the project: [English](https://youtu.be/hjzuKeNYUho) version and [Russian](https://www.youtube.com/watch?v=VLmmYGc4v2c) version
- Slides about the project: [English](https://drive.google.com/open?id=1dHUevwPp6ftQCEKnRgB4KMp9oLBMSiDM) version and [Russian](https://drive.google.com/file/d/1gThPCNbEaIxhENLm-WTFGO_9PU1Wdwjq/view?usp=share_link) version

<a name="datasets"></a>
# Datasets
The SKAB v0.9 corpus contains 35 individual data files in .csv format. Each file represents a single experiment and contains a single anomaly. The dataset represents a multivariate time series collected from the sensors installed on the testbed. The [data](data/) folder contains datasets from the benchmark. The structure of the data folder is presented in the [structure](./data/README.md) file. The columns in each data file are as follows:
The SKAB v0.9 corpus contains 35 individual data files in .csv format (datasets). The [data](data/) folder contains datasets from the benchmark. The structure of the data folder is presented in the [structure](./data/README.md) file. Each dataset represents a single experiment and contains a single anomaly. The datasets represent a multivariate time series collected from the sensors installed on the testbed. The columns in each data file are as follows:
- `datetime` - Represents dates and times of the moment when the value is written to the database (YYYY-MM-DD hh:mm:ss)
- `Accelerometer1RMS` - Shows a vibration acceleration (in g units)
- `Accelerometer2RMS` - Shows a vibration acceleration (in g units)
@@ -35,14 +35,12 @@ The SKAB v0.9 corpus contains 35 individual data files in .csv format. Each file
- `anomaly` - Shows if the point is anomalous (0 or 1)
- `changepoint` - Shows if the point is a changepoint for collective anomalies (0 or 1)
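The fields above can be loaded with pandas. Below is a minimal sketch using a small inline sample in place of a real file; the semicolon separator and column names are assumptions based on the repository's description, and the actual files carry more sensor columns than shown here.

```python
import io

import pandas as pd

# Inline stand-in for one SKAB data file (e.g. a file under data/valve1/ --
# hypothetical path); real files contain additional sensor columns.
sample = (
    "datetime;Accelerometer1RMS;Accelerometer2RMS;anomaly;changepoint\n"
    "2020-03-09 12:14:34;0.027;0.040;0.0;0.0\n"
    "2020-03-09 12:14:35;0.028;0.041;1.0;1.0\n"
)

# Parse the datetime index and keep sensor/label columns as floats.
df = pd.read_csv(io.StringIO(sample), sep=";", index_col="datetime", parse_dates=True)

n_outliers = int(df["anomaly"].sum())          # points labeled as outliers
n_changepoints = int(df["changepoint"].sum())  # points labeled as changepoints
```

The same call, pointed at a file from the [data](data/) folder, should yield a DataFrame ready for the notebooks' pipelines.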

Exploratory Data Analysis (EDA) for SKAB is presented [here](https://github.com/waico/SKAB/blob/master/notebooks/EDA.ipynb).
Exploratory Data Analysis (EDA) for SKAB is presented [here](https://github.com/waico/SKAB/blob/master/notebooks/EDA.ipynb). A Russian version of the EDA is available on [kaggle](https://www.kaggle.com/newintown/eda-example).

Russian version of EDA is also available at [kaggle](https://www.kaggle.com/newintown/eda-example).

<a name="leaderboards"></a>
ℹ️We have also made a *SKAB teaser*: a small dataset collected separately, but from the same testbed. The SKAB teaser is intended for learning/teaching purposes and contains only 4 collective anomalies. All the information is available on [kaggle](https://www.kaggle.com/datasets/yuriykatser/skoltech-anomaly-benchmark-skab-teaser).

# Leaderboards
Here we propose the leaderboards for SKAB v0.9 both for outlier and changepoint detection problems. You can also present and evaluate your algorithm using SKAB on [kaggle](https://www.kaggle.com/yuriykatser/skoltech-anomaly-benchmark-skab). Leaderboards are also available at paperswithcode.com: [CPD problem](https://paperswithcode.com/sota/change-point-detection-on-skab).
Here we propose the leaderboards for SKAB v0.9 for both outlier and changepoint detection problems. You can also present and evaluate your algorithm using SKAB on [kaggle](https://www.kaggle.com/yuriykatser/skoltech-anomaly-benchmark-skab). Leaderboards are also available at paperswithcode.com: [CPD problem](https://paperswithcode.com/sota/change-point-detection-on-skab).

❗️All results (excl. ruptures and CPDE) are calculated for out-of-box algorithms without any hyperparameter tuning.

@@ -54,12 +52,12 @@ Perfect detector | 1 | 0 | 0
Conv-AE |***0.79*** | 13.69 | ***17.77***
MSET |0.73 | 20.82 | 20.08
LSTM-AE |0.68 | 14.24 | 35.56
T-squared+Q (PCA) | 0.67 | 13.95 | 36.32
LSTM | 0.64 | 15.4 | 39.93
T-squared+Q (PCA-based) | 0.67 | 13.95 | 36.32
Vanilla LSTM | 0.64 | 15.4 | 39.93
MSCRED | 0.64 | 13.56 | 41.16
LSTM-VAE | 0.56 | 9.13 | 55.03
T-squared | 0.56 | 12.14 | 52.56
Autoencoder | 0.45 | 7.56 | 66.57
Vanilla AE | 0.45 | 7.56 | 66.57
Isolation forest | 0.4 | ***6.86*** | 72.09
Null detector | 0 | 0 | 100
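FAR and MAR in the table above follow the usual definitions: FAR is the share of normal points flagged as anomalous, and MAR is the share of anomalous points the detector missed. A plain-Python sketch of these two metrics is given below; it is illustrative only, since the leaderboard itself is scored with the evaluation modules referenced earlier.

```python
def far_mar(y_true, y_pred):
    """False alarm rate and missed alarm rate for binary point labels.

    FAR = FP / (FP + TN): fraction of normal points flagged as anomalous.
    MAR = FN / (FN + TP): fraction of anomalous points the detector missed.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    far = fp / (fp + tn) if (fp + tn) else 0.0
    mar = fn / (fn + tp) if (fn + tp) else 0.0
    return far, mar
```

For example, a detector that raises one false alarm and misses one anomaly over two normal and two anomalous points scores FAR = MAR = 0.5.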

@@ -72,15 +70,15 @@ Null detector | 0 | 0 | 100
|Perfect detector | 100 | 100 | 100 |
|Isolation forest | ***37.53*** | 17.09 | ***45.02***|
|MSCRED | 28.74 | ***23.43*** | 31.21|
|LSTM | 27.09 | 11.06 | 32.68|
|T-squared+Q (PCA) | 26.71 | 22.42 | 28.32|
|Vanilla LSTM | 27.09 | 11.06 | 32.68|
|T-squared+Q (PCA-based) | 26.71 | 22.42 | 28.32|
|ruptures** | 24.1 | 21.69 | 25.04|
|CPDE*** | 23.07 | 20.52 | 24.35|
|LSTM-AE |22.12 | 20.01 | 23.21|
|LSTM-VAE | 19.17 | 15.39 | 20.98|
|T-squared | 17.87 | 3.44 | 23.2|
|ArimaFD | 7.67 | 1.97 | 11.04 |
|Autoencoder | 15.59 | 0.78 | 20.91|
|Vanilla AE | 15.59 | 0.78 | 20.91|
|MSET | 12.71 | 11.04 | 13.6|
|Conv-AE | 10.09 | 8.62 | 10.83|
|Null detector | 0 | 0 | 0|
@@ -89,26 +87,25 @@ Null detector | 0 | 0 | 100

*** The best aggregation function (shown) is WeightedSum with MinAbs scaling function.

<a name="notebooks"></a>
# Notebooks
The [notebooks](notebooks/) folder contains Python notebooks with the code to reproduce the proposed leaderboard results. This folder also contains a short description of the algorithms and references to papers and code.

We have calculated the results for the following common anomaly detection algorithms:
- Hotelling's T-squared statistics;
- Hotelling's T-squared statistics + Q statistics based on PCA;
- Isolation forest;
- LSTM-based NN (LSTM);
- Feed-Forward Autoencoder;
- LSTM Autoencoder (LSTM-AE);
- LSTM Variational Autoencoder (LSTM-VAE);
- Convolutional Autoencoder (Conv-AE);
- Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED);
- Multivariate State Estimation Technique (MSET).

Additionally, the leaderboard shows the results of the following algorithms:
- [ArimaFD](https://github.com/waico/arimafd);
- [ruptures](https://github.com/deepcharles/ruptures) changepoint detection (CPD) algorithms;
- ruptures-based [changepoint detection ensemble (CPDE) algorithms](https://github.com/YKatser/CPDE).
The [notebooks](notebooks/) folder contains Jupyter notebooks with the code to reproduce the proposed leaderboard results. We have calculated the results for the following commonly known anomaly detection algorithms:
- Isolation forest - *Outlier detection algorithm based on Random forest concept*
- Vanilla LSTM - *NN with LSTM layer*
- Vanilla AE - *Feed-Forward Autoencoder*
- LSTM-AE - *LSTM Autoencoder*
- LSTM-VAE - *LSTM Variational Autoencoder*
- Conv-AE - *Convolutional Autoencoder*
- MSCRED - *Multi-Scale Convolutional Recurrent Encoder-Decoder*
- MSET - *Multivariate State Estimation Technique*

Additionally, the leaderboard shows the externally calculated results of the following algorithms:
- [ArimaFD](https://github.com/waico/arimafd) - *ARIMA-based fault detection algorithm*
- [T-squared](http://github.com/YKatser/ControlCharts/tree/main/examples) - *Hotelling's T-squared statistics*
- [T-squared+Q (PCA-based)](http://github.com/YKatser/ControlCharts/tree/main/examples) - *Hotelling's T-squared statistics + Q statistics based on PCA*
- [ruptures](https://github.com/deepcharles/ruptures) - *Changepoint detection (CPD) algorithms from ruptures package*
- [CPDE](https://github.com/YKatser/CPDE) - *Ruptures-based changepoint detection ensemble (CPDE) algorithms*

Details regarding the algorithms, including short descriptions, references to scientific papers, and the code of the initial implementations, are available in [this readme](https://github.com/waico/SKAB/tree/master/notebooks#anomaly-detection-algorithms).

# Citation
Please cite our project in your publications if it helps your research.
@@ -138,5 +135,6 @@ SKAB is acknowledged by some ML resources.
- [paperswithcode.com](https://paperswithcode.com/dataset/skab)
- [Google datasets](https://datasetsearch.research.google.com/search?query=skoltech%20anomaly%20benchmark&docid=IIIE4VWbqUKszygyAAAAAA%3D%3D)
- [Industrial ML Datasets](https://github.com/nicolasj92/industrial-ml-datasets)
- etc.

</details>
4 changes: 2 additions & 2 deletions data/README.md
@@ -3,12 +3,12 @@
├── Load data.ipynb # Jupyter Notebook to load all data
├── anomaly-free
│ └── anomaly-free.csv # Data obtained from the experiments with normal mode
├── valve1 # Data obtained from the experiments with closing the valve at the outlet of the flow from the pump.
├── valve2 # Data obtained from the experiments with closing the valve at the outlet of the flow from the pump.
│ ├── 1.csv
│ ├── 2.csv
│ ├── 3.csv
│ └── 4.csv
├── valve2 # Data obtained from the experiments with closing the valve at the flow inlet to the pump.
├── valve1 # Data obtained from the experiments with closing the valve at the flow inlet to the pump.
│ ├── 1.csv
│ ├── 2.csv
│ ├── 3.csv
Binary file modified docs/pictures/skab.png
Binary file modified docs/pictures/testbed.png
8 changes: 4 additions & 4 deletions notebooks/ArimaFD.ipynb
@@ -257,12 +257,12 @@
],
"source": [
"# dataset characteristics printing\n",
"print(f'A number of datasets in the SkAB v1.0: {len(list_of_df)}\\n')\n",
"print(f'The number of datasets in the SKAB v0.9: {len(list_of_df)}\n')\n",
"print(f'Shape of the random dataset: {list_of_df[0].shape}\\n')\n",
"n_cp = sum([len(df[df.changepoint==1.]) for df in list_of_df])\n",
"n_outlier = sum([len(df[df.anomaly==1.]) for df in list_of_df])\n",
"print(f'A number of changepoints in the SkAB v1.0: {n_cp}\\n')\n",
"print(f'A number of outliers in the SkAB v1.0: {n_outlier}\\n')\n",
"print(f'The number of changepoints in the SKAB v0.9: {n_cp}\n')\n",
"print(f'The number of outliers in the SKAB v0.9: {n_outlier}\n')\n",
"print(f'Head of the random dataset:')\n",
"display(list_of_df[0].head())"
]
@@ -568,7 +568,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
"version": "3.9.12"
},
"toc": {
"base_numbering": 1,
4 changes: 2 additions & 2 deletions notebooks/Conv-AE.ipynb
@@ -645,7 +645,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -659,7 +659,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
"version": "3.9.12"
}
},
"nbformat": 4,
4 changes: 2 additions & 2 deletions notebooks/LSTM-AE.ipynb
@@ -648,7 +648,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -662,7 +662,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
"version": "3.9.12"
}
},
"nbformat": 4,
4 changes: 2 additions & 2 deletions notebooks/LSTM-VAE.ipynb
@@ -736,7 +736,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -750,7 +750,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
"version": "3.9.12"
}
},
"nbformat": 4,
4 changes: 2 additions & 2 deletions notebooks/README.md
@@ -4,15 +4,15 @@
Hotelling's statistic is one of the most popular statistical process control techniques. It is based on the Mahalanobis distance.
Generally, it measures the distance between a new vector of values and a previously defined vector of normal values, additionally taking variances into account.

[[notebook]](https://github.com/waico/SKAB/blob/master/notebooks/hotelling.ipynb) [[paper]](https://www.semanticscholar.org/paper/Multivariate-Quality-Control-illustrated-by-the-air-Hotelling/529ba6c1a80b684d2f704a7565da305bb84f14e8)
[[notebook]](https://github.com/YKatser/ControlCharts/blob/main/examples/t2_SKAB.ipynb) [[paper]](https://www.semanticscholar.org/paper/Multivariate-Quality-Control-illustrated-by-the-air-Hotelling/529ba6c1a80b684d2f704a7565da305bb84f14e8)
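A minimal NumPy sketch of the statistic described above. The synthetic reference data stands in for the anomaly-free run, and the choice of alarm threshold is omitted; this is an illustration, not the leaderboard implementation.

```python
import numpy as np

# Synthetic "normal mode" reference data; in practice this would be the
# anomaly-free portion of the benchmark.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 3))

mu = train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train, rowvar=False))

def t_squared(x):
    """Hotelling's T-squared: squared Mahalanobis distance from the normal-mode mean."""
    d = np.asarray(x, dtype=float) - mu
    return float(d @ cov_inv @ d)
```

A point at the mean scores zero, while points far from the reference cloud score high, so anomalies are flagged once the statistic exceeds a chosen control limit.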

### Hotelling's T-squared statistic + Q statistic (SPE index) based on PCA
The combined index is based on PCA.
Hotelling’s T-squared statistic measures variations in the principal component subspace.
Q statistic measures the projection of the sample vector on the residual subspace.
To avoid using two separate indicators (Hotelling's T-squared and Q statistics) for process monitoring, we use a combined one based on a logical OR.

[[notebook]](https://github.com/waico/SKAB/blob/master/notebooks/hotelling_q.ipynb) [[paper]](https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/abs/10.1002/cem.800)
[[notebook]](https://github.com/YKatser/ControlCharts/blob/main/examples/t2_with_q_SKAB.ipynb) [[paper]](https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/abs/10.1002/cem.800)

### Isolation Forest
Isolation Forest (iForest) builds an ensemble of iTrees for a given data set; anomalies are those instances which have short average path lengths on the iTrees.
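A small sketch with scikit-learn's `IsolationForest` in its out-of-box configuration, mirroring the no-tuning setup used for the leaderboard. The two-dimensional synthetic data here is purely illustrative and unrelated to the SKAB sensor columns.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# A dense "normal" cluster plus a few isolated points far from it.
rng = np.random.default_rng(42)
normal = rng.normal(0.0, 1.0, size=(200, 2))
outliers = rng.uniform(6.0, 8.0, size=(5, 2))
X = np.vstack([normal, outliers])

# Out-of-box model: default contamination, fixed seed for reproducibility.
clf = IsolationForest(n_estimators=100, random_state=42)
labels = clf.fit_predict(X)  # +1 = inlier, -1 = outlier

flagged = labels == -1       # boolean mask of detected anomalies
```

Isolated points sit close to the tree roots, so they receive short average path lengths and are the ones flagged with label -1.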
