
Commit 80bac4a

Annual update, a lot of information added and updated, resolves #39, resolves #51

YKatser committed Aug 9, 2023
1 parent 6a3e1e8 commit 80bac4a
Showing 18 changed files with 1,620 additions and 1,565 deletions.
67 changes: 8 additions & 59 deletions .gitignore
@@ -1,16 +1,18 @@
.DS_Store
data/.DS_Store
src/.DS_Store
docs/.DS_Store
notebooks/.DS_Store
notebooks/venv/
notebooks/*.h5

# Byte-compiled / optimized / DLL files
__pycache__/
src/__pycache__/
notebooks/__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
@@ -56,31 +58,10 @@ coverage.xml
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints
notebooks/.ipynb_checkpoints
src/.ipynb_checkpoints

# IPython
profile_default/
@@ -89,46 +70,14 @@ ipython_config.py
# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
venv.bak/
84 changes: 41 additions & 43 deletions README.md
@@ -1,28 +1,28 @@
![skab](docs/pictures/skab.png)

❗️❗️❗️**The testbed is under repair right now. Unfortunately, we can't tell exactly when it will be ready or when we will be able to resume data collection. Updates will be posted in this repository. Sorry for the delay.**
🛠🛠🛠**The testbed is under repair right now. Unfortunately, we can't tell exactly when it will be ready or when we will be able to resume data collection. Updates will be posted in this repository. Sorry for the delay.**

❗️❗️❗️The current version of SKAB (v0.9) contains 34 datasets with collective anomalies. But the update to v1.0 will contain 300+ additional files with point and collective anomalies. It will make SKAB one of the largest changepoint-containing benchmarks, especially in the technical field.

# About SKAB [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/waico/SKAB/graphs/commit-activity) [![DOI](https://img.shields.io/badge/DOI-10.34740/kaggle/dsv/1693952-blue.svg)](https://doi.org/10.34740/KAGGLE/DSV/1693952) [![License: GPL v3.0](https://img.shields.io/badge/License-GPL%20v3.0-green.svg)](https://www.gnu.org/licenses/gpl-3.0.html)
We propose the [Skoltech](https://www.skoltech.ru/en) Anomaly Benchmark (SKAB), designed for evaluating anomaly detection algorithms. SKAB covers two main problems (each with its own anomaly markup):
1. Outlier detection (anomalies considered and marked up as single-point anomalies);
2. Changepoint detection (anomalies considered and marked up as collective anomalies).
1. Outlier detection (anomalies considered and marked up as single-point anomalies)
2. Changepoint detection (anomalies considered and marked up as collective anomalies)

SKAB consists of the following artifacts:
1. [Datasets](#datasets);
2. [Leaderboards](#leaderboards) for outlier detection and changepoint detection problems;
3. Python [modules](https://github.com/waico/SKAB/blob/master/utils/evaluating.py) for algorithms’ evaluation;
4. Python [notebooks](#notebooks) with anomaly detection algorithms.
1. [Datasets](#datasets)
2. [Leaderboards](#leaderboards) for outlier detection and changepoint detection problems
3. Python modules for algorithms’ evaluation (now evaluation modules are being imported from [TSAD](https://github.com/waico/tsad) framework, while the details regarding the evaluation process are presented [here](https://github.com/waico/tsad/blob/main/examples/Evaluating.ipynb))
4. Python [modules](#src) with algorithms’ implementation
5. Python [notebooks](#notebooks) with anomaly detection pipeline implementation for various algorithms

The IIoT testbed system is located in the Skolkovo Institute of Science and Technology (Skoltech).
All the details regarding the testbed and the experimenting process are presented in the following artifacts:
- Position paper (*currently submitted for publication*);
- Slides about the project: [in English](https://drive.google.com/open?id=1dHUevwPp6ftQCEKnRgB4KMp9oLBMSiDM), [in Russian](https://drive.google.com/file/d/1gThPCNbEaIxhENLm-WTFGO_9PU1Wdwjq/view?usp=share_link).
All the details about SKAB are presented in the following artifacts:
- Position paper (*currently submitted for publication*)
- Talk about the project: [English](https://youtu.be/hjzuKeNYUho) version and [Russian](https://www.youtube.com/watch?v=VLmmYGc4v2c) version
- Slides about the project: [English](https://drive.google.com/open?id=1dHUevwPp6ftQCEKnRgB4KMp9oLBMSiDM) version and [Russian](https://drive.google.com/file/d/1gThPCNbEaIxhENLm-WTFGO_9PU1Wdwjq/view?usp=share_link) version

<a name="datasets"></a>
# Datasets
The SKAB v0.9 corpus contains 35 individual data files in .csv format. Each file represents a single experiment and contains a single anomaly. The dataset represents a multivariate time series collected from the sensors installed on the testbed. The [data](data/) folder contains datasets from the benchmark. The structure of the data folder is presented in the [structure](./data/README.md) file. The columns in each data file are as follows:
The SKAB v0.9 corpus contains 35 individual data files in .csv format (datasets). The [data](data/) folder contains datasets from the benchmark. The structure of the data folder is presented in the [structure](./data/README.md) file. Each dataset represents a single experiment and contains a single anomaly. The datasets represent a multivariate time series collected from the sensors installed on the testbed. The columns in each data file are as follows:
- `datetime` - Represents dates and times of the moment when the value is written to the database (YYYY-MM-DD hh:mm:ss)
- `Accelerometer1RMS` - Shows a vibration acceleration (in g units)
- `Accelerometer2RMS` - Shows a vibration acceleration (in g units)
@@ -35,14 +35,12 @@ The SKAB v0.9 corpus contains 35 individual data files in .csv format. Each file
- `anomaly` - Shows if the point is anomalous (0 or 1)
- `changepoint` - Shows if the point is a changepoint for collective anomalies (0 or 1)
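The fields above can be loaded with pandas. Below is a minimal sketch using a small inline sample in place of a real file; the semicolon separator and column names are assumptions based on the repository's description, and the actual files carry more sensor columns than shown here.

```python
import io

import pandas as pd

# Inline stand-in for one SKAB data file (e.g. a file under data/valve1/ --
# hypothetical path); real files contain additional sensor columns.
sample = (
    "datetime;Accelerometer1RMS;Accelerometer2RMS;anomaly;changepoint\n"
    "2020-03-09 12:14:34;0.027;0.040;0.0;0.0\n"
    "2020-03-09 12:14:35;0.028;0.041;1.0;1.0\n"
)

# Parse the datetime index and keep sensor/label columns as floats.
df = pd.read_csv(io.StringIO(sample), sep=";", index_col="datetime", parse_dates=True)

n_outliers = int(df["anomaly"].sum())          # points labeled as outliers
n_changepoints = int(df["changepoint"].sum())  # points labeled as changepoints
```

The same call, pointed at a file from the [data](data/) folder, should yield a DataFrame ready for the notebooks' pipelines.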

Exploratory Data Analysis (EDA) for SKAB is presented [here](https://github.com/waico/SKAB/blob/master/notebooks/EDA.ipynb).
Exploratory Data Analysis (EDA) for SKAB is presented [here](https://github.com/waico/SKAB/blob/master/notebooks/EDA.ipynb). A Russian version of the EDA is available on [kaggle](https://www.kaggle.com/newintown/eda-example).

Russian version of EDA is also available at [kaggle](https://www.kaggle.com/newintown/eda-example).

<a name="leaderboards"></a>
ℹ️We have also made a *SKAB teaser*: a small dataset collected separately, but from the same testbed. The SKAB teaser is intended for learning/teaching purposes and contains only 4 collective anomalies. All the information is available on [kaggle](https://www.kaggle.com/datasets/yuriykatser/skoltech-anomaly-benchmark-skab-teaser).

# Leaderboards
Here we propose the leaderboards for SKAB v0.9 both for outlier and changepoint detection problems. You can also present and evaluate your algorithm using SKAB on [kaggle](https://www.kaggle.com/yuriykatser/skoltech-anomaly-benchmark-skab). Leaderboards are also available at paperswithcode.com: [CPD problem](https://paperswithcode.com/sota/change-point-detection-on-skab).
Here we propose the leaderboards for SKAB v0.9 for both outlier and changepoint detection problems. You can also present and evaluate your algorithm using SKAB on [kaggle](https://www.kaggle.com/yuriykatser/skoltech-anomaly-benchmark-skab). Leaderboards are also available at paperswithcode.com: [CPD problem](https://paperswithcode.com/sota/change-point-detection-on-skab).

❗️All results (excl. ruptures and CPDE) are calculated for out-of-box algorithms without any hyperparameter tuning.

@@ -54,12 +52,12 @@ Perfect detector | 1 | 0 | 0
Conv-AE |***0.79*** | 13.69 | ***17.77***
MSET |0.73 | 20.82 | 20.08
LSTM-AE |0.68 | 14.24 | 35.56
T-squared+Q (PCA) | 0.67 | 13.95 | 36.32
LSTM | 0.64 | 15.4 | 39.93
T-squared+Q (PCA-based) | 0.67 | 13.95 | 36.32
Vanilla LSTM | 0.64 | 15.4 | 39.93
MSCRED | 0.64 | 13.56 | 41.16
LSTM-VAE | 0.56 | 9.13 | 55.03
T-squared | 0.56 | 12.14 | 52.56
Autoencoder | 0.45 | 7.56 | 66.57
Vanilla AE | 0.45 | 7.56 | 66.57
Isolation forest | 0.4 | ***6.86*** | 72.09
Null detector | 0 | 0 | 100
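FAR and MAR in the table above follow the usual definitions: FAR is the share of normal points flagged as anomalous, and MAR is the share of anomalous points the detector missed. A plain-Python sketch of these two metrics is given below; it is illustrative only, since the leaderboard itself is scored with the evaluation modules referenced earlier.

```python
def far_mar(y_true, y_pred):
    """False alarm rate and missed alarm rate for binary point labels.

    FAR = FP / (FP + TN): fraction of normal points flagged as anomalous.
    MAR = FN / (FN + TP): fraction of anomalous points the detector missed.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    far = fp / (fp + tn) if (fp + tn) else 0.0
    mar = fn / (fn + tp) if (fn + tp) else 0.0
    return far, mar
```

For example, a detector that raises one false alarm and misses one anomaly over two normal and two anomalous points scores FAR = MAR = 0.5.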

@@ -72,15 +70,15 @@ Null detector | 0 | 0 | 100
|Perfect detector | 100 | 100 | 100 |
|Isolation forest | ***37.53*** | 17.09 | ***45.02***|
|MSCRED | 28.74 | ***23.43*** | 31.21|
|LSTM | 27.09 | 11.06 | 32.68|
|T-squared+Q (PCA) | 26.71 | 22.42 | 28.32|
|Vanilla LSTM | 27.09 | 11.06 | 32.68|
|T-squared+Q (PCA-based) | 26.71 | 22.42 | 28.32|
|ruptures** | 24.1 | 21.69 | 25.04|
|CPDE*** | 23.07 | 20.52 | 24.35|
|LSTM-AE |22.12 | 20.01 | 23.21|
|LSTM-VAE | 19.17 | 15.39 | 20.98|
|T-squared | 17.87 | 3.44 | 23.2|
|ArimaFD | 7.67 | 1.97 | 11.04 |
|Autoencoder | 15.59 | 0.78 | 20.91|
|Vanilla AE | 15.59 | 0.78 | 20.91|
|MSET | 12.71 | 11.04 | 13.6|
|Conv-AE | 10.09 | 8.62 | 10.83|
|Null detector | 0 | 0 | 0|
@@ -89,26 +87,25 @@ Null detector | 0 | 0 | 100

*** The best aggregation function (shown) is WeightedSum with MinAbs scaling function.

<a name="notebooks"></a>
# Notebooks
The [notebooks](notebooks/) folder contains Python notebooks with the code to reproduce the proposed leaderboard results. This folder also contains a short description of the algorithms and references to papers and code.

We have calculated the results for the following common anomaly detection algorithms:
- Hotelling's T-squared statistics;
- Hotelling's T-squared statistics + Q statistics based on PCA;
- Isolation forest;
- LSTM-based NN (LSTM);
- Feed-Forward Autoencoder;
- LSTM Autoencoder (LSTM-AE);
- LSTM Variational Autoencoder (LSTM-VAE);
- Convolutional Autoencoder (Conv-AE);
- Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED);
- Multivariate State Estimation Technique (MSET).

Additionally, the leaderboard shows the results of the following algorithms:
- [ArimaFD](https://github.com/waico/arimafd);
- [ruptures](https://github.com/deepcharles/ruptures) changepoint detection (CPD) algorithms;
- ruptures-based [changepoint detection ensemble (CPDE) algorithms](https://github.com/YKatser/CPDE).
The [notebooks](notebooks/) folder contains Jupyter notebooks with the code to reproduce the proposed leaderboard results. We have calculated the results for the following commonly known anomaly detection algorithms:
- Isolation forest - *Outlier detection algorithm based on Random forest concept*
- Vanilla LSTM - *NN with LSTM layer*
- Vanilla AE - *Feed-Forward Autoencoder*
- LSTM-AE - *LSTM Autoencoder*
- LSTM-VAE - *LSTM Variational Autoencoder*
- Conv-AE - *Convolutional Autoencoder*
- MSCRED - *Multi-Scale Convolutional Recurrent Encoder-Decoder*
- MSET - *Multivariate State Estimation Technique*

Additionally, the leaderboard shows the externally calculated results of the following algorithms:
- [ArimaFD](https://github.com/waico/arimafd) - *ARIMA-based fault detection algorithm*
- [T-squared](http://github.com/YKatser/ControlCharts/tree/main/examples) - *Hotelling's T-squared statistics*
- [T-squared+Q (PCA-based)](http://github.com/YKatser/ControlCharts/tree/main/examples) - *Hotelling's T-squared statistics + Q statistics based on PCA*
- [ruptures](https://github.com/deepcharles/ruptures) - *Changepoint detection (CPD) algorithms from ruptures package*
- [CPDE](https://github.com/YKatser/CPDE) - *Ruptures-based changepoint detection ensemble (CPDE) algorithms*

Details regarding the algorithms, including short descriptions, references to scientific papers, and the code of the initial implementations, are available in [this readme](https://github.com/waico/SKAB/tree/master/notebooks#anomaly-detection-algorithms).

# Citation
Please cite our project in your publications if it helps your research.
@@ -138,5 +135,6 @@ SKAB is acknowledged by some ML resources.
- [paperswithcode.com](https://paperswithcode.com/dataset/skab)
- [Google datasets](https://datasetsearch.research.google.com/search?query=skoltech%20anomaly%20benchmark&docid=IIIE4VWbqUKszygyAAAAAA%3D%3D)
- [Industrial ML Datasets](https://github.com/nicolasj92/industrial-ml-datasets)
- etc.

</details>
4 changes: 2 additions & 2 deletions data/README.md
@@ -3,12 +3,12 @@
├── Load data.ipynb # Jupyter Notebook to load all data
├── anomaly-free
│ └── anomaly-free.csv # Data obtained from the experiments with normal mode
├── valve1 # Data obtained from the experiments with closing the valve at the outlet of the flow from the pump.
├── valve2 # Data obtained from the experiments with closing the valve at the outlet of the flow from the pump.
│ ├── 1.csv
│ ├── 2.csv
│ ├── 3.csv
│ └── 4.csv
├── valve2 # Data obtained from the experiments with closing the valve at the flow inlet to the pump.
├── valve1 # Data obtained from the experiments with closing the valve at the flow inlet to the pump.
│ ├── 1.csv
│ ├── 2.csv
│ ├── 3.csv
Binary file modified docs/pictures/skab.png
Binary file modified docs/pictures/testbed.png
8 changes: 4 additions & 4 deletions notebooks/ArimaFD.ipynb
@@ -257,12 +257,12 @@
],
"source": [
"# dataset characteristics printing\n",
"print(f'A number of datasets in the SkAB v1.0: {len(list_of_df)}\\n')\n",
"print(f'The number of datasets in the SKAB v0.9: {len(list_of_df)}\n')\n",
"print(f'Shape of the random dataset: {list_of_df[0].shape}\\n')\n",
"n_cp = sum([len(df[df.changepoint==1.]) for df in list_of_df])\n",
"n_outlier = sum([len(df[df.anomaly==1.]) for df in list_of_df])\n",
"print(f'A number of changepoints in the SkAB v1.0: {n_cp}\\n')\n",
"print(f'A number of outliers in the SkAB v1.0: {n_outlier}\\n')\n",
"print(f'The number of changepoints in the SKAB v0.9: {n_cp}\n')\n",
"print(f'The number of outliers in the SKAB v0.9: {n_outlier}\n')\n",
"print(f'Head of the random dataset:')\n",
"display(list_of_df[0].head())"
]
@@ -568,7 +568,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
"version": "3.9.12"
},
"toc": {
"base_numbering": 1,
4 changes: 2 additions & 2 deletions notebooks/Conv-AE.ipynb
@@ -645,7 +645,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -659,7 +659,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
"version": "3.9.12"
}
},
"nbformat": 4,
4 changes: 2 additions & 2 deletions notebooks/LSTM-AE.ipynb
@@ -648,7 +648,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -662,7 +662,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
"version": "3.9.12"
}
},
"nbformat": 4,
4 changes: 2 additions & 2 deletions notebooks/LSTM-VAE.ipynb
@@ -736,7 +736,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -750,7 +750,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
"version": "3.9.12"
}
},
"nbformat": 4,
4 changes: 2 additions & 2 deletions notebooks/README.md
@@ -4,15 +4,15 @@
Hotelling's statistic is one of the most popular statistical process control techniques. It is based on the Mahalanobis distance.
Generally, it measures the distance between a new vector of values and a previously defined vector of normal values, additionally taking variances into account.

[[notebook]](https://github.com/waico/SKAB/blob/master/notebooks/hotelling.ipynb) [[paper]](https://www.semanticscholar.org/paper/Multivariate-Quality-Control-illustrated-by-the-air-Hotelling/529ba6c1a80b684d2f704a7565da305bb84f14e8)
[[notebook]](https://github.com/YKatser/ControlCharts/blob/main/examples/t2_SKAB.ipynb) [[paper]](https://www.semanticscholar.org/paper/Multivariate-Quality-Control-illustrated-by-the-air-Hotelling/529ba6c1a80b684d2f704a7565da305bb84f14e8)
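A minimal NumPy sketch of the statistic described above. The synthetic reference data stands in for the anomaly-free run, and the choice of alarm threshold is omitted; this is an illustration, not the leaderboard implementation.

```python
import numpy as np

# Synthetic "normal mode" reference data; in practice this would be the
# anomaly-free portion of the benchmark.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 3))

mu = train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train, rowvar=False))

def t_squared(x):
    """Hotelling's T-squared: squared Mahalanobis distance from the normal-mode mean."""
    d = np.asarray(x, dtype=float) - mu
    return float(d @ cov_inv @ d)
```

A point at the mean scores zero, while points far from the reference cloud score high, so anomalies are flagged once the statistic exceeds a chosen control limit.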

### Hotelling's T-squared statistic + Q statistic (SPE index) based on PCA
The combined index is based on PCA.
Hotelling’s T-squared statistic measures variations in the principal component subspace.
Q statistic measures the projection of the sample vector on the residual subspace.
To avoid using two separate indicators (Hotelling's T-squared and Q statistics) for process monitoring, we use a combined one based on a logical OR.

[[notebook]](https://github.com/waico/SKAB/blob/master/notebooks/hotelling_q.ipynb) [[paper]](https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/abs/10.1002/cem.800)
[[notebook]](https://github.com/YKatser/ControlCharts/blob/main/examples/t2_with_q_SKAB.ipynb) [[paper]](https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/abs/10.1002/cem.800)

### Isolation Forest
Isolation Forest (iForest) builds an ensemble of iTrees for a given data set; anomalies are those instances which have short average path lengths on the iTrees.
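A small sketch with scikit-learn's `IsolationForest` in its out-of-box configuration, mirroring the no-tuning setup used for the leaderboard. The two-dimensional synthetic data here is purely illustrative and unrelated to the SKAB sensor columns.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# A dense "normal" cluster plus a few isolated points far from it.
rng = np.random.default_rng(42)
normal = rng.normal(0.0, 1.0, size=(200, 2))
outliers = rng.uniform(6.0, 8.0, size=(5, 2))
X = np.vstack([normal, outliers])

# Out-of-box model: default contamination, fixed seed for reproducibility.
clf = IsolationForest(n_estimators=100, random_state=42)
labels = clf.fit_predict(X)  # +1 = inlier, -1 = outlier

flagged = labels == -1       # boolean mask of detected anomalies
```

Isolated points sit close to the tree roots, so they receive short average path lengths and are the ones flagged with label -1.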
