Machine Learning Benchmarks contains implementations of machine learning algorithms across data analytics frameworks. Scikit-learn_bench can be extended to add new frameworks and algorithms. It currently supports the scikit-learn, daal4py, cuML, and XGBoost frameworks for commonly used machine learning algorithms.
We publish blogs on Medium, so follow us to learn tips and tricks for more efficient data analysis. Here are our latest blogs:
- Intel Gives Scikit-Learn the Performance Boost Data Scientists Need
- From Hours to Minutes: 600x Faster SVM
- Improve the Performance of XGBoost and LightGBM Inference
- Accelerate Kaggle Challenges Using Intel AI Analytics Toolkit
- Accelerate Your scikit-learn Applications
- Optimizing XGBoost Training Performance
- Accelerate Linear Models for Machine Learning
- Accelerate K-Means Clustering
- Fast Gradient Boosting Tree Inference
- Prerequisites
- How to create conda environment for benchmarking
- How to enable daal4py patching for scikit-learn benchmarks
- Running Python benchmarks with runner script
- Supported algorithms
- Algorithms parameters
Create a suitable conda environment for each framework you want to test. The commands below create an appropriate conda environment for each supported framework.
- scikit-learn: conda create -n bench -c intel python=3.7 scikit-learn daal4py pandas
- daal4py: conda create -n bench -c intel python=3.7 scikit-learn daal4py pandas
- cuml: conda create -n bench -c rapidsai -c conda-forge python=3.7 cuml pandas cudf
- xgboost: conda create -n bench -c conda-forge python=3.7 xgboost pandas
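Once an environment is created, activate it before running any benchmarks (the environment name bench matches the commands above):

conda activate bench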
To launch benchmarks, run:

python runner.py --configs configs/config_example.json [--output-file result.json --verbose INFO --report]
Runner options:
- configs: paths to configuration files
- no-intel-optimized: use the non-Intel-optimized version. Currently available for scikit-learn benchmarks only. The Intel-optimized version is the default.
- output-file: output file name for the benchmark results. Default is result.json.
- report: create an Excel report based on the benchmark results. Requires the openpyxl library.
- dummy-run: run the configuration parser and dataset generation without running the benchmarks.
- verbose: logging level (WARNING, INFO, DEBUG); controls how much information is printed while the benchmarks run. Default is INFO.
Level | Description |
---|---|
DEBUG | Detailed information, typically of interest only when diagnosing problems. At this level the logging output is usually so low-level that it is not useful to users who are not familiar with the software’s internals. |
INFO | Confirmation that things are working as expected. |
WARNING | An indication that something unexpected happened, or indicative of some problem in the near future (e.g. ‘disk space low’). The software is still working as expected. |
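For example, the following invocation combines the flags described above to write results to a custom file, generate an Excel report, and print debug-level output (the output file name is illustrative):

python runner.py --configs configs/config_example.json --output-file my_results.json --report --verbose DEBUG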
Benchmarks currently support the following frameworks:
- scikit-learn
- daal4py
- cuml
- xgboost
The benchmark configuration lets you select which frameworks to run, which datasets to measure, and the parameters of the algorithms.
You can configure benchmarks by editing a config file. Check the config.json schema for more details.
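For illustration, a minimal config sketch is shown below. The field names (common, cases, algorithm, dataset, and so on) and values are assumptions based on a typical case definition; treat config.json schema and configs/config_example.json as the authoritative reference.

```json
{
  "common": {
    "lib": ["sklearn"],
    "data-format": ["pandas"],
    "data-order": ["F"],
    "dtype": ["float64"]
  },
  "cases": [
    {
      "algorithm": "kmeans",
      "dataset": [
        {
          "source": "synthetic",
          "type": "blobs",
          "n_clusters": 10,
          "n_features": 50,
          "training": { "n_samples": 100000 }
        }
      ],
      "n-clusters": [10]
    }
  ]
}
```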
algorithm | benchmark name | sklearn | daal4py | cuml | xgboost |
---|---|---|---|---|---|
DBSCAN | dbscan | ✅ | ✅ | ✅ | ❌ |
RandomForestClassifier | df_clfs | ✅ | ✅ | ✅ | ❌ |
RandomForestRegressor | df_regr | ✅ | ✅ | ✅ | ❌ |
pairwise_distances | distances | ✅ | ✅ | ❌ | ❌ |
KMeans | kmeans | ✅ | ✅ | ✅ | ❌ |
KNeighborsClassifier | knn_clsf | ✅ | ❌ | ✅ | ❌ |
LinearRegression | linear | ✅ | ✅ | ✅ | ❌ |
LogisticRegression | log_reg | ✅ | ✅ | ✅ | ❌ |
PCA | pca | ✅ | ✅ | ✅ | ❌ |
Ridge | ridge | ✅ | ✅ | ✅ | ❌ |
SVM | svm | ✅ | ✅ | ✅ | ❌ |
train_test_split | train_test_split | ✅ | ❌ | ✅ | ❌ |
GradientBoostingClassifier | gbt | ❌ | ❌ | ❌ | ✅ |
GradientBoostingRegressor | gbt | ❌ | ❌ | ❌ | ✅ |
You can launch benchmarks for each algorithm separately. To do this, go to the directory with the benchmark:
cd <framework>
Run the following command:
python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
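As a hypothetical example (the directory name sklearn and file name kmeans.py are assumptions; check the framework directory for the actual benchmark files), running the scikit-learn KMeans benchmark could look like:

cd sklearn
python kmeans.py --dataset-name <path to the dataset> <other algorithm parameters>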
You can find the list of supported parameters for each algorithm here: