DeepCAT+: A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks (ICPP22, TPDS Revision)

Big data frameworks usually expose a large number of performance-related parameters. Online auto-tuning of these parameters with deep reinforcement learning (DRL) has shown advantages over search-based and machine learning-based approaches. Unfortunately, the time cost of the online tuning phase in conventional DRL-based methods is still heavy, especially for big data applications. To reduce the total online tuning cost and increase adaptability, DeepCAT+ does the following: 1) it uses the TD3 algorithm instead of DDPG to alleviate value overestimation; 2) it replaces conventional experience replay with a novel reward-driven prioritized experience replay mechanism that makes full use of rare but valuable transitions; 3) it introduces a Twin-Q Optimizer that estimates the execution time of each action without the costly configuration evaluation and optimizes the sub-optimal actions, achieving a low-cost exploration-exploitation tradeoff; 4) it implements an Online Continual Learner module based on Progressive Neural Networks to transfer knowledge from historical tuning experiences.
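
To make the TD3 point concrete, the snippet below is a minimal sketch of the two standard TD3 ingredients that address value overestimation: the clipped double-Q target and target policy smoothing. It is not taken from DeepCAT.py; the function names, the scalar action, and the default hyperparameters (gamma, noise_std, noise_clip, the [0, 1] action range) are assumptions chosen for illustration.

```python
import random


def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller of the two
    target-critic estimates to counter value overestimation."""
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)


def smoothed_action(action, noise_std=0.2, noise_clip=0.5,
                    low=0.0, high=1.0, rng=random):
    """Target policy smoothing: add clipped Gaussian noise to the target
    action, then clip the result to the valid action range.
    The range [low, high] is an assumption (normalized configuration value)."""
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    return max(low, min(high, action + noise))


if __name__ == "__main__":
    # The two target critics disagree (5.0 vs 3.0); the target uses 3.0.
    print(td3_target(reward=1.0, done=0.0, q1_next=5.0, q2_next=3.0, gamma=0.5))
    print(smoothed_action(0.9))
```

In DDPG a single critic provides the bootstrap value; taking the minimum over two critics is what biases the target downward.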

New features in DeepCAT+ beyond ICPP22 Paper DeepCAT

A Progressive Neural Networks (PNN) based Online Continual Learner that improves adaptability to dynamic workloads and changes in the hardware environment.

Log-based workload features extraction
PNN-based knowledge transfer
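
As an illustration of PNN-based knowledge transfer, the following is a minimal sketch of a two-column progressive network forward pass in plain Python. The first column is treated as frozen (trained on an earlier workload), and the new column receives its hidden features through a lateral connection. The layer sizes, weight names, and the function pnn_forward are hypothetical and do not reflect the structure of DeepCAT_with_PNN.py.

```python
def linear(weights, x):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(v):
    return [max(0.0, a) for a in v]


def pnn_forward(x, frozen_w1, new_w1, new_w2, lateral_u):
    """Two-column progressive network with one hidden layer per column.

    frozen_w1: first-layer weights of the old column (not updated).
    new_w1, new_w2: weights of the new column (trainable).
    lateral_u: lateral weights from the old column's hidden layer
               into the new column's output layer (trainable).
    """
    h_old = relu(linear(frozen_w1, x))   # features learned on the old task
    h_new = relu(linear(new_w1, x))      # features learned on the new task
    own = linear(new_w2, h_new)
    transferred = linear(lateral_u, h_old)
    return [a + b for a, b in zip(own, transferred)]


if __name__ == "__main__":
    out = pnn_forward(
        x=[1.0, 2.0],
        frozen_w1=[[1.0, 0.0], [0.0, -1.0]],
        new_w1=[[0.5, 0.5]],
        new_w2=[[2.0]],
        lateral_u=[[1.0, 1.0]],
    )
    print(out)
```

Because the old column's weights are never updated, knowledge from earlier tuning tasks is preserved while the new column learns how much of it to reuse.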

Start

Cluster deployment

Install the Hadoop distributed environment and file system.
Install the Spark computing framework.
Install and compile the HiBench testing framework.
Install Ansible Playbook for batch configuration and automated deployment.

Steps for reproducing DeepCAT+’s Results

Data collection: collect offline exploration data, including cluster metric states, configuration values, and rewards. The Python programs interact with the cluster through Ansible; check target/target_spark/readme.md for more details.
Use the data to form the memory pool for offline training and save the model; see the offline_train() function in DeepCAT.py.
Use the model to tune the configuration of the big data framework with tune() in DeepCAT.py. Note that there are two policies:
- if the workload is known, DeepCAT+ conducts the optimization directly; details are in DeepCAT.py.
- if the workload is unknown, DeepCAT+ uses Progressive Neural Networks for continual learning to enhance its adaptability; details are in DeepCAT_with_PNN.py.
Compare DeepCAT with the CDBTune, OtterTune and QTune baselines.
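
The data collection step drives the cluster from Python through Ansible. The sketch below shows one way such a call could be assembled, using the standard ansible-playbook options -i (inventory) and --extra-vars. The playbook name, inventory path, and the spark_conf variable are hypothetical placeholders; the actual scripts are documented in target/target_spark/readme.md.

```python
import json
import subprocess
from typing import Dict, List


def build_playbook_command(playbook: str, inventory: str,
                           config: Dict[str, str]) -> List[str]:
    """Build an ansible-playbook command that passes a candidate
    configuration to the playbook as a JSON extra variable.
    The variable name 'spark_conf' is an assumption for illustration."""
    extra_vars = json.dumps({"spark_conf": config})
    return ["ansible-playbook", "-i", inventory, playbook,
            "--extra-vars", extra_vars]


def apply_config(playbook: str, inventory: str,
                 config: Dict[str, str]) -> int:
    """Run the playbook and return its exit code (requires Ansible)."""
    cmd = build_playbook_command(playbook, inventory, config)
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Hypothetical file names; only prints the command, does not run it.
    print(build_playbook_command(
        "run_workload.yml", "hosts.ini",
        {"spark.executor.memory": "4g", "spark.executor.cores": "2"},
    ))
```

Passing the arguments as a list avoids shell quoting issues with the JSON payload.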

Environment Version

Hadoop 2.7.3
Spark 2.2.2
HiBench 7.0
Ansible

Install dependencies (with Python 3.8)

pip install -r requirements.txt

Benchmark

We use 9 workloads with different input data sizes from HiBench (The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis).

WordCount (WC)
TeraSort (TS)
PageRank (PR)
KMeans (KM)
Gradient Boosted Trees (GBT)
Nweight (NW)
Principal Component Analysis (PCA)
Aggregation (AGG)
WordCount (for streaming)

Baseline

Datasets

The dataset was collected on the local 3-node Spark cluster and contains the execution times of 4 Spark workloads under different configuration values; check dataset for more details.
For reinforcement learning training, the memory pools, which consist of transitions (s, a, r, s'), are in test_kit/ultimate/memory.
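
To illustrate what a memory pool of (s, a, r, s') transitions looks like, here is a minimal sketch of a replay memory that samples high-reward transitions more often. The weighting rule (reward shifted by the pool minimum, raised to alpha) is an assumption made for illustration; it is not the reward-driven prioritized experience replay implemented in this repository, and the class and field names are hypothetical.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Transition:
    state: Sequence[float]       # s: cluster metric state
    action: Sequence[float]      # a: normalized configuration values
    reward: float                # r
    next_state: Sequence[float]  # s'


class RewardPrioritizedMemory:
    def __init__(self, capacity: int = 10000, alpha: float = 1.0,
                 eps: float = 1e-3):
        self.pool = deque(maxlen=capacity)  # oldest entries are dropped
        self.alpha = alpha
        self.eps = eps  # keeps the lowest-reward transition sampleable

    def add(self, transition: Transition) -> None:
        self.pool.append(transition)

    def weights(self) -> List[float]:
        """Sampling weight grows with reward (illustrative rule only)."""
        r_min = min(t.reward for t in self.pool)
        return [(t.reward - r_min + self.eps) ** self.alpha
                for t in self.pool]

    def sample(self, k: int, rng=random) -> List[Transition]:
        return rng.choices(list(self.pool), weights=self.weights(), k=k)


if __name__ == "__main__":
    memory = RewardPrioritizedMemory()
    for r in (-1.0, 0.0, 2.0):
        memory.add(Transition([0.1], [0.5], r, [0.2]))
    print([t.reward for t in memory.sample(5)])
```

With uniform replay, a rare transition with an unusually high reward is drawn as seldom as any other; weighting by reward is one simple way to revisit it more often.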

Configuration details

Description of the performance-critical parameters from Spark, YARN and HDFS
For experiments on Flink, check test_kit/ultimate/flink-experimental/readme.md for more details.

Hardware environments

We use seven experimental clusters to evaluate the effectiveness of DeepCAT/DeepCAT+ and its robustness to various hardware environments.

Cluster	Nodes	Cluster types	BD frameworks	Evaluation
Cluster_A	3	Physical machines	Spark	Effectiveness
Cluster_B	3	VMs_1	Spark	Adaptability
Cluster_C	6	Physical machines	Flink	Other BD frameworks
Cluster_D	5	Physical machines + VMs_2	Spark	Heterogeneous clusters
Cluster_E	8	VMs_3	Spark	Large-scale clusters
Cluster_F	10	VMs_3	Spark	Large-scale clusters
Cluster_G	12	VMs_3	Spark	Large-scale clusters

Physical machines: 8 cores, 16GB memory
VMs_1: 8 cores, 8GB memory
VMs_2: 12 cores, 8GB memory
VMs_3: 8 cores, 16GB memory
