Update READMEs
umayrh committed Jan 31, 2019
1 parent 697f832 commit fdb566d
Showing 2 changed files with 45 additions and 0 deletions.
9 changes: 9 additions & 0 deletions README.md
@@ -4,6 +4,15 @@
[![Coverage Status](https://coveralls.io/repos/github/umayrh/sparktuner/badge.svg?branch=master)](https://coveralls.io/github/umayrh/sparktuner?branch=master)
[![Maintainability](https://api.codeclimate.com/v1/badges/1b9ae406a6e8b922405a/maintainability)](https://codeclimate.com/github/umayrh/sparktuner/maintainability)


#### Objectives:

- Tune Spark configuration parameters in a hands-off manner
- Learn from tuning experiences over time to:
    - Tune more efficiently,
    - Answer counterfactual questions about application performance, and
    - Suggest interventions to improve application performance (potentially even code changes or environment updates, apart from configuration settings)

## Setup, build, and usage

#### Setup
36 changes: 36 additions & 0 deletions dev-README.md
@@ -39,6 +39,38 @@ There are more detailed instructions for navigating GitHub's pull request system

## TODO

#### Objectives:
* Tune Spark configuration parameters in a hands-off manner
* Learn from tuning experiences over time to:
  * Tune more efficiently,
  * Answer counterfactual questions about application performance, and
  * Suggest interventions to improve application performance (potentially even
    code changes or environment updates, apart from configuration settings).

#### High-level directions:
* Optimization: which algorithms are most effective for tuning Spark, and why?
  The meta-question is one of optimal experimental design and the selection of
  an appropriate objective. Some candidate algorithms:
  * Zero-gradient techniques, e.g., randomized Nelder-Mead
  * Multi-armed bandits
  * Branch-and-bound sampling, e.g., Latin hypercube sampling
  * Bayesian optimization
  * Control-theoretic techniques, e.g., Kalman filtering
* Learning: is there a way to encode a Spark parameter's effect on
  runtime/resources for a fixed app?
* Want to be able to ask counterfactual questions about parameters, code change,
or data size change.
* Extensions:
  * What other resource-management (RM) frameworks, apart from YARN, should this work on?
* What kind of UI might be useful? Job submission framework?
* What kind of backend service would be useful?
* Should we allow running tests in parallel? What changes would be required?
* What kind of metrics should be stored, and where?
* Externalities:
* How can cluster issues be accounted for?
* How can code issues be accounted for using the query plan and configuration
information?
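
To make the optimization direction concrete, here is a hedged sketch of the simplest zero-gradient approach, random search over a Spark-like configuration space. The parameter names, ranges, and the mock objective are all illustrative, not sparktuner's actual schema:

```python
import random

# Hypothetical search space: names and ranges are illustrative only.
SPACE = {
    "spark.executor.memory_gb": (1, 16),
    "spark.executor.cores": (1, 8),
    "spark.sql.shuffle.partitions": (8, 512),
}

def sample_config(rng):
    """Draw one configuration uniformly at random from SPACE."""
    return {k: rng.randint(lo, hi) for k, (lo, hi) in SPACE.items()}

def random_search(objective, n_trials=50, seed=0):
    """Zero-gradient tuning: evaluate random configs, keep the best.

    `objective` maps a config dict to a cost (e.g. measured runtime);
    lower is better.
    """
    rng = random.Random(seed)
    best_cfg, best_cost = None, float("inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        cost = objective(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost

# Mock objective standing in for an actual timed Spark run: penalizes
# both over- and under-provisioning around an arbitrary sweet spot.
def mock_runtime(cfg):
    return (abs(cfg["spark.executor.memory_gb"] - 8)
            + abs(cfg["spark.executor.cores"] - 4)
            + abs(cfg["spark.sql.shuffle.partitions"] - 200) / 50.0)

best, cost = random_search(mock_runtime)
```

In practice the objective would time a real Spark submission, and smarter strategies (bandits, Bayesian optimization) would reuse past evaluations instead of sampling blindly.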

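One hedged way to approach the learning direction: fit a simple surrogate model of runtime against a parameter, then query it at settings that were never actually run, a crude form of counterfactual question. The data, parameter, and model here are all illustrative:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x (stdlib only)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Illustrative observations: (shuffle partitions, measured runtime in s).
observed = [(50, 300.0), (100, 240.0), (200, 180.0), (400, 120.0)]

a, b = fit_line([x for x, _ in observed], [y for _, y in observed])

def predicted_runtime(partitions):
    """Counterfactual query: what if we had run with this setting?"""
    return a + b * partitions
```

A real surrogate would need to be multivariate and nonlinear, but the query pattern — ask the model rather than the cluster — is the same.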
#### Urgent
* Fix `sort` to write to the local filesystem by default. See
  [this Stack Overflow answer](https://stackoverflow.com/questions/27299923/how-to-load-local-file-in-sc-textfile-instead-of-hdfs).
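A minimal sketch of the fix the linked answer suggests: make the output path an explicit `file://` URI so Spark targets the local filesystem rather than the default (e.g., HDFS) filesystem. The helper name and path are illustrative:

```python
from pathlib import Path

def as_local_uri(path):
    """Return a file:// URI that Spark resolves against the local
    filesystem instead of the cluster's default filesystem."""
    return Path(path).absolute().as_uri()

# Illustrative usage inside a Spark job:
#   rdd.saveAsTextFile(as_local_uri("/tmp/sort-output"))
```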
@@ -160,6 +192,10 @@ For a larger set, see
* [sparklens](https://github.com/umayrh/sparklens)
* [dr-elephant](https://github.com/linkedin/dr-elephant)
* [sparklint](https://github.com/groupon/sparklint)
* [sparkMeasure](https://github.com/LucaCanali/sparkMeasure)
* A Methodology for Spark Parameter Tuning [link](delab.csd.auth.gr/papers/BDR2017gt.pdf)

#### YARN

