[FEA] Auto tuning/node recommendation missing features #1152

Open
tgravescs opened this issue Jun 28, 2024 · 1 comment
Labels: feature request, tools


tgravescs commented Jun 28, 2024

Is your feature request related to a problem? Please describe.
This is a list of things I ran into that I think are missing or that could make users' lives easier:

  1. Databricks - specify the number of workers, since spark.executor.instances is not specified there.
  2. EMR - expand the recommendation logic to handle multi-GPU nodes.
  3. EMR - need to specify the number of workers as well, specifically in the case where we go to machines with 1 GPU and the CPU cluster could have been using fewer, larger nodes.
  4. Some of our recommendations combine executors to choose a machine size, but the machines we have available only have a single GPU, so we end up halving the number of executors, which on GPU could have a big performance impact. For instance, the CPU side on EMR has 16 cores and uses 2 executors per node. We will say to use nodes with 32 cores, but that node only has 1 GPU.
  5. When recommending the number of nodes, we currently recommend more nodes when the chosen node type supports fewer GPUs and more nodes are needed to get the same number of executors. We don't handle the opposite case, where we recommend a node type with more GPUs per node, can put more executors on each node, and so would need fewer nodes.
  6. Node types shouldn't be hardcoded.
  7. More Azure node types.
  8. Add driver instance recommendation to the Scala code.
  9. Enhance the Python shape recommendation to include the number of GPUs and the number of nodes.
  10. The Dataproc auth error is not obvious enough when node recommendation fails.
  11. Switch the hardcoded node maps to lookups from JSON files generated from the node types on each CSP.
  12. Replace the hardcoded instance information in Platform.scala with files generated from CSP information.
  13. ...
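
To make items 3 to 5 and 11 concrete, here is a rough sketch of the node-count arithmetic, with node types read from JSON instead of a hardcoded map. It is not the tool's current logic. It assumes one executor per GPU, and the `NodeShape` record, the function names, the JSON layout, and the node names `gpu-1x` and `gpu-4x` are all made up for illustration.

```python
import json
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeShape:
    """Hypothetical record for one node type; not the tool's actual schema."""
    name: str
    cores: int
    gpus: int


def load_node_catalog(json_text: str) -> dict[str, NodeShape]:
    """Parse a JSON array of node types into a name -> NodeShape lookup."""
    return {
        entry["name"]: NodeShape(entry["name"], entry["cores"], entry["gpus"])
        for entry in json.loads(json_text)
    }


def recommend_num_workers(cpu_workers: int,
                          cpu_executors_per_node: int,
                          gpu_shape: NodeShape) -> int:
    """Return the GPU worker count that preserves the CPU executor count.

    Assumption: one executor per GPU, so a node hosts `gpu_shape.gpus`
    executors. The same formula covers both directions: more nodes when
    the GPU shape has fewer GPUs, fewer nodes when it has more.
    """
    if gpu_shape.gpus < 1:
        raise ValueError(f"{gpu_shape.name} has no GPUs")
    total_executors = cpu_workers * cpu_executors_per_node
    return math.ceil(total_executors / gpu_shape.gpus)


# Made-up catalog standing in for a file generated from CSP node listings.
CATALOG_JSON = """
[
  {"name": "gpu-1x", "cores": 32, "gpus": 1},
  {"name": "gpu-4x", "cores": 48, "gpus": 4}
]
"""

catalog = load_node_catalog(CATALOG_JSON)

# CPU cluster: 4 workers with 2 executors each, so 8 executors in total.
print(recommend_num_workers(4, 2, catalog["gpu-1x"]))  # 8 single-GPU nodes
print(recommend_num_workers(4, 2, catalog["gpu-4x"]))  # 2 four-GPU nodes
```

With the single-GPU shape, keeping the worker count at 4 would halve the executors (item 4), so the sketch doubles the node count instead. With the 4-GPU shape, the same formula shrinks the cluster (item 5).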
@tgravescs tgravescs self-assigned this Jul 2, 2024
@tgravescs

This needs to be broken up; @tgravescs to handle that once the list is finalized.
