
[Feature Request] Resume study and avoid OOM when optimizing with Optuna plugin #1679

Open
dianemarquette opened this issue Jun 17, 2021 · 9 comments · May be fixed by #2647
Labels
enhancement Enhancement request help wanted Community help is wanted plugin Plugins related issues wishlist Low priority feature requests

Comments

@dianemarquette

🚀 Feature Request

I would like to be able:

  • to persist my study in order to resume my hyperparameter search from where I left off
  • to set the gc_after_trial parameter of Optuna's study.optimize()

Motivation

Is your feature request related to a problem? Please describe.
I'm always frustrated when my code crashes after 60 trials (out of 100). I suspect an OOM error. Being able to prevent the script from crashing in the first place with gc.collect() would be great. However, at least being able to resume my search from where it stopped would be a game changer.

Pitch

Describe the solution you'd like
I would like to set gc_after_trial to True and a path to store my study parameters after each trial in my Optuna sweeper Hydra config.
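
A minimal sketch of what such overrides could look like (the gc_after_trial key is hypothetical, since the sweeper does not expose it today, and study_name/storage may only exist in newer plugin versions):

hydra.sweeper.study_name=my_study
hydra.sweeper.storage=sqlite:///my_study.db
hydra.sweeper.gc_after_trial=true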

Describe alternatives you've considered
I read Optuna's documentation, but I'm not sure how to make their examples work with Hydra:

Are you willing to open a pull request? (See CONTRIBUTING)
I'm not comfortable enough with the Optuna and Hydra libraries to prepare a pull request.

@dianemarquette dianemarquette added the enhancement Enhancement request label Jun 17, 2021
@omry omry added help wanted Community help is wanted wishlist Low priority feature requests labels Jun 17, 2021
@omry
Collaborator

omry commented Jun 17, 2021

Hi @dianemarquette,
I am open to supporting it, although resuming a study might be harder than it seems (in general, resume is not something supported by any Hydra sweeper right now).

In any case, we do not have the cycles for it, which means this will only happen if someone from the community wants to work on it.

Supporting gc_after_trial seems like it should be straightforward, though.

@dianemarquette
Author

@omry Thanks for your quick reply. Any idea when gc_after_trial could be supported?

@omry
Collaborator

omry commented Jun 18, 2021

It can be supported after someone sends a pull request to add support for it.
As I said, this is not a high priority. You can either wait and hope that someone eventually does it, or you can try to do it yourself.

@dianemarquette
Author

Ok, thanks for the clarification :)

@jieru-hu jieru-hu added the plugin Plugins related issues label Sep 29, 2021
@cgerum
Contributor

cgerum commented Mar 25, 2022

@dianemarquette as of now, resuming trials is somewhat supported by the Optuna sweeper, by setting a storage backend:

hydra.sweeper.study_name=my_trial
hydra.sweeper.storage=sqlite:///my_trial.sqlite

But this will always start the job numbering from scratch and will therefore overwrite the output directories of individual jobs.
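
For example, a multirun invocation using those overrides might look like this (train.py is just a placeholder for your own entry point):

python train.py --multirun hydra.sweeper.study_name=my_trial hydra.sweeper.storage=sqlite:///my_trial.sqlite

Re-running the same command points the sweeper at the existing SQLite-backed study, so previously completed trials are kept rather than lost; the job-numbering caveat above still applies.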

@zhaoedf

zhaoedf commented Aug 2, 2022


What's more, the storage-backend approach above does make it possible to resume the study, but it inevitably launches duplicated runs for some parameter combinations: the sweeper still runs the full trial count (say a grid search with 80 experiments in total will still run 80 times) instead of accounting for the experiments that have already been executed successfully.

@zhaoedf

zhaoedf commented Aug 2, 2022

And yes, gc collect is a feature I want too, because right now, no matter how I set the n_jobs or pre_dispatch params, finished jobs still exist and won't exit until the next group of parallel trials finishes.

@omry
Collaborator

omry commented Sep 14, 2022

Hydra has callbacks which can probably be used for it.
See this.
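
For reference, a minimal sketch of such a callback, assuming Hydra >= 1.1's experimental callback API (the module path my_project.callbacks and the class name are placeholders, and it is untested whether on_job_end fires in the right process for every launcher):

# Sketch: a Hydra callback that forces garbage collection after every job.
import gc
from typing import Any

from hydra.core.utils import JobReturn
from hydra.experimental.callback import Callback
from omegaconf import DictConfig


class GCAfterJobCallback(Callback):
    def on_job_end(self, config: DictConfig, job_return: JobReturn, **kwargs: Any) -> None:
        # Reclaim memory left over from the finished trial before the next one starts.
        gc.collect()

It would then be registered under hydra.callbacks in the primary config, with a _target_ pointing at the class.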

@bablf

bablf commented Oct 21, 2022

As far as I can tell, we only need to add gc_after_trial to the correct config. I did that, but when I run pytest I get two errors because the hydra/sweeper config is not as expected.

You can find my code here. I do not have much experience with pytest, which is why it is hard for me to debug the error.

It would be great if someone could help me 😅

Also, I am not sure how to write a test that actually tests what I coded, since gc.collect() does not return anything. I managed to modify a test, added gc_after_trial, and the config got built correctly. But we would need a test that actually loads a model with CUDA, right?
If we do not want to test it, then the question becomes whether we actually call the Optuna implementation def _optimize() (see) and whether it is enough to add the key to the config.
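
For what it's worth, here is how the flag behaves in plain Optuna, as a reference sketch rather than plugin code (whether the sweeper actually routes through study.optimize is exactly the open question above):

# Plain-Optuna reference: gc_after_trial=True makes Optuna run gc.collect()
# after every trial, which is what helps when trials leak memory (e.g. CUDA models).
import optuna


def objective(trial: optuna.Trial) -> float:
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2


study = optuna.create_study()
study.optimize(objective, n_trials=20, gc_after_trial=True)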

@michelkok michelkok linked a pull request Apr 21, 2023 that will close this issue