
[Feature Request] Resuming sweep #1407

Closed
ashleve opened this issue Feb 19, 2021 · 7 comments
Labels
enhancement Enhancement request

Comments

@ashleve

ashleve commented Feb 19, 2021

🚀 Feature Request

Hi,
is there any way to resume a failed Hydra sweep?
For example, say I run 10 jobs, but job number 6 crashes the whole multirun. Can I somehow resume exactly from job 6?

Describe the solution you'd like
Maybe a parameter could simply be added that lets the user choose which job the sweep should start from?
E.g. I run python train.py --multirun batch_size=32,64,128,256,512 --start_from 2 and it starts from job number 2, which would be batch_size 128. I know in this case we could simply run the sweep for batch_size=128,256,512 instead, but things get a little tricky when there are multiple different parameters to sweep over.
This solution, however, wouldn't be very helpful for resuming sweeps of plugin sweepers like Optuna, since in most cases Optuna needs the history of runs executed so far to decide which parameters to choose next. Is there any chance some other kind of resuming mechanism could be implemented for plugin sweepers?
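To make the request concrete, here is a hypothetical sketch of how a `--start_from N` option could work for a grid sweep: enumerate the cartesian product of the sweep values in launch order and skip the first N combinations. The option name and both helper functions are illustrative; Hydra's basic sweeper has no such flag.

```python
# Hypothetical sketch: mapping a job index back to its parameter
# combination in a grid sweep, so "resume from job N" is well-defined.
from itertools import product, islice

def sweep_combinations(sweep_space):
    """Yield override dicts in the order a grid sweeper would launch them."""
    keys = list(sweep_space)
    for values in product(*(sweep_space[k] for k in keys)):
        yield dict(zip(keys, values))

def resume_from(sweep_space, start_from):
    """Skip jobs 0..start_from-1 and yield the remaining combinations."""
    return islice(sweep_combinations(sweep_space), start_from, None)

space = {"batch_size": [32, 64, 128, 256, 512]}
remaining = list(resume_from(space, 2))
# remaining[0] == {"batch_size": 128}
```

With multiple swept parameters the same index arithmetic applies, which is exactly where redoing the sweep command by hand gets tricky.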

@ashleve ashleve added the enhancement Enhancement request label Feb 19, 2021
@jieru-hu
Contributor

Thanks for the feature request @hobogalaxy!

For now, Hydra saves each job's config overrides in the job's .hydra/overrides.yaml file, with which you should be able to recreate a single job easily.
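As a rough illustration of that workaround, the sketch below reads the recorded overrides back and rebuilds the command line for one job. Hydra writes the overrides as a YAML list of "key=value" strings; the tiny parser here handles only that simple form (a real script would use a YAML library), and the `train.py` script name is just an example.

```python
# Sketch: recreating a single failed job from the overrides Hydra records
# in <job_dir>/.hydra/overrides.yaml (a YAML list of "key=value" strings).
from pathlib import Path

def read_overrides(overrides_file):
    """Minimal parser for a flat YAML list of '- key=value' entries."""
    overrides = []
    for line in Path(overrides_file).read_text().splitlines():
        line = line.strip()
        if line.startswith("- "):
            overrides.append(line[2:].strip("'\""))
    return overrides

def relaunch_command(overrides, script="train.py"):
    """Build the argv that re-runs the single job with the same config."""
    return ["python", script] + overrides
```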

@omry
Collaborator

omry commented Feb 19, 2021

Hi @hobogalaxy.
As you are hinting, this is not trivial, especially not for arbitrary sweeper plugins.

I think individual sweepers can support it, but this is outside of the scope of Hydra itself.
The basic sweeper, which is the built-in sweeper, is meant to be simple, and I don't think it has room for this functionality.

You are welcome to create your own sweeper plugin that will support this.
If it becomes successful we can consider integrating this functionality into the core basic sweeper.
You can start with the Example sweeper plugin.

Regarding resume functionality in HPO sweepers:
Some backends support some kind of store, for example Ax.
I am not sure how this can be integrated with Hydra though. You are welcome to try to prototype something and make a concrete proposal.

I am closing this as it's currently out of scope.
Feel free to hack a plugin that supports it though. If you have any questions you can ask here or in the chat.

@omry omry closed this as completed Feb 19, 2021
@KaleabTessera

@omry I am just confirming: if you want to continue from a failed multirun, there is no way to do this? You need to start from the beginning, even with the built-in sweeper?

This is pretty standard in libraries like Ray.

@Jasha10
Collaborator

Jasha10 commented Dec 13, 2021

Hi @KaleabTessera, yes, this is currently unsupported by Hydra.

This is pretty standard in libraries like Ray.

It seems that Ray uses checkpointing to enable resuming execution after a failed run. Checkpointing is not implemented by Hydra.
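For readers unfamiliar with the idea, here is a generic sketch of checkpoint-style resuming (not a Hydra or Ray feature; the state-file format and function names are made up for illustration): record each finished job in a small state file so a restarted sweep skips work that already completed.

```python
# Generic sketch of sweep checkpointing: persist the set of completed job
# indices so a crashed sweep can be re-run and pick up where it left off.
import json
from pathlib import Path

def run_sweep(jobs, run_job, state_file="sweep_state.json"):
    state = Path(state_file)
    done = set(json.loads(state.read_text())) if state.exists() else set()
    for idx, job in enumerate(jobs):
        if idx in done:
            continue  # this job finished in a previous attempt; skip it
        run_job(job)  # may raise; the state file keeps all earlier progress
        done.add(idx)
        state.write_text(json.dumps(sorted(done)))
```

Re-invoking `run_sweep` with the same state file after a crash re-runs only the failed job and everything after it.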

@michelkok

I only ever use the Optuna sweeper, so I don't know if the following is valid for other sweepers.

When using the Optuna sweeper with the storage option, the optuna.study will actually be resumed (ref).

However, the sweeper resets the job_idx to 0 and n_trials_to_go to the originally configured number (i.e., the number from before the sweep was interrupted).

To me, that seems confusing at best. I would advise replacing this line with the following two lines to 'sync' Hydra with Optuna:

n_trials_to_go = self.n_trials - len(study.trials)
self.job_idx = len(study.trials)
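To spell out the arithmetic behind those two lines: with `self.n_trials` as the configured total and `len(study.trials)` as the trials already recorded in Optuna storage, the sweeper should launch only the remainder and continue job numbering where it left off. The standalone helper below is just an illustration of that computation, not code from the plugin.

```python
# Illustrative sketch of the proposed sync between Hydra's counters and
# the trials already present in Optuna storage after a resume.
def sync_with_storage(n_trials_configured, trials_already_done):
    n_trials_to_go = n_trials_configured - trials_already_done
    job_idx = trials_already_done  # next job continues the numbering
    return n_trials_to_go, job_idx

# e.g. 10 trials configured, 6 already in storage
# -> 4 trials to go, next job index 6
```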

I think this also solves the unsolved part of the issue in #1679.

Shall I create a separate issue for this or propose a PR? Although I suppose it should be refactored for other sweepers as well.

@odelalleau
Collaborator

Shall I create a separate issue for this or propose a PR? Although I suppose it should be refactored for other sweepers as well.

This is Optuna-specific so implementing it for other sweepers would be out of scope. If this resolves #1679 then you can directly propose a PR, while if it solves a different problem it may be best to first create a new specific issue.

@michelkok

Okay, thanks, it's in #2647.
