
[Feature Request] Resuming sweep #1407

Closed
ashleve opened this issue Feb 19, 2021 · 7 comments
Labels
enhancement Enhancement request

Comments

@ashleve

ashleve commented Feb 19, 2021

🚀 Feature Request

Hi,
is there any way to resume a failed Hydra sweep?
For example, say I run 10 jobs, but job number 6 crashes the whole multirun. Can I somehow resume exactly from job 6?

Describe the solution you'd like
Maybe a parameter could simply be added that lets the user choose which job the sweep should start from?
E.g. I run python train.py --multirun batch_size=32,64,128,256,512 --start_from 2 and it starts from job number 2, which would be batch_size 128. I know in this case we could simply run the sweep for batch_size=128,256,512 instead, but things get a little tricky when there are multiple different parameters to sweep over.
This solution, however, wouldn't be very helpful for resuming sweeps of plugin sweepers like Optuna, since in most cases Optuna needs the history of runs executed so far to decide which parameters to choose next. Is there any chance some other kind of resuming mechanism could be implemented for plugin sweepers?
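To make the request concrete, here is a hypothetical sketch of how a `--start_from N` option could work for a grid sweep: enumerate the cartesian product of the sweep values in launch order and skip the first N combinations. The option name and both helper functions are illustrative; Hydra's basic sweeper has no such flag.

```python
# Hypothetical sketch: mapping a job index back to its parameter
# combination in a grid sweep, so "resume from job N" is well-defined.
from itertools import product, islice

def sweep_combinations(sweep_space):
    """Yield override dicts in the order a grid sweeper would launch them."""
    keys = list(sweep_space)
    for values in product(*(sweep_space[k] for k in keys)):
        yield dict(zip(keys, values))

def resume_from(sweep_space, start_from):
    """Skip jobs 0..start_from-1 and yield the remaining combinations."""
    return islice(sweep_combinations(sweep_space), start_from, None)

space = {"batch_size": [32, 64, 128, 256, 512]}
remaining = list(resume_from(space, 2))
# remaining[0] == {"batch_size": 128}
```

With multiple swept parameters the same index arithmetic applies, which is exactly where redoing the sweep command by hand gets tricky.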

@ashleve ashleve added the enhancement Enhancement request label Feb 19, 2021
@jieru-hu
Contributor

Thanks for the feature request @hobogalaxy!

For now, Hydra saves each job's config overrides in the job's .hydra/overrides.yaml file, with which you should be able to recreate a single job easily.
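As a rough illustration of that workaround, the sketch below reads the recorded overrides back and rebuilds the command line for one job. Hydra writes the overrides as a YAML list of "key=value" strings; the tiny parser here handles only that simple form (a real script would use a YAML library), and the `train.py` script name is just an example.

```python
# Sketch: recreating a single failed job from the overrides Hydra records
# in <job_dir>/.hydra/overrides.yaml (a YAML list of "key=value" strings).
from pathlib import Path

def read_overrides(overrides_file):
    """Minimal parser for a flat YAML list of '- key=value' entries."""
    overrides = []
    for line in Path(overrides_file).read_text().splitlines():
        line = line.strip()
        if line.startswith("- "):
            overrides.append(line[2:].strip("'\""))
    return overrides

def relaunch_command(overrides, script="train.py"):
    """Build the argv that re-runs the single job with the same config."""
    return ["python", script] + overrides
```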

@omry
Collaborator

omry commented Feb 19, 2021

Hi @hobogalaxy.
As you are hinting, this is not trivial, especially not for arbitrary sweeper plugins.

I think individual sweepers can support it, but this is outside of the scope of Hydra itself.
The basic sweeper, which is the built-in sweeper, is meant to be simple, and I don't think it has room for this functionality.

You are welcome to create your own sweeper plugin that will support this.
If it becomes successful we can consider integrating this functionality into the core basic sweeper.
You can start with the Example sweeper plugin.

Regarding resume functionality in HPO sweepers:
Some backends support some kind of store, for example Ax.
I am not sure how this can be integrated with Hydra though. You are welcome to try to prototype something and make a concrete proposal.

I am closing this as it's currently out of scope.
Feel free to hack a plugin that supports it though. If you have any questions you can ask here or in the chat.

@omry omry closed this as completed Feb 19, 2021
@KaleabTessera

@omry I am just confirming: if you want to continue from a failed multirun, there is no way to do this? You need to start from the beginning, even with the built-in sweeper?

This is pretty standard in libraries like Ray.

@Jasha10
Collaborator

Jasha10 commented Dec 13, 2021

Hi @KaleabTessera, yes, this is currently unsupported by Hydra.

This is pretty standard in libraries like Ray.

It seems that Ray uses checkpointing to enable resuming execution after a failed run. Checkpointing is not implemented by Hydra.
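For readers unfamiliar with the idea, here is a generic sketch of checkpoint-style resuming (not a Hydra or Ray feature; the state-file format and function names are made up for illustration): record each finished job in a small state file so a restarted sweep skips work that already completed.

```python
# Generic sketch of sweep checkpointing: persist the set of completed job
# indices so a crashed sweep can be re-run and pick up where it left off.
import json
from pathlib import Path

def run_sweep(jobs, run_job, state_file="sweep_state.json"):
    state = Path(state_file)
    done = set(json.loads(state.read_text())) if state.exists() else set()
    for idx, job in enumerate(jobs):
        if idx in done:
            continue  # this job finished in a previous attempt; skip it
        run_job(job)  # may raise; the state file keeps all earlier progress
        done.add(idx)
        state.write_text(json.dumps(sorted(done)))
```

Re-invoking `run_sweep` with the same state file after a crash re-runs only the failed job and everything after it.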

@michelkok

I only ever use the Optuna sweeper, so I don't know if the following is valid for other sweepers.

When using the Optuna sweeper with the storage option, the optuna.study will actually be resumed (ref).

However, the sweeper resets the job_idx to 0 and n_trials_to_go to the originally configured number (i.e., the number from before the sweep was interrupted).

To me, that seems confusing at best. I would advise replacing this line with the following two lines to 'sync' Hydra with Optuna:

n_trials_to_go = self.n_trials - len(study.trials)
self.job_idx = len(study.trials)
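To spell out the arithmetic behind those two lines: with `self.n_trials` as the configured total and `len(study.trials)` as the trials already recorded in Optuna storage, the sweeper should launch only the remainder and continue job numbering where it left off. The standalone helper below is just an illustration of that computation, not code from the plugin.

```python
# Illustrative sketch of the proposed sync between Hydra's counters and
# the trials already present in Optuna storage after a resume.
def sync_with_storage(n_trials_configured, trials_already_done):
    n_trials_to_go = n_trials_configured - trials_already_done
    job_idx = trials_already_done  # next job continues the numbering
    return n_trials_to_go, job_idx

# e.g. 10 trials configured, 6 already in storage
# -> 4 trials to go, next job index 6
```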

I think this also solves the unsolved part of the issue in #1679.

Shall I create a separate issue for this or propose a PR? Although I suppose it should be refactored for other sweepers as well.

@odelalleau
Collaborator

Shall I create a separate issue for this or propose a PR? Although I suppose it should be refactored for other sweepers as well.

This is Optuna-specific so implementing it for other sweepers would be out of scope. If this resolves #1679 then you can directly propose a PR, while if it solves a different problem it may be best to first create a new specific issue.

@michelkok

Okay, thanks, it's in #2647.
