[GENERAL SUPPORT]: scheduler + batch question #3247
Comments
Hello there! I was unable to run your repro successfully; it just hangs waiting for trial completion. Based on your description of the issue, however, it seems that the scheduler is expecting data that doesn't exist yet for your running trial. The message you're seeing, along with the trial being marked as failed, comes from Lines 2043 to 2060 in 3c68bd3
If you could provide a smaller repro for the issue, I can take a deeper look. Otherwise I'd try one of the following:
Hey, thanks for the suggestions. I tried overwriting the I could get to run by keeping the
But again, I don't get why this is necessary. As for providing a smaller repro, I am not sure what you mean by that.
I see. We recently introduced the concept of recoverable errors for metrics in #3262, for cases similar to these where we don't want to fail a trial just because a metric failed to fetch once. It just requires adding the exception type you're seeing in the Scheduler to This will be available in the next Ax release.
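To illustrate the general idea of a recoverable fetch error, here is a minimal standard-library sketch. It is not the Ax implementation from #3262: the names `RECOVERABLE` and `fetch_or_defer`, and the choice of `FileNotFoundError` as the "try again later" signal, are assumptions made for this example.

```python
from typing import Callable, Tuple, Type

# Assumption for this sketch: a missing results file means "not ready yet",
# so it is treated as recoverable instead of as a failure.
RECOVERABLE: Tuple[Type[BaseException], ...] = (FileNotFoundError,)


def fetch_or_defer(fetch: Callable[[], float]) -> Tuple[str, object]:
    """Call `fetch` and classify the outcome.

    Returns ("OK", value) on success, ("RETRY", error) when the error type
    is listed in RECOVERABLE, and ("FAILED", error) for anything else.
    """
    try:
        return "OK", fetch()
    except RECOVERABLE as err:
        # The caller should poll again later rather than fail the trial.
        return "RETRY", err
    except Exception as err:
        # Any other error is treated as a genuine failure.
        return "FAILED", err
```

With this split, a poll that arrives before the result has been written produces "RETRY" and the trial stays alive, while an unexpected error still fails it.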
Thanks for the help.
We'll do a maintenance release soon - likely within the next week or so but no promises :)
Great! I will retest all this after the release and close the issue if everything works.
Question
I am trying to write a scheduler that runs some simulations in batches but still uses AxClient, and I cannot make it work properly.
I modified the Scheduler tutorial to run the jobs using a multiprocessing pool and write the results to a file in a tmp folder, from which I can later fetch them.
I changed my code to use the Branin function so that other people can test it (see code below).
Basically, when I set init_seconds_between_polls in
options=SchedulerOptions(run_trials_in_batches=True, init_seconds_between_polls=0.1, trial_type=TrialType.BATCH_TRIAL, batch_size=4),
to a small value (i.e. shorter than the run time of most processes), it fails with the following:
Scheduler: MetricFetchE INFO: Because branin is an objective, marking trial 19 as TrialStatus.FAILED.
which I guess comes from #L2025, and I don't understand why this happens.
If I set init_seconds_between_polls=4 to ensure that it polls only after they are all done, then things seem to work. I don't understand what I am missing.
In my real-life case, I don't really know how long I need to wait before polling, so I expected that setting init_seconds_between_polls to a small value would just check often whether all jobs in the batch are done and then proceed, but that is not what happens...
Please provide any relevant code snippet if applicable.
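A minimal sketch (not the code from this report) of the polling behavior described above: a batch counts as finished only once every arm has written its result file, so polling more often than the jobs finish is harmless. It uses only the standard library. The file layout and the names `result_path`, `write_result`, and `batch_status` are hypothetical and not part of Ax.

```python
import json
from pathlib import Path
from typing import Sequence


def result_path(results_dir: Path, trial_index: int, arm_name: str) -> Path:
    """Hypothetical layout: one JSON file per arm of a batch trial."""
    return results_dir / f"trial_{trial_index}_{arm_name}.json"


def write_result(results_dir: Path, trial_index: int, arm_name: str, value: float) -> None:
    """Worker side: write to a temporary name, then rename.

    The rename makes the final file appear all at once, so a concurrent
    poll never sees a partially written result.
    """
    final = result_path(results_dir, trial_index, arm_name)
    tmp = final.with_suffix(".tmp")
    tmp.write_text(json.dumps({"branin": value}))
    tmp.replace(final)


def batch_status(results_dir: Path, trial_index: int, arm_names: Sequence[str]) -> str:
    """Poller side: report "COMPLETED" only when every arm has a result file."""
    missing = [
        name for name in arm_names
        if not result_path(results_dir, trial_index, name).exists()
    ]
    return "RUNNING" if missing else "COMPLETED"
```

If the status check reports the trial as completed while some arms are still running, the subsequent data fetch has nothing to read, which would match the failure seen with a short polling interval.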