Please describe the feature you have in mind and explain what the current shortcomings are?
Since #15 the full job won't fail if the GlobalJobPreLoad has an error during the AYON environment injection.
However, there may very well be cases where we know that retrying won't be feasible, as reported here.
For example:
```python
if ayon_publish_job == "1" and ayon_render_job == "1":
    raise RuntimeError(
        "Misconfiguration. Job couldn't be both render and publish."
    )
```
Or if the AyonExecutable is not configured at all.
```python
if not exe_list:
    raise RuntimeError(
        "Path to AYON executable not configured."
        "Please set it in Ayon Deadline Plugin."
    )
```
This will always fail, since it's set on the job or the Deadline plugin and will give the same result on all machines. So it may make sense to fail the job then?
It may also make sense to always fail, since it should behave quite similarly across the workers/machines?
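If we do go that route, here is a minimal sketch of what failing the job from `GlobalJobPreLoad` could look like. It assumes the Deadline scripting API (`RepositoryUtils.FailJob`) and the usual `deadlinePlugin` handle; the `check_job_configuration` helper and the environment key names are placeholders for illustration, not the actual implementation.

```python
# Minimal sketch: fail the whole job when the configuration error would be
# identical on every machine. Assumes GlobalJobPreLoad's deadlinePlugin handle
# and Deadline's RepositoryUtils.FailJob; the helper name and environment key
# names are assumptions for illustration.
from Deadline.Scripting import RepositoryUtils


def check_job_configuration(deadlinePlugin):
    job = deadlinePlugin.GetJob()
    ayon_publish_job = job.GetJobEnvironmentKeyValue("AYON_PUBLISH_JOB") or "0"
    ayon_render_job = job.GetJobEnvironmentKeyValue("AYON_RENDER_JOB") or "0"

    if ayon_publish_job == "1" and ayon_render_job == "1":
        # Retrying on another worker would give the same result, so fail the job.
        RepositoryUtils.FailJob(job)
        raise RuntimeError(
            "Misconfiguration. Job couldn't be both render and publish."
        )
```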
There are also cases where it may make sense to directly mark the Worker as bad for the job.
For example this:
```python
exe_list = get_ayon_executable()
exe = FileUtils.SearchFileList(exe_list)
if not exe:
    raise RuntimeError((
        "Ayon executable was not found in the semicolon "
        "separated list \"{}\"."
        "The path to the render executable can be configured"
        " from the Plugin Configuration in the Deadline Monitor."
    ).format(exe_list))
```
This may fail per worker, depending on whether the executable can be found at any of the listed paths on that machine.
There is a high likelihood that the machine won't find it on the next run either?
So we could mark the worker "bad" for the job? Using `RepositoryUtils.AddBadSlaveForJob`...
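A minimal sketch of that idea, assuming `RepositoryUtils.AddBadSlaveForJob`, `FileUtils.SearchFileList` and `deadlinePlugin.GetSlaveName()` behave as in the Deadline scripting API; `resolve_ayon_executable` is a hypothetical helper name used for illustration.

```python
# Minimal sketch, assuming the Deadline scripting API and the deadlinePlugin
# handle from GlobalJobPreLoad. resolve_ayon_executable is a hypothetical
# helper name, not the existing code.
from Deadline.Scripting import FileUtils, RepositoryUtils


def resolve_ayon_executable(deadlinePlugin, exe_list):
    exe = FileUtils.SearchFileList(exe_list)
    if not exe:
        # This worker is unlikely to find the executable on the next try either,
        # so mark it bad for this job and let another worker pick it up.
        RepositoryUtils.AddBadSlaveForJob(
            deadlinePlugin.GetSlaveName(), deadlinePlugin.GetJob()
        )
        raise RuntimeError(
            "Ayon executable was not found in the semicolon "
            "separated list \"{}\".".format(exe_list)
        )
    return exe
```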
How would you imagine the implementation of the feature?
For example, raising a dedicated error for when we should fail the job:
```python
class AYONJobConfigurationError(RuntimeError):
    """An error of which we know, when raised, that the full job should fail
    and retrying by other machines will be worthless.

    This may be the case if e.g. the env vars required to inject the AYON
    environment are not fully configured.
    """
```
Or a dedicated error when we should mark the Worker as bad:
```python
class AYONWorkerBadForJobError(RuntimeError):
    """When raised, the worker will be marked bad for the current job.

    This should be raised when we know that the machine will most likely
    also fail on subsequent tries.
    """
```
However, a server timeout should just let the job error and requeue with the same worker, so it can try again.
So errors attributed to not being able to access the server itself should not generate such a hard failure.
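Tying this together, here is a minimal sketch of how the preload entry point could map the dedicated errors onto Deadline actions while letting everything else (e.g. server timeouts) propagate and requeue as usual. It assumes the exception classes proposed above and the Deadline scripting API; `inject_ayon_environment` is just a placeholder for the existing injection logic.

```python
# Minimal sketch, not the actual implementation. Assumes the exception classes
# proposed above and the Deadline scripting API; inject_ayon_environment is a
# placeholder for the existing AYON environment injection.
from Deadline.Scripting import RepositoryUtils


def __main__(deadlinePlugin):
    job = deadlinePlugin.GetJob()
    try:
        inject_ayon_environment(deadlinePlugin)
    except AYONJobConfigurationError:
        # Misconfiguration that is identical on all machines: fail the whole job.
        RepositoryUtils.FailJob(job)
        raise
    except AYONWorkerBadForJobError:
        # This machine will most likely fail again: mark it bad for this job only.
        RepositoryUtils.AddBadSlaveForJob(deadlinePlugin.GetSlaveName(), job)
        raise
    # Any other exception (e.g. a server timeout) propagates unchanged, so the
    # task just errors and can requeue on the same worker.
```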
Are there any labels you wish to add?
I have added the relevant labels to the enhancement request.
Describe alternatives you've considered:
Just leave it completely up to the Deadline settings for 'monitoring failures' instead of forcing a behavior onto it. Yet at the same time, we do want to avoid many machines trying many times if we know early on that all would fail regardless.
Additional context:
No response