Reportedly, some VM jobs (and possibly others) get in a "stuck" state where they
don't make progress: no fraction done change, and little CPU usage.
These jobs will eventually be aborted when their elapsed time reaches the rsc_fpops_bound limit,
but this could take weeks or months depending on the limit.
Proposal: have the client try to figure out when a job is stuck.
ACTIVE_TASK new fields:
    double stuck_check_elapsed_time
    double stuck_check_fraction_done
    double stuck_check_cpu_time
    (initialize all to zero)

STUCK_CHECK_POLL_PERIOD = 3600

every STUCK_CHECK_POLL_PERIOD seconds
    for each active task atp
        if non_cpu_intensive: continue
        if sporadic: continue
        if atp->stuck_check_elapsed_time == 0
            atp->stuck_check_elapsed_time = atp->elapsed_time
            atp->stuck_check_fraction_done = atp->fraction_done
            atp->stuck_check_cpu_time = atp->current_cpu_time
            continue
        if atp->elapsed_time < atp->stuck_check_elapsed_time + STUCK_CHECK_POLL_PERIOD
            continue
        if atp->stuck_check_fraction_done == atp->fraction_done
            && (atp->current_cpu_time - atp->stuck_check_cpu_time < 10)
            (job is stuck - print warning)
        atp->stuck_check_elapsed_time = atp->elapsed_time
        atp->stuck_check_fraction_done = atp->fraction_done
        atp->stuck_check_cpu_time = atp->current_cpu_time
I.e. a job is considered stuck if, in the last hour of running,
the fraction done hasn't changed and the incremental CPU time is < 10s.
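For anyone picking this up, the per-task check could be sketched in C++ roughly as follows. This is only a sketch under assumptions: `ActiveTask` is a hypothetical, simplified stand-in for the client's ACTIVE_TASK (only the fields named in the proposal are modeled), and `STUCK_CHECK_CPU_THRESHOLD`, `take_snapshot()` and `stuck_check()` are illustrative names, not existing client code.

```cpp
// Hypothetical stand-in for ACTIVE_TASK; the real class has many more members.
struct ActiveTask {
    bool non_cpu_intensive = false;
    bool sporadic = false;
    double elapsed_time = 0;      // seconds the job has been running
    double fraction_done = 0;
    double current_cpu_time = 0;  // CPU seconds used so far
    // New fields from the proposal, all initialized to zero.
    double stuck_check_elapsed_time = 0;
    double stuck_check_fraction_done = 0;
    double stuck_check_cpu_time = 0;
};

constexpr double STUCK_CHECK_POLL_PERIOD = 3600;   // seconds, from the proposal
constexpr double STUCK_CHECK_CPU_THRESHOLD = 10;   // seconds; name is an assumption

// Record the task's current state as the baseline for the next check.
static void take_snapshot(ActiveTask& t) {
    t.stuck_check_elapsed_time = t.elapsed_time;
    t.stuck_check_fraction_done = t.fraction_done;
    t.stuck_check_cpu_time = t.current_cpu_time;
}

// Intended to be called for each active task every STUCK_CHECK_POLL_PERIOD
// seconds. Returns true if the task looks stuck since the last snapshot.
bool stuck_check(ActiveTask& t) {
    if (t.non_cpu_intensive || t.sporadic) return false;

    // First check for this task: just take a baseline.
    if (t.stuck_check_elapsed_time == 0) {
        take_snapshot(t);
        return false;
    }

    // The task hasn't run for a full period since the snapshot; wait.
    if (t.elapsed_time < t.stuck_check_elapsed_time + STUCK_CHECK_POLL_PERIOD) {
        return false;
    }

    // Stuck: fraction done unchanged and almost no CPU time used.
    bool stuck =
        t.fraction_done == t.stuck_check_fraction_done &&
        (t.current_cpu_time - t.stuck_check_cpu_time < STUCK_CHECK_CPU_THRESHOLD);

    take_snapshot(t);
    return stuck;
}
```

Returning a bool rather than printing inside the function keeps the detection logic separate from whatever the client does with the result (a warning now, possibly an abort later).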
At that point, the client could
1) notify the user, suggesting that they abort the job
2) abort the job
Let's do 1) for starters, to make sure that the logic is right,
then at some point do 2).
I'm Franke Tang, a graduate student currently taking a Distributed Computing course, and part of my final project encourages us to contribute to open issues on GitHub relating to distributed systems. I would like to work on this issue if this has not been implemented yet.
Hello, sorry for the late follow-up; I was working on PRs in other repos. Looking through the code, would app.cpp be a good place to start on this issue?