
Batch acceleration

David Anderson edited this page Oct 30, 2025 · 3 revisions

The batch completion problem

A typical app might have jobs that take about 1 hour of CPU time on average. But the turnaround time on a particular host H may be much higher because:

  • H has a large work buffer, and other jobs must complete before this one starts
  • H computes only sporadically
  • H is slower than average

As a result, turnaround time on H could be several days, and the 'max delay' setting for the app may need to be, say, 1 week.

If there's a large batch - say 1000 jobs - some of them will get sent to hosts that never complete them. After a week these jobs time out and we resend them to other hosts. But some of those hosts may also never complete them, or may complete them only after a long turnaround time.

As a result, the 'makespan' of the batch - the time from submission to 100% completion - may be several weeks.
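To see why timeouts dominate makespan, here is a rough back-of-envelope sketch in Python. It assumes (hypothetically) that each dispatch of a job independently lands on a never-completing host with probability p_never; the function name and the 5% figure are illustrative, not measured values.

```python
def timeout_rounds(n_jobs: int, p_never: float) -> int:
    """Number of full 'max delay' timeouts a batch is expected to wait through.

    Assumption (not from any real project data): every dispatch of a job
    independently reaches a host that never completes it with probability
    p_never. We count rounds until the expected number of stragglers is < 1.
    """
    rounds = 0
    outstanding = n_jobs * p_never      # expected stragglers after first dispatch
    while outstanding >= 1.0:
        rounds += 1                     # wait one full timeout, then resend
        outstanding *= p_never          # some resent jobs straggle again
    return rounds


MAX_DELAY_DAYS = 7                      # the 1-week 'max delay' from the text
rounds = timeout_rounds(1000, 0.05)
print(rounds, "timeouts, at least", rounds * MAX_DELAY_DAYS, "days")
```

Under these assumptions, a 1000-job batch has about 50 expected stragglers after the first dispatch and 2.5 after the second, so it waits through two full 1-week timeouts before the last jobs are even resent.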

We'd like to reduce batch makespan using scheduling techniques; we call this 'batch acceleration'.

Proposed design

Our goal is to reduce makespan with minimal complexity. We're not concerned with the performance of the mechanism itself; current projects have a few thousand hosts, not millions.

The basic idea: mark certain hosts as 'low turnaround' (for particular app versions). Mark the last 10% or so of each batch as 'high priority'. Use low-turnaround hosts to run high-priority jobs.

This involves the following components:

  • Scheduler:
    • generate a 'job log' file, with a line for each reported or timed-out job;
    • enforce the above scheduling rule.
  • batch_stats.php (new): periodically scan the job log file, compute statistics, and mark hosts as low-turnaround.
  • batch_accel.php: periodically scan in-progress batches, identifying those that need acceleration. Mark jobs as high-priority, and possibly create new instances.

batch_stats.php

Runs every hour or so.

Skip (and delete) job log entries older than, say, 2 weeks.

First pass:

  • Group entries by batch.
  • For each batch in which at least 50% of the jobs succeeded, compute the mean turnaround time (TT) of the successful jobs.

Second pass:

  • For each job in such a batch, its 'TT ratio' is TT/mean if the job succeeded, and a fixed penalty of about 10 if not.

For a host H and app version AV, the 'recent average TT ratio' RATTR(H, AV) is the average of TT ratios of its jobs using AV (over all batches).

Clear the LTT (low turnaround time) flag in all HAVs (host/app version pairs).

Set the LTT flag for each HAV whose RATTR(H, AV) is in the lowest 20% of values and is < 1.
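The two passes and the flagging rule could be sketched as follows. This is a Python illustration only, not the planned PHP code; the LogEntry record and its field names are hypothetical, since the job log format isn't specified here, and the "lowest 20%" is interpreted by rank with the count rounded up.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

FAILURE_TT_RATIO = 10.0   # the "~10" penalty for jobs that did not succeed
LTT_PERCENT = 20          # flag the lowest 20% of RATTR values


@dataclass
class LogEntry:           # hypothetical job log record
    batch_id: int
    host_id: int
    app_version_id: int
    turnaround: float     # seconds; ignored when success is False
    success: bool


def low_turnaround_havs(entries):
    """Return the set of (host_id, app_version_id) pairs to flag as LTT."""
    by_batch = defaultdict(list)
    for e in entries:
        by_batch[e.batch_id].append(e)

    # First pass: mean TT of successful jobs, for batches with >= 50% success.
    batch_mean = {}
    for batch_id, jobs in by_batch.items():
        ok = [j.turnaround for j in jobs if j.success]
        if ok and 2 * len(ok) >= len(jobs):
            batch_mean[batch_id] = mean(ok)

    # Second pass: collect TT ratios per (host, app version), over all batches.
    ratios = defaultdict(list)
    for e in entries:
        m = batch_mean.get(e.batch_id)
        if m is None or m <= 0:
            continue                      # batch skipped in the first pass
        ratio = e.turnaround / m if e.success else FAILURE_TT_RATIO
        ratios[(e.host_id, e.app_version_id)].append(ratio)

    rattr = {hav: mean(rs) for hav, rs in ratios.items()}
    ranked = sorted(rattr, key=rattr.get)                 # lowest RATTR first
    n_low = (len(ranked) * LTT_PERCENT + 99) // 100       # 20%, rounded up
    return {hav for hav in ranked[:n_low] if rattr[hav] < 1.0}
```

Everything not returned by the function corresponds to an HAV whose flag stays cleared, which matches the clear-then-set order above.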

Scheduler

Write job log entries for reported and timed-out jobs.

Job selection:

if a job is high priority
    if we're using a LTT HAV
        boost its score
    else
        lower its score, and don't send at all unless job has been in shmem > 20 min
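A minimal Python sketch of this rule, for illustration only. The function name, the additive boost and penalty, and the use of None to mean "don't send" are assumptions; the real scheduler's scoring code is not shown here.

```python
HIGH_PRIO_MIN_SHMEM_AGE = 20 * 60   # seconds; the 20-minute threshold above


def adjust_score(score, high_priority, hav_is_ltt, shmem_age,
                 boost=1.0, penalty=1.0):
    """Return the adjusted score, or None if the job must not be sent.

    boost and penalty are placeholder magnitudes (assumptions); shmem_age is
    how long the job has been in shared memory, in seconds.
    """
    if not high_priority:
        return score                     # normal jobs are unaffected
    if hav_is_ltt:
        return score + boost             # prefer low-turnaround hosts
    if shmem_age <= HIGH_PRIO_MIN_SHMEM_AGE:
        return None                      # hold the job for an LTT host
    return score - penalty               # waited long enough: allow, reluctantly


print(adjust_score(5.0, True, False, shmem_age=600))    # held back
print(adjust_score(5.0, True, False, shmem_age=1500))   # sent with lower score
```

The 20-minute escape hatch keeps a high-priority job from sitting in shmem indefinitely when no LTT host requests work.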

batch_accel.php

Runs every hour or so.

For each in-progress batch B that's at least 90% complete:
    For each uncompleted WU
        mark WU as high priority
        mark its unsent results as high priority
        if no unsent results and in-prog results are older than average TT
            increment wu.target_nresults, schedule transition
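The loop above might look like this in Python. The Result and Workunit classes are simplified, hypothetical stand-ins for the database records (only target_nresults is a name taken from the text), and the sketch assumes that "older than average TT" must hold for every in-progress result and that at least one such result exists.

```python
from dataclasses import dataclass, field


@dataclass
class Result:                        # hypothetical stand-in for a job instance
    state: str                       # "unsent", "in_progress" or "done"
    age: float = 0.0                 # seconds since the instance was sent
    high_priority: bool = False


@dataclass
class Workunit:                      # hypothetical stand-in for a WU record
    completed: bool
    results: list = field(default_factory=list)
    target_nresults: int = 1
    high_priority: bool = False
    needs_transition: bool = False


def accelerate_batch(wus, mean_tt, threshold=0.9):
    """Apply the acceleration rule to one batch; return True if it applied."""
    if not wus:
        return False
    done = sum(1 for wu in wus if wu.completed)
    if done < threshold * len(wus):
        return False                 # batch is not yet 90% complete
    for wu in wus:
        if wu.completed:
            continue
        wu.high_priority = True
        unsent = [r for r in wu.results if r.state == "unsent"]
        for r in unsent:
            r.high_priority = True
        in_prog = [r for r in wu.results if r.state == "in_progress"]
        if not unsent and in_prog and all(r.age > mean_tt for r in in_prog):
            wu.target_nresults += 1      # ask for one more instance
            wu.needs_transition = True   # so the new instance gets created
    return True
```

Creating an extra instance only when nothing is unsent avoids piling up duplicates for a WU that already has an instance waiting to be dispatched.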

feeder

Ideally, we want both high- and low-priority jobs in shmem. That way we'll have work for both LTT and non-LTT hosts.

But this conflicts with e.g. dividing slots between apps.

So it's probably better to enumerate by priority, then by creation time.
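As an illustration, that enumeration order amounts to a two-key sort. This Python sketch assumes that a larger priority value means higher priority and that older jobs come first within a priority level; the dictionary keys are hypothetical.

```python
def feeder_order(results):
    """Order candidate results: highest priority first, then oldest first."""
    return sorted(results, key=lambda r: (-r["priority"], r["create_time"]))


candidates = [
    {"id": 1, "priority": 0, "create_time": 100},
    {"id": 2, "priority": 1, "create_time": 300},
    {"id": 3, "priority": 1, "create_time": 200},
    {"id": 4, "priority": 0, "create_time": 50},
]
print([r["id"] for r in feeder_order(candidates)])
```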

If we use the above scheduling policy, we may sometimes fill shmem with high-prio jobs. If only non-LTT hosts then request work, we won't send anything for 20 min.

In general: we want to have high-prio results only if we have a reasonable number of LTT hosts for that app.
