Adaptive Replication
BOINC's default replication policy replicates every job, even when one of the hosts it is sent to is known to be highly reliable. The overhead of replication is high: since each job runs at least twice, at least 50% of total CPU time is spent checking result validity.
Adaptive replication is an optional policy that avoids replicating jobs that are sent to highly reliable hosts. The goal of this policy is to provide a target level of confidence with minimal overhead - perhaps only 5% or 10% of total CPU time.
For each host H and app version V, BOINC maintains a count CV(H, V) of consecutive valid results returned by H using V. The count is incremented when a replicated job computed with (H, V) is validated, and reset to zero when such a job is found to be invalid. (V is included because, for example, some hosts may be less reliable for GPU jobs than for CPU jobs.)
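The bookkeeping for CV(H, V) can be sketched as follows. This is a minimal illustration, not BOINC's actual code: the class name `ConsecutiveValidTracker`, its methods, and the integer IDs are hypothetical, and an in-memory dictionary stands in for the project database.

```python
class ConsecutiveValidTracker:
    """Tracks CV(H, V): consecutive valid results per (host, app version).

    Hypothetical helper for illustration only; BOINC keeps this state
    in its database rather than in memory.
    """

    def __init__(self):
        # Maps (host_id, app_version_id) -> consecutive valid count.
        self._counts = {}

    def record_validation(self, host_id, app_version_id, valid):
        """Update the count after a replicated job has been checked."""
        key = (host_id, app_version_id)
        if valid:
            # A validated result extends the streak by one.
            self._counts[key] = self._counts.get(key, 0) + 1
        else:
            # An invalid result resets the streak to zero.
            self._counts[key] = 0

    def cv(self, host_id, app_version_id):
        """Return CV(H, V); pairs never seen before have a count of 0."""
        return self._counts.get((host_id, app_version_id), 0)
```

Because the key includes the app version, a host's streak for a CPU version is unaffected by an invalid result from a GPU version on the same host.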
The adaptive replication policy is as follows.
- Each job is initially marked as unreplicated.
- When sending a job using app version V, the scheduler decides whether to trust the host as follows:
  - If CV(H, V) < 10, don't trust the host.
  - Otherwise, trust the host randomly with probability 1 - 1/CV(H, V).
- If we decide to trust the host, preferentially send it unreplicated jobs.
- Otherwise, preferentially send it replicated jobs. If we have to send it an unreplicated job, mark it as replicated and create new instances accordingly.
- The validator can mark the result of an unreplicated job as suspicious (inconclusive) if, after inspecting that first result, it determines that a second result should be generated.
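The trust decision in the policy above can be sketched as follows. The function names, the `MIN_CONSECUTIVE_VALID` constant, and the injectable `rng` parameter are assumptions made for illustration; only the threshold of 10 and the probability 1 - 1/CV(H, V) come from the policy itself.

```python
import random

# Threshold from the policy: below this count the host is never trusted.
MIN_CONSECUTIVE_VALID = 10


def trust_probability(cv):
    """Probability of trusting a host whose CV(H, V) equals cv."""
    if cv < MIN_CONSECUTIVE_VALID:
        return 0.0
    return 1.0 - 1.0 / cv


def should_trust(cv, rng=random.random):
    """Randomly decide whether to trust the host for this job.

    rng is a hypothetical hook returning a float in [0, 1); it is
    injectable so the decision can be made deterministic in tests.
    """
    return rng() < trust_probability(cv)
```

Under this sketch, a host with CV(H, V) = 10 is trusted with probability 0.9, and one with CV(H, V) = 100 with probability 0.99, so the share of a host's jobs that still get spot-checked shrinks as its streak of valid results grows but never reaches zero.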
To use adaptive replication for a given app:
- Set app.target_nresults to 2 in the database.
- Create jobs with target_nresults=1 and min_quorum=1; i.e. include
<target_nresults>1</target_nresults>
<min_quorum>1</min_quorum>
in the input template file.
Database:
- Add "target_nresults" field to app table. Default is zero (app doesn't use adaptive replication).
Scheduler:
- Decide whether to trust host as described above.
- If we send an unreplicated job (i.e., target_nresults=1 and app.target_nresults>1) to an untrusted host, set wu.target_nresults = app.target_nresults and flag the WU for transitioning.
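The scheduler step above might look like the following sketch. The `App` and `Workunit` classes and the `needs_transition` flag are simplified stand-ins for illustration, not BOINC's real data structures; only the `target_nresults` fields and the conditions on them are taken from the notes above.

```python
from dataclasses import dataclass


@dataclass
class App:
    # 0 means the app does not use adaptive replication.
    target_nresults: int = 0


@dataclass
class Workunit:
    target_nresults: int = 1
    # Hypothetical flag standing in for "flag the WU for transitioning".
    needs_transition: bool = False


def is_unreplicated(wu, app):
    """A job is unreplicated if it asks for one result but its app
    is configured to replicate (app.target_nresults > 1)."""
    return wu.target_nresults == 1 and app.target_nresults > 1


def on_send(wu, app, host_trusted):
    """Apply the adaptive replication rule when a job is dispatched."""
    if is_unreplicated(wu, app) and not host_trusted:
        # Promote the job to a replicated one and ask for the
        # additional instances to be created.
        wu.target_nresults = app.target_nresults
        wu.needs_transition = True
    return wu
```

For example, with `App(target_nresults=2)`, sending a fresh `Workunit()` to an untrusted host leaves it with `target_nresults == 2` and the transition flag set, while sending it to a trusted host leaves it unchanged.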