fix(offload): keep the working set resident under balanced when it fits the budget #4968

Draft
QualiaRain wants to merge 1 commit into vladmandic:dev from QualiaRain:fix/balanced-offload-fit-aware
@QualiaRain

Contributor

Disclosure: this PR was authored by Claude (Anthropic's coding agent), which investigated the bug, wrote the fix, and reproduced it on a live process against current dev.

What this changes

Under the default balanced offload, on a card with plenty of spare VRAM, a repeated txt2img + hires workload that changes prompt each generation (a cold text-encode) runs about 2× slower than it should — the SDXL UNet is evicted to CPU and re-dispatched onto the GPU every single generation, even though the whole model fits comfortably in the offload budget with ~24 GB to spare.

This adds a one-line fit check to move_module_to_cpu: only offload a module for memory pressure when the working set doesn't actually fit the GPU budget. When it fits, modules stay resident — the same placement offload none produces, so it's output-neutral. When the model genuinely doesn't fit, eviction behaves exactly as today.

The bug

The ≥24 GB highvram auto-profile (modules/shared_defaults.py) sets balanced + gpu-max=0.8 and pins clip-l / clip-g / VAE into diffusers_offload_never — but it leaves the low watermark at 0.2 and doesn't keep the UNet resident. The SDXL static working set is ~6.5 GB (≈20% of 32 GB, right at the watermark), and peak usage during a generation reaches ~31% — above 0.2. So move_module_to_cpu evicts the one big module the profile didn't pin — the UNet — on every cold-text-encode generation, then re-onloads it block-by-block for the base pass and again for hires. An SD_MOVE_DEBUG trace shows the 4.78 GB UNet pushed to CPU with only 8.8 GB resident, while the offload budget is 25.6 GB (0.8 × 32):

Offload: type=balanced op=pre:skip:mem gpu=8.770:3.988 perc=0.27:0.2 ram=32.160 current=cpu dtype=torch.bfloat16 quant=None module=UNet2DConditionModel size=4.782
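To make the trace line above easier to read, here is a minimal sketch that splits it into fields and checks the two conditions discussed in this description. `parse_offload_trace` is a hypothetical helper, not part of the codebase. It assumes that `perc=` is `current:low watermark` and that the first number in `gpu=` is the resident GB, which matches the figures quoted here but is not confirmed against the logging code.

```python
def parse_offload_trace(line: str) -> dict:
    """Split an 'Offload: k=v k=v ...' debug line into a dict of strings."""
    _, _, rest = line.partition("Offload:")
    fields = {}
    for token in rest.split():
        key, sep, value = token.partition("=")
        if sep:  # keep only well-formed key=value tokens
            fields[key] = value
    return fields


line = (
    "Offload: type=balanced op=pre:skip:mem gpu=8.770:3.988 perc=0.27:0.2 "
    "ram=32.160 current=cpu dtype=torch.bfloat16 quant=None "
    "module=UNet2DConditionModel size=4.782"
)
fields = parse_offload_trace(line)

# Assumed meaning of the fields (see the note above)
perc_now, low_watermark = (float(x) for x in fields["perc"].split(":"))
gpu_used_gb = float(fields["gpu"].split(":")[0])

# Budget from the description: gpu-max fraction (0.8) times a 32 GB card
budget_gb = 0.8 * 32

over_watermark = perc_now > low_watermark   # what the current check tests
fits_budget = gpu_used_gb <= budget_gb      # what the current check never asks

print(fields["module"], "over watermark:", over_watermark, "fits budget:", fits_budget)
```

Both flags come out true for this line: the module is over the low watermark and still well inside the budget, which is the combination the PR targets.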

The eviction check is purely an instantaneous-percent test; it never asks whether the model fits the budget:

elif perc_gpu > shared.opts.diffusers_offload_min_gpu_memory:
    op = f'{op}:mem'
    module = do_move(module)

It only looks intermittent in normal use because it fires on the generations where the text-encode is cold (changed prompt / cache miss). Repeating a prompt hides it; sd_textencoder_cache_size = 0 makes it reproduce every generation.

The fix

-        elif perc_gpu > shared.opts.diffusers_offload_min_gpu_memory:
+        # only offload under genuine memory pressure: if the whole working set already fits within the
+        # GPU budget (diffusers_offload_max_gpu_memory) then offloading here just forces a costly
+        # re-onload on the module's next forward -- e.g. the UNet evicted to run the text encoder / VAE
+        # then re-dispatched every generation (~2x slowdown on cards with spare VRAM).
+        elif perc_gpu > shared.opts.diffusers_offload_min_gpu_memory and offload_hook_instance.model_size() > offload_hook_instance.gpu / (1024 ** 3):
             op = f'{op}:mem'
             module = do_move(module)

model_size() is the summed size of the tracked modules (GB); self.gpu is the byte budget gpu_memory * diffusers_offload_max_gpu_memory. So the added clause reads "…and the working set doesn't fit the budget." Placement-only — under no memory pressure the set runs resident, identical to offload none.
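The added clause can be sketched in isolation as a pure function. This is an illustration only: `should_offload` and `GIB` are hypothetical names, the real check reads `offload_hook_instance.model_size()` and `offload_hook_instance.gpu`, and the card is assumed to report exactly 32 GiB.

```python
GIB = 1024 ** 3


def should_offload(perc_gpu: float, min_watermark: float,
                   model_size_gb: float, budget_bytes: int) -> bool:
    """Offload only when over the low watermark AND the set does not fit the budget."""
    over_watermark = perc_gpu > min_watermark
    does_not_fit = model_size_gb > budget_bytes / GIB
    return over_watermark and does_not_fit


# Numbers from the description: perc 0.27 vs watermark 0.2, ~6.5 GB working set
default_budget = int(0.8 * 32 * GIB)   # max_gpu_memory=0.8 on a 32 GB card
tiny_budget = int(0.1 * 32 * GIB)      # max_gpu_memory=0.1, the "does not fit" test

evict_default = should_offload(0.27, 0.2, 6.5, default_budget)
evict_tiny = should_offload(0.27, 0.2, 6.5, tiny_budget)
print(evict_default, evict_tiny)
```

With the default budget the set fits, so the function returns `False` and the module stays resident. With the tiny budget it returns `True`, the same decision the unpatched check makes.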

Measurements (fresh, current dev, RTX 5090 / 32 GB, Windows)

Fresh server boot per mode, default SDP attention, 2 warm-ups + 6 reps, cold text-encode each gen (sd_textencoder_cache_size=0), fixed seed, rotating prompts. txt2img 1024² (25 steps, DPM++ 2M) + hires ×1.5 (15 steps). Per-gen Processed: timers total, seconds:

| model | mode | total (median, s) | min–max (s) | onload (s) | offload (s) |
|---|---|---|---|---|---|
| SDXL | balanced (default, stock highvram) | 13.47 | 12.9–13.6 | ~1.7 | ~1.3 |
| SDXL | balanced + this fix (set fits) | 6.43 | 6.4–6.5 | 0 | 0 |
| SDXL | offload none (reference) | 6.30 | 6.2–6.6 | 0 | 0 |
| SDXL | balanced + fix, max_gpu_memory=0.1 (set does not fit) | 19.37 | 19.0–20.5 | ~1.1 | ~0.85 |
| SD1.5 | balanced (default, stock highvram) | 8.45 | 8.4–9.4 | ~0.2 | ~0.2 |
| SD1.5 | balanced + this fix | 7.61 | 7.5–7.7 | 0 | 0 |

  • SDXL: the fix restores full offload none speed (~2.1×, 13.47→6.43 vs 6.30 reference) and drives the onload/offload timers to 0.
  • No-op under real pressure: with a deliberately tiny budget (max_gpu_memory=0.1) where the set genuinely doesn't fit, the guard no-ops and balanced still offloads (19.37 s, offload>0) — the fix changes behavior only when there's room to spare.
  • SD1.5 shows the same mechanism on a second model: it also evicts under default balanced and the fix removes it (8.45→7.61). The penalty is smaller because the SD1.5 UNet (~1.6 GB) is far smaller than SDXL's (~4.8 GB), so the round-trip cost scales with the offloaded module — exactly what the root cause predicts.
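The ratios quoted in the bullets above can be re-derived from the table medians. A short worked check, using only the measured values already listed (the variable names are illustrative):

```python
# Median totals in seconds, copied from the measurements table
sdxl_balanced, sdxl_fixed, sdxl_none = 13.47, 6.43, 6.30
sd15_balanced, sd15_fixed = 8.45, 7.61

sdxl_speedup = sdxl_balanced / sdxl_fixed     # fix vs stock balanced
sdxl_gap = sdxl_fixed / sdxl_none - 1.0       # remaining gap to offload none
sd15_speedup = sd15_balanced / sd15_fixed

sdxl_saved = sdxl_balanced - sdxl_fixed       # seconds saved per generation
sd15_saved = sd15_balanced - sd15_fixed

print(f"SDXL:  {sdxl_speedup:.2f}x faster, {sdxl_saved:.2f} s saved, "
      f"{sdxl_gap:.1%} above offload none")
print(f"SD1.5: {sd15_speedup:.2f}x faster, {sd15_saved:.2f} s saved")
```

The SDXL saving is several times the SD1.5 saving, in line with the larger UNet being the module that makes the round trip.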

Why a fit condition rather than just raising the low watermark

Raising diffusers_offload_min_gpu_memory in the highvram profile also fixes the 32 GB case, but no single fixed value is correct: a large model that doesn't fit needs the low watermark low (aggressive room-keeping to swap components in and out), while a small model on a big card wants it high (stay resident). They pull in opposite directions, so the right fix is a fit condition, not a new constant — which also covers the 12–24 GB band, where the profile pins nothing. (This matches how accelerate/diffusers model-CPU-offload and ComfyUI decide: keep what fits resident, spill only the excess.)

What I tested — and what I did not

Reproduced and fixed live on SDXL and SD1.5, 32 GB RTX 5090 (Blackwell), Windows. That's two models but one card and one denoiser family (UNet), so please treat the following as untested and worth a look before/after merge:

  • Other denoisers — Flux / SD3 (*Transformer2DModel). The guard is architecture-agnostic by construction (it compares total model size against the budget; it has no module-class logic), and for a model too large to fit it provably no-ops — but I have not run them.
  • The 12–24 GB band (no profile pins) and multi-GPU.
  • The working-set-near-budget / activation-headroom edge: model_size() is the static module sum and doesn't include peak activation; right at the budget boundary a set that "fits" by static size could still spike. On amply-provisioned cards (the case this targets) there's wide margin.
  • Explicit min_gpu_memory = 0 (forced-eviction use case). The guard keeps a fitting set resident, which would override an intentional low=0 on a card where the model fits. If you'd rather low=0 keep its "always evict" meaning, gating the new clause on self.min_watermark > 0 does that — I left it out to keep the change minimal and because I haven't measured that variant.
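For the last item, the gated variant could look roughly like the sketch below. It is unmeasured, as noted above, and `should_offload_gated` plus its parameters are hypothetical names standing in for the real hook attributes.

```python
def should_offload_gated(perc_gpu: float, min_watermark: float,
                         model_size_gb: float, budget_gb: float) -> bool:
    """Fit-aware check that leaves an explicit low watermark of 0 untouched."""
    over_watermark = perc_gpu > min_watermark
    if min_watermark <= 0:
        # low=0 keeps its "always evict" meaning: no fit guard applied
        return over_watermark
    # otherwise evict only when the working set does not fit the budget
    return over_watermark and model_size_gb > budget_gb


forced = should_offload_gated(0.27, 0.0, 6.5, 25.6)     # low=0, set fits
resident = should_offload_gated(0.27, 0.2, 6.5, 25.6)   # default watermark, set fits
pressure = should_offload_gated(0.27, 0.2, 6.5, 3.2)    # set does not fit
print(forced, resident, pressure)
```

The only behavioral difference from the proposed one-liner is the first case: with the watermark at 0 the module is still evicted even though it fits.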

Relationship to #4513

Same end-symptom as #4513 (closed) — sequential gens decaying as offloaded weights don't stay resident — but #4513 was LoRA-specific (the diffusers LoRA loader stripping accelerate hooks). Here there's no LoRA; the trigger is just a cold text-encode, so it's a distinct still-live path.

A note on scope

This touches an offload default you chose deliberately, so I've kept it to a single conditional and I'm treating it as a proposal — happy to gate it differently (opt-in, min>0, profile-only), or to make it a highvram-profile watermark change instead, whichever you prefer.

Under the highvram profile (>=24GB cards) balanced offload leaves the low
watermark at 0.2 but doesn't pin the UNet, so move_module_to_cpu evicts
the UNet to CPU and re-dispatches it onto the GPU every cold-text-encode
generation -- even when the whole model fits the offload budget with room
to spare. On a 32GB card that roughly doubles txt2img+hires time
(measured ~13.5s -> ~6.4s on SDXL, matching offload=none). The eviction
check tested only an instantaneous VRAM percentage and never asked whether
the working set fits the budget.

Add a fit condition: only offload for memory pressure when the model does
not fit the budget (model_size() > the gpu byte budget). Placement-only
and output-neutral -- under no pressure the set runs resident like
offload=none; under genuine pressure (set doesn't fit) the guard no-ops
and balanced offloads exactly as before (verified with a tiny budget).

Reproduced live on current dev (RTX 5090 32GB) on SDXL and SD1.5.

Co-Authored-By: Claude <noreply@anthropic.com>
@vladmandic

Owner

i'm not sure i want to change this behavior especially since you already have several ways to achieve the same functionality:

  • add sdxl to "model types not to offload"
  • add UNet2DConditionModel to "module types not to offload"
  • increase "low watermark"

@QualiaRain

Contributor Author

makes sense to me. rather than changing the behavior, what about a hint when balanced is re-offloading a module that actually fits in vram?

eg: "UNet2DConditionModel is being re-offloaded every gen but fits the budget — add it to "module types not to offload" / "model types not to offload", or raise the low watermark, for up to ~2x speedup."

only reason i think it may be worth doing is the default behavior slowed my gens but it wasn't clear to me what was going on. happy to draft the pr

@vladmandic

vladmandic commented Jun 29, 2026

Owner

let me think about it - i'll keep the pr open so it stays on the radar, just converted to draft for the moment.

@vladmandic vladmandic marked this pull request as draft June 29, 2026 13:59
@QualiaRain

QualiaRain commented Jun 30, 2026

Contributor Author

slept on it--my prior suggestion of a hint feels inelegant. by its own logic, that'd imply that every divergence from ideal performance for a common use case should forward a hint to the user. that COULD be handy... but maybe the solution is actually just for me to ask what doc improvements would have helped me and suggest those, or maybe a feature that could allow automatic empirical derivation of the optimal configuration on a per-system+per-model basis (fastest generation independent of quality-related parameters like step count or resolution). OR (i'm a fan of simple solutions) simply a note if no offload would be faster (though i already knew that, just not that i could get that speedup while retaining the benefits of balanced offloading). just an update on where my head is at. meanwhile, let me know if there's anything else you'd like me to throw claude at hahah. taking requests :)

@vladmandic

Owner

i'd more than welcome doc improvements.

i also thought about placing sd/sdxl as dont quant under high vram profile, but that doesn't fit everyone.
one typical use case for hires is to render a much higher resolution image and then you do need all possible vram to decode it, so unet offloading should not be skipped by default.

maybe we can add a hint to the bottom line in info below image so those numbers are clearer plus hint what can be tuned?

Time: 5.52s | total 8.57 pipeline 3.60 decode 1.80 onload 1.02 prompt 0.77 preview 0.76 move 0.46 | GPU 11238 MB 46% | RAM 15.73 GB 24%

in this case:

  • total generate time: 8.57sec
    and then out of that:
  • onload: moving modules from ram to gpu: 1.02 sec
  • prompt: parse & text-encode: 0.77 sec
  • pipeline: unet: 3.60 sec
  • preview: taesd: 0.76 sec
  • move: moving modules from gpu to ram: 0.46 sec
  • vae decode: 1.80 sec
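As a rough sketch of how such a hint could be derived from that info line: the parser below pulls the timers out of the second `|` section and flags the case where onload plus move takes a large share of the total. `parse_timers`, `HINT_THRESHOLD` and the hint wording are all hypothetical, chosen only for illustration.

```python
import re


def parse_timers(info: str) -> dict:
    """Extract 'name value' timer pairs from the second '|' section of the info line."""
    sections = [s.strip() for s in info.split("|")]
    pairs = re.findall(r"([a-z]+) (\d+(?:\.\d+)?)", sections[1])
    return {name: float(value) for name, value in pairs}


info = ("Time: 5.52s | total 8.57 pipeline 3.60 decode 1.80 onload 1.02 "
        "prompt 0.77 preview 0.76 move 0.46 | GPU 11238 MB 46% | RAM 15.73 GB 24%")
timers = parse_timers(info)

# Time spent moving modules between RAM and GPU in either direction
transfer = timers["onload"] + timers["move"]
share = transfer / timers["total"]

HINT_THRESHOLD = 0.15  # hypothetical cutoff, not a project setting
hint = None
if share > HINT_THRESHOLD:
    hint = (f"{share:.0%} of generate time went to onload/move; "
            "offload settings may be worth tuning")
print(hint)
```

On the sample line, onload and move add up to 1.48 s of the 8.57 s total, about 17%, so the hint fires at this threshold.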

@QualiaRain

Contributor Author

that's brilliant! as a matter of fact, that places the information in the exact place i was staring at the whole time to try to figure out what was going on. i'd love that. i can whip that up as stated and take a look at the docs if you'd like.

@vladmandic

Owner

go for it (both, docs and hint)
i'll have to create an anchor for the hint as there is no element on which it would show, but that's not a problem
