fix(offload): keep the working set resident under balanced when it fits the budget#4968
fix(offload): keep the working set resident under balanced when it fits the budget#4968QualiaRain wants to merge 1 commit into
Conversation
Under the highvram profile (>=24GB cards) balanced offload leaves the low watermark at 0.2 but doesn't pin the UNet, so move_module_to_cpu evicts the UNet to CPU and re-dispatches it onto the GPU every cold-text-encode generation -- even when the whole model fits the offload budget with room to spare. On a 32GB card that roughly doubles txt2img+hires time (measured ~13.5s -> ~6.4s on SDXL, matching offload=none). The eviction check tested only an instantaneous VRAM percentage and never asked whether the working set fits the budget. Add a fit condition: only offload for memory pressure when the model does not fit the budget (model_size() > the gpu byte budget). Placement-only and output-neutral -- under no pressure the set runs resident like offload=none; under genuine pressure (set doesn't fit) the guard no-ops and balanced offloads exactly as before (verified with a tiny budget). Reproduced live on current dev (RTX 5090 32GB) on SDXL and SD1.5. Co-Authored-By: Claude <noreply@anthropic.com>
|
i'm not sure i want to change this behavior especially since you already have several ways to achieve the same functionality:
|
|
makes sense to me. rather than changing the behavior, what about a hint when balanced is re-offloading a module that actually fits in vram? eg: "UNet2DConditionModel is being re-offloaded every gen but fits the budget — add it to "module types not to offload" / "model types not to offload", or raise the low watermark, for up to ~2x speedup." only reason i think it may be worth doing is the default behavior slowed my gens but it wasn't clear to me what was going on. happy to draft the pr |
|
let me think about - i'll keep the pr open so it stays on the radar, just converted to draft for the moment. |
|
slept on it--my prior suggestion of a hint feels inelegant. by its own logic, that'd imply that every divergence from ideal performance for a common use case should forward a hint to the user. that COULD be handy... but maybe the solution is actually just for me to ask what doc improvements would have helped me and suggest those, or maybe a feature that could allow automatic empirical derivation of the optimal configuration on a per-system+per-model basis (fastest generation independent of quality-related parameters like step count or resolution). OR (i'm a fan of simple solutions) simply a note if no offload would be faster (though i already knew that, just not that i could get that speedup while retaining the benefits of balanced offloading). just an update on where my head is at. meanwhile, let me know if there's anything else you'd like me to throw claude at hahah. taking requests :) |
|
i'd more than welcome doc improvements. i also though about placing sd/sdxl as dont quant under high vram profile, but that doesn't fit everyone. maybe we can add a hint to the bottom line in info below image so those numbers are clearer plus hint what can be tuned?
in this case:
|
|
that's brilliant! as a matter of fact, that places the information in the exact place i was staring at the whole time to try to figure out what was going on. i'd love that. i can whip that up as stated along with take a look at the docs if you'd like. |
|
go for it (both, docs and hint) |
Disclosure: this PR was authored by Claude (Anthropic's coding agent), which investigated the bug, wrote the fix, and reproduced it on a live process against current
dev.What this changes
Under the default
balancedoffload, on a card with plenty of spare VRAM, a repeated txt2img + hires workload that changes prompt each generation (a cold text-encode) runs about 2× slower than it should — the SDXL UNet is evicted to CPU and re-dispatched onto the GPU every single generation, even though the whole model fits comfortably in the offload budget with ~24 GB to spare.This adds a one-line fit check to
move_module_to_cpu: only offload a module for memory pressure when the working set doesn't actually fit the GPU budget. When it fits, modules stay resident — the same placementoffload noneproduces, so it's output-neutral. When the model genuinely doesn't fit, eviction behaves exactly as today.The bug
The
≥24 GBhighvramauto-profile (modules/shared_defaults.py) setsbalanced+gpu-max=0.8and pins clip-l / clip-g / VAE intodiffusers_offload_never— but it leaves the low watermark at0.2and doesn't keep the UNet resident. The SDXL static working set is ~6.5 GB (≈20% of 32 GB, right at the watermark), and peak usage during a generation reaches ~31% — above0.2. Somove_module_to_cpuevicts the one big module the profile didn't pin — the UNet — on every cold-text-encode generation, then re-onloads it block-by-block for the base pass and again for hires. AnSD_MOVE_DEBUGtrace shows the 4.78 GB UNet pushed to CPU with only 8.8 GB resident, while the offload budget is 25.6 GB (0.8 × 32):The eviction check is purely an instantaneous-percent test; it never asks whether the model fits the budget:
It only looks intermittent in normal use because it fires on the generations where the text-encode is cold (changed prompt / cache miss). Repeating a prompt hides it;
sd_textencoder_cache_size = 0makes it reproduce every generation.The fix
model_size()is the summed size of the tracked modules (GB);self.gpuis the byte budgetgpu_memory * diffusers_offload_max_gpu_memory. So the added clause reads "…and the working set doesn't fit the budget." Placement-only — under no memory pressure the set runs resident, identical tooffload none.Measurements (fresh, current
dev, RTX 5090 / 32 GB, Windows)Fresh server boot per mode, default SDP attention, 2 warm-ups + 6 reps, cold text-encode each gen (
sd_textencoder_cache_size=0), fixed seed, rotating prompts. txt2img 1024² (25 steps, DPM++ 2M) + hires ×1.5 (15 steps). Per-genProcessed: timerstotal, seconds:balanced(default, stock highvram)balanced+ this fix (set fits)offload none(reference)balanced+ fix,max_gpu_memory=0.1(set does not fit)balanced(default, stock highvram)balanced+ this fixoffload nonespeed (~2.1×, 13.47→6.43 vs 6.30 reference) and drives theonload/offloadtimers to 0.max_gpu_memory=0.1) where the set genuinely doesn't fit, the guard no-ops andbalancedstill offloads (19.37 s,offload>0) — the fix changes behavior only when there's room to spare.balancedand the fix removes it (8.45→7.61). The penalty is smaller because the SD1.5 UNet (~1.6 GB) is far smaller than SDXL's (~4.8 GB), so the round-trip cost scales with the offloaded module — exactly what the root cause predicts.Why a fit condition rather than just raising the low watermark
Raising
diffusers_offload_min_gpu_memoryin thehighvramprofile also fixes the 32 GB case, but no single fixed value is correct: a large model that doesn't fit needs the low watermark low (aggressive room-keeping to swap components in and out), while a small model on a big card wants it high (stay resident). They pull in opposite directions, so the right fix is a fit condition, not a new constant — which also covers the 12–24 GB band, where the profile pins nothing. (This matches howaccelerate/diffusersmodel-CPU-offload and ComfyUI decide: keep what fits resident, spill only the excess.)What I tested — and what I did not
Reproduced and fixed live on SDXL and SD1.5, 32 GB RTX 5090 (Blackwell), Windows. That's two models but one card and one denoiser family (UNet), so please treat the following as untested and worth a look before/after merge:
*Transformer2DModel). The guard is architecture-agnostic by construction (it compares total model size against the budget; it has no module-class logic), and for a model too large to fit it provably no-ops — but I have not run them.model_size()is the static module sum and doesn't include peak activation; right at the budget boundary a set that "fits" by static size could still spike. On amply-provisioned cards (the case this targets) there's wide margin.min_gpu_memory = 0(forced-eviction use case). The guard keeps a fitting set resident, which would override an intentional low=0 on a card where the model fits. If you'd rather low=0 keep its "always evict" meaning, gating the new clause onself.min_watermark > 0does that — I left it out to keep the change minimal and because I haven't measured that variant.Relationship to #4513
Same end-symptom as #4513 (closed) — sequential gens decaying as offloaded weights don't stay resident — but #4513 was LoRA-specific (the diffusers LoRA loader stripping accelerate hooks). Here there's no LoRA; the trigger is just a cold text-encode, so it's a distinct still-live path.
A note on scope
This touches an offload default you chose deliberately, so I've kept it to a single conditional and I'm treating it as a proposal — happy to gate it differently (opt-in,
min>0, profile-only), or to make it ahighvram-profile watermark change instead, whichever you prefer.