Fix cumulative GPU/CPU memory leak in generation workflow #781
Draft
…ry GPU restoration, add gc.collect()
Co-authored-by: ChuxiJ <30956809+ChuxiJ@users.noreply.github.com>
Copilot changed the title from "[WIP] Fix memory leak issue during generation operations" to "Fix cumulative GPU/CPU memory leak in generation workflow" on Mar 7, 2026.
Each generation cycle accumulated ~8 GB of memory that was never reclaimed: GPU tensors were copied to CPU, but the originals remained referenced, and `gc.collect()` was never called between generations.

Root causes

- `outputs` dict — after the `.detach().cpu()` copies were built into `extra_outputs`, the original accelerator tensors in the mutable `outputs` dict stayed alive until GC eventually ran (it never did between generations); see the sketch below.
- The `finally` block was calling `pred_latents_for_decode.to(vae_device)` after the decode was already complete, re-allocating VRAM for a tensor that was about to be deleted.
- Without explicit `gc.collect()` calls, CPython's reference counting alone was not releasing accelerator buffers between back-to-back generations.
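The first root cause boils down to a mutable dict keeping GPU tensors alive after their CPU copies were taken. A minimal sketch of the fixed pattern, assuming a hypothetical `build_payload` helper (the real logic lives in `generate_music_payload.py`):

```python
import gc

import torch


def build_payload(outputs: dict) -> dict:
    """Copy accelerator tensors to CPU and release the GPU originals right away."""
    extra_outputs = {}
    for key in list(outputs.keys()):
        value = outputs.pop(key)  # remove the (possibly GPU-resident) original from the dict
        if torch.is_tensor(value):
            extra_outputs[key] = value.detach().cpu()  # CPU copy goes into the payload
        else:
            extra_outputs[key] = value
    value = None                  # drop the loop-local binding to the last tensor
    gc.collect()                  # collect reference cycles so tensor storage is actually freed
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release unused cached blocks held by the CUDA allocator
    return extra_outputs
```

The key difference from the leaky version is that the GPU original is popped from `outputs` in the same step that its CPU copy is made, instead of lingering until a later GC pass.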
Changes

- `generate_music_payload.py` — `del pred_wavs` after per-sample CPU extraction; pop all GPU tensor keys from `outputs` after building `extra_outputs`; drop local GPU variable bindings; call `gc.collect()`.
- `generate_music_decode.py` — remove `pred_latents_for_decode = pred_latents_for_decode.to(vae_device)` from the `finally` block (the input tensor is finished with; only the VAE needs restoring); add `gc.collect()` + `_empty_cache()` after the `torch.inference_mode()` block (see the sketch after this list).
- `generate_music.py` — add `gc.collect()` + `_empty_cache()` after `_build_generate_music_success_payload()` returns.
- `init_service_offload_context.py` — add `gc.collect()` before `_empty_cache()` in the offload `finally` block so parameter references are dropped before the accelerator cache flush.
- `batch_management_wrapper.py` — `del generator` and run `gc.collect()` + `empty_cache()` after the generation loop exits to release generator frame state.
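The `generate_music_decode.py` change is easiest to see in context. A rough sketch, assuming a hypothetical `decode_latents` wrapper and a `vae.decode` method, with `torch.cuda.empty_cache()` standing in for the module's `_empty_cache()` helper:

```python
import gc

import torch


def decode_latents(pred_latents_for_decode: torch.Tensor,
                   vae: torch.nn.Module,
                   vae_device: torch.device) -> torch.Tensor:
    """Decode latents on CPU, then restore only the VAE to its original device."""
    try:
        vae.cpu()  # offload the VAE so the decode runs on CPU
        with torch.inference_mode():
            pred_wavs = vae.decode(pred_latents_for_decode.cpu())
    finally:
        # The old code also did `pred_latents_for_decode = pred_latents_for_decode.to(vae_device)`
        # here, re-allocating VRAM for a tensor that was about to be deleted.
        vae.to(vae_device)             # only the model needs to return to the accelerator
        del pred_latents_for_decode    # drop the input latents explicitly
        gc.collect()                   # break cycles before flushing the cache
        if torch.cuda.is_available():
            torch.cuda.empty_cache()   # stand-in for the module's _empty_cache() helper
    return pred_wavs
```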
Tests

- GPU tensor keys are removed from `outputs` after the payload build.
- `extra_outputs` tensors land on CPU.
- `pred_latents_for_decode` is not moved back to the GPU after a successful CPU decode.
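A regression test along these lines could cover the first two bullets (pytest style; `build_payload` and the `payload_sketch` module are the hypothetical names from the sketch above, not the repository's API):

```python
import torch

# `build_payload` is the hypothetical helper sketched above, not the project's actual API.
from payload_sketch import build_payload


def test_outputs_released_and_payload_on_cpu():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    outputs = {"pred_wavs": torch.randn(2, 16, device=device), "step": 3}

    extra_outputs = build_payload(outputs)

    # The mutable dict no longer holds any tensors after the payload build.
    assert not any(torch.is_tensor(v) for v in outputs.values())
    # The payload copies all live on CPU.
    assert extra_outputs["pred_wavs"].device.type == "cpu"
    assert extra_outputs["step"] == 3
```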
Original prompt

This pull request was created from Copilot chat.