Skip to content

Fix cumulative GPU/CPU memory leak in generation workflow#781

Draft
Copilot wants to merge 3 commits intomainfrom
copilot/fix-memory-leak-in-generation
Draft

Fix cumulative GPU/CPU memory leak in generation workflow#781
Copilot wants to merge 3 commits intomainfrom
copilot/fix-memory-leak-in-generation

Conversation

Copy link
Contributor

Copilot AI commented Mar 6, 2026

Each generation cycle accumulated ~8GB of memory that was never reclaimed — GPU tensors were copied to CPU but the originals remained referenced, and gc.collect() was never called between generations.

Root causes

  • Dangling GPU tensor references in outputs dict — after .detach().cpu() copies were built into extra_outputs, the original accelerator tensors in the mutable outputs dict stayed alive until GC eventually ran (it never did between generations).
  • Unnecessary GPU restoration in CPU-decode pathfinally block was calling pred_latents_for_decode.to(vae_device) after the decode was already complete, re-allocating VRAM for a tensor that was about to be deleted.
  • No GC at generation boundaries — without explicit gc.collect() calls, CPython's reference counter was not releasing accelerator buffers between back-to-back generations.

Changes

  • generate_music_payload.pydel pred_wavs after per-sample CPU extraction; pop all GPU tensor keys from outputs after building extra_outputs; drop local GPU variable bindings; call gc.collect().
    # Before: GPU tensors in outputs dict stayed alive
    extra_outputs = {"encoder_hidden_states": encoder_hidden_states.detach().cpu(), ...}
    # After: clear originals immediately
    for _k in _gpu_keys:
        outputs.pop(_k, None)
    del src_latents, encoder_hidden_states, ...
    gc.collect()
  • generate_music_decode.py — remove pred_latents_for_decode = pred_latents_for_decode.to(vae_device) from the finally block (input tensor is done; only VAE needs restoring); add gc.collect() + _empty_cache() after torch.inference_mode().
  • generate_music.py — add gc.collect() + _empty_cache() after _build_generate_music_success_payload() returns.
  • init_service_offload_context.py — add gc.collect() before _empty_cache() in the offload finally block so parameter references are dropped before the accelerator cache flush.
  • batch_management_wrapper.pydel generator and run gc.collect() + empty_cache() after the generation loop exits to release generator frame state.

Tests

  • Added assertions that all GPU tensor keys are absent from outputs after payload build.
  • Added assertions that all extra_outputs tensors land on CPU.
  • Added regression test confirming pred_latents_for_decode is not moved back to GPU after a successful CPU decode.
Original prompt

The repository has reported a memory leak issue. Each generation operation increases memory consumption significantly (by around 8GB), mainly involving DiT offloading and other components like VAE and text encoder. The memory does not get properly released and grows steadily during the process.

The report indicates that the issue is possibly related to native PyTorch memory usage, and tools like tracemalloc are not identifying a clear root cause. It may involve improperly managed torch.nn.Parameters, or cumulative GPU/CPU offloading issues during the operation in a containerized environment with Docker using the Gradio interface.

Steps to resolve:

  • Instrument the Gradio generation workflow with memory profiling tools, such as memory_profiler and py-spy.
    • Add logs for memory usage before and after each operation (generation step, model loading, offloading steps, etc.).
  • Analyze and optimize the DiT offloading process with proper memory cleanup using torch.cuda.empty_cache() and garbage collection.
  • Ensure nn.Parameters are managed correctly, and no tensors are persisting unintentionally. Free temporary intermediate results where applicable.
  • Add proper lifecycle and offload-to-cpu memory management to involved models/components.
  • Include functionality to log, trace, and manage GPU/CPU memory transfer details.
  • Fix the root cause of any static references or unexpected memory behavior before merging.

Output:

  • A PR with a fixed memory leak issue.
  • Functional and efficient offloading behavior for Gradio-enabled workflows.

This pull request was created from Copilot chat.


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

Copilot AI and others added 2 commits March 7, 2026 00:02
…ry GPU restoration, add gc.collect()

Co-authored-by: ChuxiJ <30956809+ChuxiJ@users.noreply.github.com>
Co-authored-by: ChuxiJ <30956809+ChuxiJ@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix memory leak issue during generation operations Fix cumulative GPU/CPU memory leak in generation workflow Mar 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants