Fix cumulative GPU/CPU memory leak in generation workflow by Copilot · Pull Request #781 · ace-step/ACE-Step-1.5

Copilot · 2026-03-06T23:50:07Z

Each generation cycle accumulated ~8GB of memory that was never reclaimed — GPU tensors were copied to CPU but the originals remained referenced, and gc.collect() was never called between generations.

Root causes

Dangling GPU tensor references in outputs dict — after .detach().cpu() copies were built into extra_outputs, the original accelerator tensors in the mutable outputs dict stayed alive until GC eventually ran (it never did between generations).
Unnecessary GPU restoration in CPU-decode path — finally block was calling pred_latents_for_decode.to(vae_device) after the decode was already complete, re-allocating VRAM for a tensor that was about to be deleted.
No GC at generation boundaries — without explicit gc.collect() calls, CPython's reference counter was not releasing accelerator buffers between back-to-back generations.

Changes

generate_music_payload.py — del pred_wavs after per-sample CPU extraction; pop all GPU tensor keys from outputs after building extra_outputs; drop local GPU variable bindings; call gc.collect().

# Before: GPU tensors in outputs dict stayed alive
extra_outputs = {"encoder_hidden_states": encoder_hidden_states.detach().cpu(), ...}
# After: clear originals immediately
for _k in _gpu_keys:
    outputs.pop(_k, None)
del src_latents, encoder_hidden_states, ...
gc.collect()

generate_music_decode.py — remove pred_latents_for_decode = pred_latents_for_decode.to(vae_device) from the finally block (input tensor is done; only VAE needs restoring); add gc.collect() + _empty_cache() after torch.inference_mode().
generate_music.py — add gc.collect() + _empty_cache() after _build_generate_music_success_payload() returns.
init_service_offload_context.py — add gc.collect() before _empty_cache() in the offload finally block so parameter references are dropped before the accelerator cache flush.
batch_management_wrapper.py — del generator and run gc.collect() + empty_cache() after the generation loop exits to release generator frame state.

Tests

Added assertions that all GPU tensor keys are absent from outputs after payload build.
Added assertions that all extra_outputs tensors land on CPU.
Added regression test confirming pred_latents_for_decode is not moved back to GPU after a successful CPU decode.

Original prompt

The repository has reported a memory leak issue. Each generation operation increases memory consumption significantly (by around 8GB), mainly involving DiT offloading and other components like VAE and text encoder. The memory does not get properly released and grows steadily during the process.

The report indicates that the issue is possibly related to native PyTorch memory usage, and tools like tracemalloc are not identifying a clear root cause. It may involve improperly managed torch.nn.Parameters, or cumulative GPU/CPU offloading issues during the operation in a containerized environment with Docker using the Gradio interface.

Steps to resolve:

Instrument the Gradio generation workflow with memory profiling tools, such as memory_profiler and py-spy.

Add logs for memory usage before and after each operation (generation step, model loading, offloading steps, etc.).

Analyze and optimize the DiT offloading process with proper memory cleanup using torch.cuda.empty_cache() and garbage collection.

Ensure nn.Parameters are managed correctly, and no tensors are persisting unintentionally. Free temporary intermediate results where applicable.

Add proper lifecycle and offload-to-cpu memory management to involved models/components.

Include functionality to log, trace, and manage GPU/CPU memory transfer details.

Fix the root cause of any static references or unexpected memory behavior before merging.

Output:

A PR with a fixed memory leak issue.

Functional and efficient offloading behavior for Gradio-enabled workflows.

This pull request was created from Copilot chat.

💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

…ry GPU restoration, add gc.collect() Co-authored-by: ChuxiJ <30956809+ChuxiJ@users.noreply.github.com>

Co-authored-by: ChuxiJ <30956809+ChuxiJ@users.noreply.github.com>

Initial plan

d2a2ae3

Copilot AI assigned Copilot and ChuxiJ Mar 6, 2026

Copilot started work on behalf of ChuxiJ March 6, 2026 23:50 View session

Copilot AI and others added 2 commits March 7, 2026 00:02

Fix memory leak: delete GPU tensors post-generation, remove unnecessa…

7ae8c8c

…ry GPU restoration, add gc.collect() Co-authored-by: ChuxiJ <30956809+ChuxiJ@users.noreply.github.com>

Fix trailing blank lines in test files (code review feedback)

30d978f

Co-authored-by: ChuxiJ <30956809+ChuxiJ@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix memory leak issue during generation operations~~ Fix cumulative GPU/CPU memory leak in generation workflow Mar 7, 2026

Copilot finished work on behalf of ChuxiJ March 7, 2026 00:05

This was referenced Mar 7, 2026

fills up RAM after 2 generations leading to PC lockup #142

Open

[Bug] LoRA adapters are accumulating instead of being replaced during unload/load cycle #705

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix cumulative GPU/CPU memory leak in generation workflow#781

Fix cumulative GPU/CPU memory leak in generation workflow#781
Copilot wants to merge 3 commits intomainfrom
copilot/fix-memory-leak-in-generation

Copilot AI commented Mar 6, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Copilot AI commented Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root causes

Changes

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Mar 6, 2026 •

edited

Loading