fix(generation): add timeout, progress fallback, and VRAM pre-flight … #671

Uni404x64 wants to merge 2 commits into ace-step:main
Conversation
…fixes ace-step#146

Addresses three root causes of the 50% UI freeze on mid-tier GPUs:

1. **Synchronous deadlock** (`generate_music_execute.py`)
   - Wrapped `service_generate()` in a daemon thread with a configurable wall-clock timeout (default 600 s, override via `ACESTEP_GENERATION_TIMEOUT`).
   - Safe env-var parsing: non-numeric values and values <= 0 fall back to 600 s with a warning instead of crashing at import time or timing out every generation.
   - Calls `torch.cuda.empty_cache()` before raising `TimeoutError` to recover fragmented VRAM for subsequent attempts.
   - Logs the orphaned `service-generate` thread count on timeout for operator visibility.
   - Uses `except BaseException` in the worker to ferry `SystemExit` and friends across the thread boundary without swallowing them.
   - A defensive guard on `_result["outputs"]` prevents a raw `KeyError` if the thread exits without populating either dict.

2. **Estimator cold-start** (`progress.py`)
   - `_start_diffusion_progress_estimator` no longer returns `(None, None)` when no timing history exists; it falls back to `2.5 s/step x batch_size` so the progress bar always animates on first run.
   - Self-calibrates after the first successful generation via the existing mechanism.

3. **VRAM pre-flight check** (`generate_music.py`)
   - `_vram_preflight_check()` runs before `service_generate`, using the project-standard `get_effective_free_vram_gb()` + `DIT_INFERENCE_VRAM_PER_BATCH` constants from `gpu_config.py`.
   - Detects CFG mode via `guidance_scale > 1.0` (two DiT passes = 2x activations).
   - Scales the estimate with duration (5 Hz latent rate).
   - Returns a clean, actionable error payload instead of a silent hang.
   - No-ops on non-CUDA backends; safe even if `get_effective_free_vram_gb()` raises.

All logger calls use loguru (`from loguru import logger`), consistent with the rest of the handler package.

Tested: Windows 11 / Ryzen 9 5900X / RTX 3070 Ti (8 GB). Batch 2, 3:01 tracks, 100.78 s total. DiT phase 12.8 s. No hang; progress bar animated throughout.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
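The daemon-thread timeout pattern summarized above can be sketched in isolation. This is a hypothetical helper, not the project's actual handler; in the real code the timeout comes from `ACESTEP_GENERATION_TIMEOUT` and the timeout path also clears the CUDA cache:

```python
import threading

def run_with_timeout(fn, timeout_s=600.0):
    """Run fn() in a daemon thread; raise TimeoutError if it outlives timeout_s."""
    _result, _error = {}, {}

    def _worker():
        try:
            _result["outputs"] = fn()
        except BaseException as exc:  # ferry SystemExit and friends to the caller
            _error["exc"] = exc

    t = threading.Thread(target=_worker, name="service-generate", daemon=True)
    t.start()
    t.join(timeout=timeout_s)
    if t.is_alive():
        raise TimeoutError(f"generation exceeded {timeout_s}s; thread left orphaned")
    if "exc" in _error:
        raise _error["exc"]
    if "outputs" not in _result:  # defensive guard against a silently dead worker
        raise RuntimeError("worker exited without outputs or an exception")
    return _result["outputs"]
```

Because the worker is a daemon, a timed-out call does not block interpreter shutdown, but the thread keeps running in the background, which is exactly the orphaned-thread concern raised in review below.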
📝 Walkthrough

Adds GPU VRAM preflight validation, threading-based timeout protection for service execution, and a progress-estimator fallback for cold starts in the music generation flow; returns early on insufficient VRAM and enforces a configurable generation timeout with diagnostic logging.
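One piece of the walkthrough, the cold-start progress fallback, reduces to a small rule; the sketch below uses hypothetical names, with only the `2.5 s/step x batch_size` fallback taken from the PR description:

```python
def estimate_step_seconds(history, batch_size, fallback_per_step=2.5):
    """Seconds-per-step estimate: average the timing history when present,
    else use the cold-start fallback so the progress bar can animate on first run."""
    if history:  # self-calibrated after the first successful generation
        return sum(history) / len(history)
    return fallback_per_step * batch_size
```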
Sequence Diagram

sequenceDiagram
participant Client
participant GenerateMusicMixin
participant VRAMPreflight as VRAM Preflight
participant Executor as Service Executor
participant WorkerThread as Worker Thread
participant Progress as Progress Tracker
participant GPU
Client->>GenerateMusicMixin: generate_music(inputs)
GenerateMusicMixin->>VRAMPreflight: _vram_preflight_check(batch, duration, guidance)
VRAMPreflight->>GPU: get_effective_free_vram_gb()
GPU-->>VRAMPreflight: free_vram_gb
VRAMPreflight->>GenerateMusicMixin: error payload OR None
alt Insufficient VRAM
GenerateMusicMixin-->>Client: return error payload
else Sufficient VRAM
GenerateMusicMixin->>Executor: prepare and invoke service_generate
Executor->>WorkerThread: spawn daemon thread to run service
WorkerThread->>Progress: start progress estimator (fallback if needed)
WorkerThread->>Executor: produce outputs / raise exception
alt Timeout
Executor->>GPU: clear CUDA cache
Executor-->>Client: raise TimeoutError (with diagnostics)
else Completed
Executor-->>Client: return outputs
end
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
acestep/core/generation/handler/generate_music_execute.py (2)
156-181: Orphaned daemon threads will accumulate on repeated timeouts.

Each timed-out `service_generate` call leaves a daemon thread pinned to the CUDA context. If users retry several times in a row, these pile up and can exhaust VRAM or host memory. The logging at lines 163-165 is a good start for observability, but consider adding a guard that refuses new generation attempts when the orphaned thread count exceeds a threshold (e.g. ≥ 3), returning an error payload that recommends a server restart.

Not a blocker for this PR, but worth tracking.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/core/generation/handler/generate_music_execute.py` around lines 156 - 181, Detect excessive orphaned daemon threads created by stalled service_generate runs by reusing the existing stalled count logic (sum(1 for t in threading.enumerate() if t.name == "service-generate")) and add a guard in the generation entry point (the handler around gen_thread/service_generate in generate_music_execute.py) that refuses new generation attempts when stalled >= 3: return an error payload (or raise a specific exception) that clearly states too many orphaned service_generate threads exist and recommends restarting the server, instead of proceeding to launch another generation; keep the existing timeout/cleanup behavior but check this threshold before starting gen_thread so callers get an actionable error immediately.
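The threshold guard suggested in this comment could look roughly like the following. This is a sketch; the helper names and the limit of 3 are taken from the suggestion, not from the project's code:

```python
import threading

ORPHAN_THREAD_LIMIT = 3  # threshold suggested in the review comment

def count_orphaned_workers(name="service-generate"):
    """Count live threads left behind by timed-out service_generate calls."""
    return sum(1 for t in threading.enumerate() if t.name == name)

def orphan_guard():
    """Return an error payload when too many stalled workers remain, else None."""
    stalled = count_orphaned_workers()
    if stalled >= ORPHAN_THREAD_LIMIT:
        return {
            "error": (
                f"{stalled} orphaned service-generate thread(s) are still pinned "
                "to the CUDA context; restart the server before retrying."
            )
        }
    return None
```

Called before spawning `gen_thread`, this turns a guaranteed-to-fail launch into an immediate, actionable error.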
177-181: Ruff TRY003: inline exception messages.

Static analysis flags the long string literals passed directly to `TimeoutError` and `RuntimeError`. You could extract them into module-level constants or slim them down, but given these are user-facing diagnostic messages, the current form is readable and acceptable. Suppressing with `# noqa: TRY003` is fine if you prefer to keep them inline.

Also applies to: 191-194
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/core/generation/handler/generate_music_execute.py` around lines 177 - 181, The inline multi-line exception messages in generate_music_execute.py (the TimeoutError raised using _DEFAULT_GENERATION_TIMEOUT and the similar RuntimeError later) are flagged by Ruff TRY003; either extract those user-facing strings into module-level constants (e.g. TIMEOUT_ERROR_MSG and RUNTIME_ERROR_MSG) and reference those constants in the TimeoutError and RuntimeError raises, or keep the inline strings and add a local suppression comment "# noqa: TRY003" on those raise lines; update the raise sites that use _DEFAULT_GENERATION_TIMEOUT, TimeoutError, and RuntimeError accordingly.
    if not torch.cuda.is_available():
        return None

    duration_s = audio_duration or 60.0
    # CFG doubles forward-pass memory: two DiT evaluations per step.
    dit_key = "base" if guidance_scale > 1.0 else "turbo"
    per_batch_gb = DIT_INFERENCE_VRAM_PER_BATCH.get(dit_key, 0.6)
    # Longer audio = more latent frames (5 Hz rate) = more memory.
    duration_factor = max(1.0, duration_s / 60.0)
    needed_gb = per_batch_gb * actual_batch_size * duration_factor + VRAM_SAFETY_MARGIN_GB

    free_gb = get_effective_free_vram_gb()
    logger.info(
        "[generate_music] VRAM pre-flight: {:.2f} GB free, ~{:.2f} GB needed "
        "(batch={}, duration={:.0f}s, mode={}).",
        free_gb, needed_gb, actual_batch_size, duration_s, dit_key,
    )

    if free_gb >= needed_gb:
        return None
get_effective_free_vram_gb() returning 0.0 on internal error will falsely block generation.
When CUDA is available but get_effective_free_vram_gb() hits an unexpected internal exception, it returns 0.0 (per its own except clause). Since 0.0 < needed_gb is always true, this will surface a "not enough VRAM" error even though the GPU may have plenty of memory.
Consider guarding against this — e.g., treat free_gb <= 0.0 as "unable to query VRAM" and let the generation attempt proceed:
Proposed fix
 free_gb = get_effective_free_vram_gb()
+if free_gb <= 0.0:
+    logger.warning(
+        "[generate_music] VRAM pre-flight: unable to query free VRAM; "
+        "skipping check and allowing generation to proceed."
+    )
+    return None
+
 logger.info(
     "[generate_music] VRAM pre-flight: {:.2f} GB free, ~{:.2f} GB needed "

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@acestep/core/generation/handler/generate_music.py` around lines 51 - 70, The
VRAM check in generate_music is blocking when get_effective_free_vram_gb()
returns 0.0 on error; change the logic so that if free_gb is None or free_gb <=
0.0 you treat it as "unable to query VRAM" and do not block generation: call
logger.warning (or similar) to record the failed VRAM query referencing
get_effective_free_vram_gb/free_gb/needed_gb and then return None to proceed;
otherwise keep the existing check (if free_gb >= needed_gb return None, else
raise or handle insufficient VRAM).
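For intuition, the pre-flight check discussed above is plain arithmetic. The constants below are placeholders for illustration only; the real values live in `gpu_config.py`:

```python
# Placeholder constants -- the real values come from gpu_config.py.
DIT_INFERENCE_VRAM_PER_BATCH = {"base": 1.2, "turbo": 0.6}  # GB per batch item (assumed)
VRAM_SAFETY_MARGIN_GB = 1.0  # assumed

def estimate_needed_gb(batch_size, duration_s, guidance_scale):
    """Mirror the pre-flight formula: per-batch cost x batch x duration factor + margin."""
    dit_key = "base" if guidance_scale > 1.0 else "turbo"  # CFG = two DiT passes per step
    per_batch_gb = DIT_INFERENCE_VRAM_PER_BATCH.get(dit_key, 0.6)
    duration_factor = max(1.0, duration_s / 60.0)  # 5 Hz latent rate scales with length
    return per_batch_gb * batch_size * duration_factor + VRAM_SAFETY_MARGIN_GB
```

With these assumed constants, a batch of 2 three-minute CFG tracks needs roughly `1.2 * 2 * 3.0 + 1.0 ≈ 8.2` GB, which is why the review's `free_gb <= 0.0` fallback matters: a failed query would otherwise always read as insufficient.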
Actionable comments posted: 3
🧹 Nitpick comments (1)
acestep/core/generation/handler/generate_music_execute.py (1)
105-116: Consider calling `torch.cuda.empty_cache()` before raising `TimeoutError`.

When the generation times out due to VRAM pressure, the orphaned CUDA kernels may leave fragmented memory. Clearing the cache here improves the chance that a subsequent retry succeeds; this was noted in the PR objectives but isn't implemented.
♻️ Proposed addition
 if gen_thread.is_alive():
+    orphaned = sum(
+        1 for t in threading.enumerate() if t.name == "service-generate"
+    )
     logger.error(
         f"[generate_music] service_generate exceeded {_DEFAULT_GENERATION_TIMEOUT}s "
         f"timeout (batch={actual_batch_size}, steps={inference_steps}, "
-        f"duration={audio_duration}s). The CUDA operation may still be "
-        f"running in the background."
+        f"duration={audio_duration}s). {orphaned} orphaned service-generate "
+        f"thread(s) still running."
     )
+    try:
+        import torch
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+    except Exception:
+        pass
     raise TimeoutError(

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/core/generation/handler/generate_music_execute.py` around lines 105 - 116, The timeout branch in the generate_music execution path should release CUDA cache before raising the TimeoutError: inside the if gen_thread.is_alive() block (where logger.error references _DEFAULT_GENERATION_TIMEOUT, actual_batch_size, inference_steps, audio_duration), call torch.cuda.empty_cache() if torch.cuda.is_available() to free fragmented VRAM, optionally log that cache was cleared (using the same logger) and then raise the existing TimeoutError; ensure you import torch where generate_music_execute.py uses CUDA cleanup.
# Maximum wall-clock seconds to wait for service_generate before declaring a hang.
# Generous default: most generations finish in 30-120s, but large batches on slow
# GPUs can take several minutes. Override via ACESTEP_GENERATION_TIMEOUT env var.
_DEFAULT_GENERATION_TIMEOUT = int(os.environ.get("ACESTEP_GENERATION_TIMEOUT", "600"))
Unsafe env-var parsing — non-numeric values crash at import time.
The PR description promises safe parsing with fallback to 600 s, but this bare int() raises ValueError on non-numeric input (e.g., ACESTEP_GENERATION_TIMEOUT="abc"), crashing the import. Zero and negative values also produce degenerate behavior (thread.join(timeout=0) returns immediately, always triggering the timeout path).
🐛 Proposed fix: safe parsing with validation
-# Maximum wall-clock seconds to wait for service_generate before declaring a hang.
-# Generous default: most generations finish in 30-120s, but large batches on slow
-# GPUs can take several minutes. Override via ACESTEP_GENERATION_TIMEOUT env var.
-_DEFAULT_GENERATION_TIMEOUT = int(os.environ.get("ACESTEP_GENERATION_TIMEOUT", "600"))
+# Maximum wall-clock seconds to wait for service_generate before declaring a hang.
+# Generous default: most generations finish in 30-120s, but large batches on slow
+# GPUs can take several minutes. Override via ACESTEP_GENERATION_TIMEOUT env var.
+_DEFAULT_GENERATION_TIMEOUT: int = 600
+
+_raw_timeout = os.environ.get("ACESTEP_GENERATION_TIMEOUT")
+if _raw_timeout is not None:
+    try:
+        _parsed = int(_raw_timeout)
+        if _parsed > 0:
+            _DEFAULT_GENERATION_TIMEOUT = _parsed
+        else:
+            logger.warning(
+                f"ACESTEP_GENERATION_TIMEOUT={_raw_timeout!r} is <= 0; "
+                f"falling back to {_DEFAULT_GENERATION_TIMEOUT}s."
+            )
+    except ValueError:
+        logger.warning(
+            f"ACESTEP_GENERATION_TIMEOUT={_raw_timeout!r} is not a valid integer; "
+            f"falling back to {_DEFAULT_GENERATION_TIMEOUT}s."
+        )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@acestep/core/generation/handler/generate_music_execute.py` around lines 9 -
12, The module-level _DEFAULT_GENERATION_TIMEOUT is currently set with bare
int(os.environ.get(...)) which will raise ValueError on non-numeric input and
allows zero/negative values; change the initialization to parse
ACESTEP_GENERATION_TIMEOUT safely: read the env var, attempt to convert to int
inside a try/except, fall back to 600 on any exception or if the parsed value is
<= 0 (optionally clamp to a minimum like 1), and use that validated value for
_DEFAULT_GENERATION_TIMEOUT; reference the symbol _DEFAULT_GENERATION_TIMEOUT
and the env var name ACESTEP_GENERATION_TIMEOUT when making the change and add a
debug/warn log if parsing fails.
    except Exception as exc:
        _error["exc"] = exc
except Exception won't catch BaseException subclasses in the worker thread.
If SystemExit (or another BaseException subclass) is raised inside the worker, neither _result nor _error gets populated. The thread exits silently, gen_thread.is_alive() returns False, the error check on line 117 passes, and line 126 crashes with KeyError: 'outputs'.
Use except BaseException here to ferry all exceptions across the thread boundary, matching the stated intent in the PR description. This also addresses the Ruff BLE001 hint — in this specific pattern (worker thread → re-raise on caller) broad catching is intentional.
🐛 Proposed fix
-    except Exception as exc:
+    except BaseException as exc:
         _error["exc"] = exc

🧰 Tools
🪛 Ruff (0.15.1)
[warning] 83-83: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@acestep/core/generation/handler/generate_music_execute.py` around lines 83 -
84, Replace the narrow except with a BaseException catcher so worker-thread
fatal exceptions (e.g. SystemExit, KeyboardInterrupt) are captured and
transported back to the caller: in the exception handling block that assigns to
_error (the try/except around the worker that currently does "except Exception
as exc" and sets _error["exc"] = exc), change it to catch BaseException and
store the exception object so the caller (the code that checks
gen_thread.is_alive() and later inspects _error/_result) can re-raise or handle
it; keep the same _error key and behavior but use BaseException to match the
intended cross-thread re-raise pattern used by this module
(generate_music_execute.py, the _error dict and the exc variable).
        progress_thread.join(timeout=1.0)
        return {"outputs": outputs, "infer_steps_for_progress": infer_steps_for_progress}

    return {"outputs": _result["outputs"], "infer_steps_for_progress": infer_steps_for_progress}
Missing defensive guard on _result["outputs"] — potential KeyError.
The PR description mentions a defensive guard here, but the code directly accesses _result["outputs"] which will raise KeyError if the worker thread exits without populating it (e.g., via an uncaught BaseException). Even with the except BaseException fix above, a guard here provides belt-and-suspenders safety.
🐛 Proposed fix
-    return {"outputs": _result["outputs"], "infer_steps_for_progress": infer_steps_for_progress}
+    if "outputs" not in _result:
+        raise RuntimeError(
+            "service_generate thread exited without producing outputs or raising an exception."
+        )
+    return {"outputs": _result["outputs"], "infer_steps_for_progress": infer_steps_for_progress}

🤖 Prompt for AI Agents
+ return {"outputs": _result["outputs"], "infer_steps_for_progress": infer_steps_for_progress}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@acestep/core/generation/handler/generate_music_execute.py` at line 126, The
return unconditionally indexing _result["outputs"] can raise KeyError if the
worker never populated outputs; update the return to safely read outputs (e.g.,
use _result.get("outputs", []) or a similar defensive check) before returning
from the function/method in generate_music_execute.py (look for the return that
references _result["outputs"] in the generate_music_execute handler), so the
returned dict always contains a valid outputs value even when the worker thread
failed to populate it.
please solve comments |