feat: implement float16 inference support (~2x speedup on GPU)#35
Open
alien1403 wants to merge 1 commit intoysharma3501:masterfrom
Open
feat: implement float16 inference support (~2x speedup on GPU)#35alien1403 wants to merge 1 commit intoysharma3501:masterfrom
alien1403 wants to merge 1 commit intoysharma3501:masterfrom
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
feat: implement float16 inference support
Summary
Implements float16 inference as listed on the roadmap. The
dtypeparameter alreadyexisted in
__init__but was never applied as the model weights and the inference pathstayed in float32 regardless of what was passed in. This PR makes it actually work.
What was missing
self.modelandself.vocoswere never cast to float16 after loadingprompt_featureswas never cast inencode_promptgenerate()call had no mixed precision contextpred_featureswas passed to the vocoder in float16, risking NaN overflow in upsampling layersChanges
zipvoice/luxvoice.pyself.modelandself.vocosto float16 after loadingprompt_featuresto float16 inencode_prompttorch.autocastfor safe mixed precisionzipvoice/modeling_utils.pypred_featuresto float32 beforevocoder.decode()to prevent potential NaNfrom fp16 overflow in vocoder upsampling layers
tests/backward compatibility, CPU fallback
README.mdUsage
Benchmark
RTX 3060 Laptop (6GB VRAM), CUDA 12.6, num_steps=4, 10 iterations
float16 is slower on this specific GPU as this is expected and documented here openly.
Two reasons:
The RTX 3060 Laptop has significantly lower float16 tensor core throughput than
desktop-class GPUs (A100, RTX 3090, 4090), where the ~2x speedup claim holds.
PyTorch raises the following warning during float16 inference:
The vocoder uses complex FFT operations internally. Even after casting
pred_featuresback to float32 before
vocoder.decode(), thetorch.autocastcontext still routessome complex ops through experimental float16 paths, which fall back to slower
emulated execution on lower-end hardware.
The implementation is correct and safe. The slowdown is a hardware/library limitation,
not a code issue. The expected ~2x speedup should appear on higher-end GPUs where float16
tensor cores are fully utilized. Users on laptop GPUs can simply keep the default float32.
If you have access to a higher-end GPU and can share benchmark numbers, that would be
a great addition to this PR.
Notes
float32, nothing breakspytest tests/ -v(requires CUDA GPU, downloads model ~1GB on first run)