Allow bfloat16 computations on compatible CPUs with Intel Extension for PyTorch #3649
base: master
Conversation
Going to chime in here since I did significant work on the XPU side of IPEX for ComfyUI. This patch basically turns on CPU mode for IPEX, doesn't it? I have been meaning to write a patch for something like this for a while, so thanks for doing the work to enable it. I had a few comments and nudges on things that could be improved, but nothing else looks terribly wrong, and I think this will improve everyone's experience running the project. I am not sure whether the bar to get that speed is low enough to make it a default option for people to try, though, since IPEX has a minimum requirement of AVX2 on the CPU in order to even work. I would also suggest updating the README to note that this is available. Hopefully, when @comfyanonymous is less busy with things, he can take a look at the PR.
comfy/model_management.py
Outdated
@@ -50,6 +51,7 @@ class CPUState(Enum):
     import intel_extension_for_pytorch as ipex
     if torch.xpu.is_available():
         xpu_available = True
+    ipex_available = True
Is there no way to check for CPU IPEX other than checking that importing IPEX works? If that is the case, I may rewrite this check in the future to make the whole detection, including the XPU check, more comprehensive.
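For reference, a minimal sketch of the import-based detection being discussed; the structure mirrors the diff above, and none of it is the PR's final code:

# Minimal sketch (assumption, not the PR's code): CPU IPEX availability is inferred
# from a successful import, while XPU availability is a separate runtime probe.
ipex_available = False
xpu_available = False
try:
    import torch
    import intel_extension_for_pytorch as ipex  # noqa: F401
    ipex_available = True
    # The import can succeed on a machine with no Intel GPU at all,
    # so XPU support has to be checked separately.
    xpu_available = hasattr(torch, "xpu") and torch.xpu.is_available()
except ImportError:
    pass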
comfy/model_management.py
Outdated
-        if is_intel_xpu() and not args.disable_ipex_optimize:
-            self.real_model = ipex.optimize(self.real_model.eval(), graph_mode=True, concat_linear=True)
+        if (is_cpu_with_ipex() or is_intel_xpu()) and not args.disable_ipex_optimize:
+            ipex_dtype = torch.bfloat16 if cpu_state == CPUState.CPU and cpu_has_fast_bf16() else None
Wouldn't setting the dtype also apply to XPU here? All Intel GPUs that IPEX supports have bfloat16 support.
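For illustration, a hedged sketch of what this suggestion would look like; the helper below is illustrative and is not the PR's final code:

import torch

# Sketch of the suggestion (assumption): the dtype passed to ipex.optimize() could be
# bfloat16 both on fast-bf16 CPUs and on Intel GPUs, since every Intel GPU supported
# by IPEX handles bfloat16. Returning None keeps the model's existing dtype.
def pick_ipex_dtype(cpu_has_fast_bf16: bool, is_intel_xpu: bool):
    if cpu_has_fast_bf16 or is_intel_xpu:
        return torch.bfloat16
    return None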
comfy/model_management.py
Outdated
-            self.real_model = ipex.optimize(self.real_model.eval(), graph_mode=True, concat_linear=True)
+        if (is_cpu_with_ipex() or is_intel_xpu()) and not args.disable_ipex_optimize:
+            ipex_dtype = torch.bfloat16 if cpu_state == CPUState.CPU and cpu_has_fast_bf16() else None
+            self.real_model = ipex.optimize(self.real_model.eval(), dtype=ipex_dtype, graph_mode=True, concat_linear=True)
I think you want to consider turning on auto_kernel_selection here as an extra parameter to the optimize call. I never called optimize with it because it was listed as a CPU-only optimization, but it is one of the few flags that doesn't get turned on by default at the O1 optimization level of ipex.optimize. So you get the following for this line:
self.real_model = ipex.optimize(self.real_model.eval(), dtype=ipex_dtype, graph_mode=True, auto_kernel_selection=True, concat_linear=True)
Checked auto_kernel_selection - it makes no difference, so I don't use it (the ipex documentation suggests using the default flags when possible). For some reason torch.compile(model, backend="ipex") also makes no difference on CPU (Intel/AMD) or XPU, even though the ipex documentation recommends it.
Alright, I guess. I turned on concat_linear and graph_mode, which aren't in O1 by default, because they helped when enabled. Also, torch.compile doesn't help because ipex.optimize with graph mode turned on effectively gives most of the speedups that torch.compile would give here. I had written a patch for torch.compile a long time ago, but it never gave much speed improvement and wasn't put in the right places to be worth comfy's time to merge.
-    if cpu_state == CPUState.GPU:
-        if xpu_available:
-            return True
+    return cpu_state == CPUState.GPU and xpu_available
I wonder if there should be an option to turn off the XPU explicitly here, because if someone wants to use the CPU for whatever reason, they currently can't if they also have an Intel GPU.
I've updated the PR so that when the user specifies the --cpu flag, torch.xpu.is_available() is never run, so no XPU code is used anywhere. Overall, from a design point of view, ComfyUI overuses boolean flags. A better approach would be --list-devices (which prints the list of available devices) together with --device=auto, which selects the best device and falls back to CPU if none is found. Not only would this relieve the user from writing --cpu every time, it would also allow specifying the exact GPU on multi-GPU/TPU/XPU/MPS/whatever systems.
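As a rough illustration of that idea (the flag names and selection logic here are hypothetical, not something ComfyUI implements today):

import argparse
import torch

def available_devices():
    # Hypothetical enumeration; a real version would also consider MPS, DirectML, etc.
    devices = ["cpu"]
    if torch.cuda.is_available():
        devices += ["cuda:%d" % i for i in range(torch.cuda.device_count())]
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        devices += ["xpu:%d" % i for i in range(torch.xpu.device_count())]
    return devices

parser = argparse.ArgumentParser()
parser.add_argument("--list-devices", action="store_true", help="Print detected devices and exit.")
parser.add_argument("--device", default="auto", help="Device to run on; 'auto' prefers an accelerator and falls back to CPU.")
args = parser.parse_args()

if args.list_devices:
    print("\n".join(available_devices()))
    raise SystemExit(0)

if args.device == "auto":
    accelerators = [d for d in available_devices() if d != "cpu"]
    device = accelerators[0] if accelerators else "cpu"
else:
    device = args.device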
Looks good to me.
@simonlui thank you for the review!
Yes, exactly. By the way, originally I added code to detect whether MKL is affected by the "cripple AMD" function, but apparently it does not affect ComfyUI (when I tested earlier, it did affect diffusers). I checked on an AMD Ryzen 9 7950X3D:
I've described this LD_PRELOAD trick in the README, but it looks like a generic bf16 user does not need it.
Yes, I did not add IPEX to requirements.txt (anyway, it only works on Linux/WSL2, and users there will read the README). I've added a section about it. I've also updated the pull request so that if IPEX is found and bf16 is enabled, it sets the default VAE and UNET dtype to bf16 (exactly the same way as it is done for XPU). So it basically mimics XPU, and where memory consumption was previously 12GB, it is now 7.9GB, and so on.
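A rough sketch of that dtype selection (the helper name is illustrative, not the exact code in model_management.py):

import torch

# Illustrative sketch (assumption): when IPEX is usable on the CPU and bf16 is enabled,
# default the UNET and VAE dtypes to bfloat16, mirroring what is already done for XPU;
# otherwise stay on float32 for the CPU path.
def default_model_dtype(is_intel_xpu: bool, cpu_ipex_bf16: bool) -> torch.dtype:
    if is_intel_xpu or cpu_ipex_bf16:
        return torch.bfloat16
    return torch.float32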
Signed-off-by: Sv. Lockal <[email protected]>
…on loading stage Signed-off-by: Sv. Lockal <[email protected]>
Modern CPUs have native AVX512 BF16 instructions, which significantly improve matmul and conv2d operations. At the moment PyTorch has almost no native support for these optimizations (even with oneDNN it does not use optimal methods), however IPEX adds everything needed.
With bfloat16 instructions, UNET steps are 40-50% faster on both AMD and Intel CPUs.
There are minor visible changes with bf16, but no avalanche effects, so this feature is enabled by default with the new --use-ipex-bf16=auto option. It can be disabled with --use-ipex-bf16=no even if IPEX is installed and the CPU is compatible.
IPEX also slightly improves performance on older AVX2 and AVX512-only (no bfloat16) CPUs.
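For context, one way a check like cpu_has_fast_bf16() could work on Linux is to look at the CPU feature flags; this is an illustrative sketch, not necessarily the PR's implementation:

def cpu_has_fast_bf16() -> bool:
    # Illustrative sketch: the avx512_bf16 flag in /proc/cpuinfo indicates native
    # BF16 instructions (e.g. AMD Zen 4, Intel Cooper Lake / Sapphire Rapids).
    try:
        with open("/proc/cpuinfo") as f:
            return "avx512_bf16" in f.read()
    except OSError:
        return False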
With the following command (note: ComfyUI never mentions this, but setting the correct environment variables is highly important, see this page), the KSampler node is almost 2 times faster (and memory usage is proportionally smaller):
--autocast=no - 1.68s/it
--autocast=auto - 1.22it/s