Allow bfloat16 computations on compatible CPUs with Intel Extension for PyTorch #3649

Open · AngryLoki wants to merge 3 commits into master

Conversation


@AngryLoki AngryLoki commented Jun 4, 2024

Modern CPUs have native AVX512 BF16 instructions, which significantly speed up matmul and conv2d operations. At the moment PyTorch has almost no native support for these optimizations (even with oneDNN it does not use the optimal methods); however, IPEX adds everything needed.

With bfloat16 instructions, UNET steps are 40-50% faster on both AMD and Intel CPUs.
There are minor visible changes with bf16, but no avalanche effects, so this feature is enabled by default via the new `--use-ipex-bf16=auto` option.
It can be disabled with `--use-ipex-bf16=no` even if IPEX is installed and the CPU is compatible.

IPEX also slightly improves performance on older AVX2 and AVX512-only (no Bfloat16) CPUs.
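
As a rough illustration of the flag's semantics, here is a minimal argparse sketch of such a tri-state option (names and help text are illustrative, not the exact ComfyUI `cli_args.py` code):

```python
import argparse

parser = argparse.ArgumentParser()
# Tri-state switch: "auto" enables bf16 only when IPEX is importable and
# the CPU reports fast bf16 (AVX512 BF16); "yes"/"no" force it on or off.
parser.add_argument(
    "--use-ipex-bf16",
    choices=["auto", "yes", "no"],
    default="auto",
    help="Run UNET/VAE computation in bfloat16 via Intel Extension for PyTorch on capable CPUs.",
)
args = parser.parse_args()
```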

With the following command (note: ComfyUI never mentions this, but setting the correct environment variables is highly important, see this page), the KSampler node is almost 2 times faster (and memory usage is proportionally smaller):

```sh
LD_PRELOAD=libtrick.so:/src/oneapi/compiler/2024.0/lib/libiomp5.so:/usr/lib64/libtcmalloc.so \
KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 OMP_NUM_THREADS=16 \
numactl -C 0-15 -m 0 python main.py --cpu
```

- `--autocast=no` - 1.68 s/it
- `--autocast=auto` - 1.22 it/s

@simonlui (Contributor) left a comment:

Going to chime in here since I did significant work on the XPU side of IPEX for ComfyUI. This patch basically turns on CPU mode for IPEX, doesn't it? I have been meaning to write a patch for something like this for a while, so thanks for doing the work to enable this. I had a few comments and nudges on things that could be improved, but nothing else looks terribly wrong, and I think this will improve everyone's experience with running the project, although I am not sure the bar to get that speed is low enough to make it a default option for people to try; IPEX does have a minimum requirement of AVX2 on the CPU in order to even work. I would also suggest changing the README to note this is available. Hopefully, when @comfyanonymous is less busy with things, he can take a look at the PR.

```diff
@@ -50,6 +51,7 @@ class CPUState(Enum):
     import intel_extension_for_pytorch as ipex
     if torch.xpu.is_available():
         xpu_available = True
+    ipex_available = True
```

@simonlui (Contributor):

Is there no equivalent check for CPU IPEX besides checking that importing IPEX works? If that is the case, I may rewrite this check in the future to be more comprehensive, making the XPU check more comprehensive as well.
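
For reference, a minimal sketch of what the import-based detection amounts to (a hypothetical helper, not the PR's code):

```python
def ipex_cpu_available() -> bool:
    """Detect IPEX the only way the CPU path seems to allow: try the import."""
    try:
        import intel_extension_for_pytorch  # noqa: F401
        return True
    except ImportError:
        return False
```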

```diff
-        if is_intel_xpu() and not args.disable_ipex_optimize:
-            self.real_model = ipex.optimize(self.real_model.eval(), graph_mode=True, concat_linear=True)
+        if (is_cpu_with_ipex() or is_intel_xpu()) and not args.disable_ipex_optimize:
+            ipex_dtype = torch.bfloat16 if cpu_state == CPUState.CPU and cpu_has_fast_bf16() else None
```

@simonlui (Contributor):

Wouldn't setting the dtype also apply to XPU here too? All Intel GPUs that IPEX supports have bfloat16 support.
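
A sketch of that suggestion (a hypothetical helper, not code from the PR; the CPU/XPU split mirrors the diff above):

```python
import torch

def pick_ipex_dtype(on_cpu: bool, cpu_fast_bf16: bool) -> torch.dtype | None:
    # On CPU, use bf16 only when the CPU has fast bf16 instructions;
    # on XPU, every Intel GPU that IPEX supports has bfloat16 support,
    # so it could default to bf16 there as well.
    if on_cpu:
        return torch.bfloat16 if cpu_fast_bf16 else None
    return torch.bfloat16
```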

```diff
-            self.real_model = ipex.optimize(self.real_model.eval(), graph_mode=True, concat_linear=True)
+        if (is_cpu_with_ipex() or is_intel_xpu()) and not args.disable_ipex_optimize:
+            ipex_dtype = torch.bfloat16 if cpu_state == CPUState.CPU and cpu_has_fast_bf16() else None
+            self.real_model = ipex.optimize(self.real_model.eval(), dtype=ipex_dtype, graph_mode=True, concat_linear=True)
```

@simonlui (Contributor) commented Jun 14, 2024:

I think you want to consider turning on auto_kernel_selection here as an extra parameter to the optimize call. I never called optimize with it because it was listed as a CPU-only optimization, but it is one of the few flags that doesn't get turned on by default with the O1 optimization level in ipex.optimize. So you get the following for this line:

```python
self.real_model = ipex.optimize(self.real_model.eval(), dtype=ipex_dtype, graph_mode=True, auto_kernel_selection=True, concat_linear=True)
```

@AngryLoki (Author):

I checked auto_kernel_selection - it makes no difference, so I don't use it (the IPEX documentation suggests using the default flags when possible).

For some reason `torch.compile(model, backend="ipex")` also makes no difference on CPU (Intel/AMD) or XPU, even though the IPEX documentation recommends it.
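
For context, the torch.compile route being compared looks roughly like this (a minimal sketch with a toy model; the real code optimizes ComfyUI's UNET):

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex  # importing registers the "ipex" compile backend

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU()).eval()
model = ipex.optimize(model, dtype=torch.bfloat16)
# The torch.compile path the IPEX docs recommend; per the comment above,
# it made no measurable difference on top of ipex.optimize here:
compiled = torch.compile(model, backend="ipex")
with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    out = compiled(torch.randn(1, 64))
```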

@simonlui (Contributor):

Alright, I guess. I turned on concat_linear and graph mode, which weren't in O1 by default, because they helped when turned on. Also, torch.compile doesn't help because ipex.optimize with graph mode turned on effectively gives most of the speedups that torch.compile does here. I had written a patch for torch.compile a long time ago, but it never gave much speed improvement and wasn't put in the right places, so it wasn't worth comfy's time to merge.

```diff
-    if cpu_state == CPUState.GPU:
-        if xpu_available:
-            return True
+    return cpu_state == CPUState.GPU and xpu_available
```

@simonlui (Contributor):

I wonder if there should be an option to explicitly turn off the XPU here, because if someone wants to use the CPU for whatever reason, they can't if they are also using an Intel GPU.

@AngryLoki (Author):

I've updated the PR, so that now when the user specifies the --cpu flag, torch.xpu.is_available() is never run, so no XPU code is used anywhere.

Overall, from a design point of view, ComfyUI overuses boolean flags. A better approach would be --list-devices (which outputs the list of available devices) combined with --device=auto, which selects the best device and falls back to CPU if none is found. Not only would this relieve the user from writing --cpu every time, it would also allow specifying the exact GPU on multi-GPU/TPU/XPU/MPS/whatever systems.
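
A minimal sketch of what such a `--device=auto` fallback could look like (a hypothetical helper illustrating the idea, not proposed code):

```python
import torch

def pick_device(requested: str = "auto") -> torch.device:
    # An explicit request wins; "auto" probes backends in preference order
    # and falls back to CPU when nothing else is available.
    if requested != "auto":
        return torch.device(requested)
    if torch.cuda.is_available():
        return torch.device("cuda")
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```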

@simonlui (Contributor):

Looks good to me.

@AngryLoki (Author):

@simonlui, thank you for the review!

> This patch basically turns on CPU mode for IPEX, doesn't it?

Yes, exactly. By the way, I originally added code to detect whether MKL is affected by the "cripple AMD" function, but apparently it does not affect ComfyUI (when I tested earlier, it affected diffusers).

I checked with MKL_ENABLE_INSTRUCTIONS=AVX2 ONEDNN_MAX_CPU_ISA=AVX2 to simulate behavior on an AVX2-only CPU, and it looks like ipex.optimize helps there too.
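
Such a simulated run would look roughly like this (a sketch; everything besides the two environment variables is assumed):

```sh
MKL_ENABLE_INSTRUCTIONS=AVX2 ONEDNN_MAX_CPU_ISA=AVX2 python main.py --cpu
```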

On AMD Ryzen 9 7950X3D:

| | AVX2 (s/it) | AVX512, float32 (s/it) | AVX512, `--use-ipex-bf16=yes` (s/it) |
| --- | --- | --- | --- |
| `--use-ipex=no` | 2.01 | 1.67 | N/A |
| `--use-ipex=yes` | 2.00 | 1.95 | 0.80 |
| `--use-ipex=yes` + LD_PRELOAD | 1.69 | 1.67 | 0.80 |

I've described this LD_PRELOAD in the README, but it looks like it is not needed for the typical bf16 user.

> IPEX does have a minimum requirement of AVX2 needed on the CPU

Yes, I did not add IPEX to requirements.txt (in any case, it only works on Linux/WSL2, and users there will read the README). I've added a section about it.

I've also updated the pull request, so that if IPEX is found and bf16 is enabled, it now sets the default VAE and UNET dtype to bf16 (exactly the same way as it is done for XPU). So it basically mimics XPU, and where memory consumption was previously 12GB, it is now 7.9GB, and so on.

@mcmonkey4eva added the "Feature" and "Needs Testing" labels on Jun 28, 2024