Allow bfloat16 computations on compatible CPUs with Intel Extension for PyTorch #3649

Open · AngryLoki wants to merge 3 commits into master

Conversation


@AngryLoki AngryLoki commented Jun 4, 2024

Modern CPUs have native AVX512 BF16 instructions, which significantly speed up matmul and conv2d operations. At the moment PyTorch has almost no native support for these optimizations (even with oneDNN it does not use the optimal methods); however, IPEX adds everything needed.

With bfloat16 instructions, UNET steps are 40-50% faster on both AMD and Intel CPUs.
There are minor visible changes with bf16, but no avalanche effects, so this feature is enabled by default via the new `--use-ipex-bf16=auto` option.
It can be disabled with `--use-ipex-bf16=no` even if IPEX is installed and the CPU is compatible.

IPEX also slightly improves performance on older AVX2 and AVX512-only (no Bfloat16) CPUs.
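
As a rough illustration of the flag's semantics, here is a minimal argparse sketch of such a tri-state option (names and help text are illustrative, not the exact ComfyUI `cli_args.py` code):

```python
import argparse

parser = argparse.ArgumentParser()
# Tri-state switch: "auto" enables bf16 only when IPEX is importable and
# the CPU reports fast bf16 (AVX512 BF16); "yes"/"no" force it on or off.
parser.add_argument(
    "--use-ipex-bf16",
    choices=["auto", "yes", "no"],
    default="auto",
    help="Run UNET/VAE computation in bfloat16 via Intel Extension for PyTorch on capable CPUs.",
)
args = parser.parse_args()
```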

With the following command (note: ComfyUI never mentions this, but setting the correct environment variables is highly important, see this page), the KSampler node is almost 2 times faster (and memory usage is proportionally smaller):

```sh
LD_PRELOAD=libtrick.so:/src/oneapi/compiler/2024.0/lib/libiomp5.so:/usr/lib64/libtcmalloc.so \
KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 OMP_NUM_THREADS=16 \
numactl -C 0-15 -m 0 python main.py --cpu
```

- `--autocast=no` - 1.68 s/it
- `--autocast=auto` - 1.22 it/s

@simonlui (Contributor) left a comment:

Going to chime in here since I did significant work on the XPU side of IPEX for ComfyUI. This patch basically turns on CPU mode for IPEX, doesn't it? I have been meaning to write a patch for something like this for a while, so thanks for doing the work to enable this. I had a few comments and nudges on things that could be improved, but nothing else looks terribly wrong, and I think this will improve everyone's experience with running the project, although I am not sure the bar to get that speed is low enough to make it a default option for people to try; IPEX does have a minimum requirement of AVX2 on the CPU in order to even work. I would also suggest changing the README to note this is available. Hopefully, when @comfyanonymous is less busy with things, he can take a look at the PR.

```diff
@@ -50,6 +51,7 @@ class CPUState(Enum):
     import intel_extension_for_pytorch as ipex
     if torch.xpu.is_available():
         xpu_available = True
+    ipex_available = True
```

@simonlui (Contributor):

Is there no equivalent check for CPU IPEX besides checking that importing IPEX works? If that is the case, I may rewrite this check in the future to be more comprehensive, making the XPU check more comprehensive as well.
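
For reference, a minimal sketch of what the import-based detection amounts to (a hypothetical helper, not the PR's code):

```python
def ipex_cpu_available() -> bool:
    """Detect IPEX the only way the CPU path seems to allow: try the import."""
    try:
        import intel_extension_for_pytorch  # noqa: F401
        return True
    except ImportError:
        return False
```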

```diff
-        if is_intel_xpu() and not args.disable_ipex_optimize:
-            self.real_model = ipex.optimize(self.real_model.eval(), graph_mode=True, concat_linear=True)
+        if (is_cpu_with_ipex() or is_intel_xpu()) and not args.disable_ipex_optimize:
+            ipex_dtype = torch.bfloat16 if cpu_state == CPUState.CPU and cpu_has_fast_bf16() else None
```

@simonlui (Contributor):

Wouldn't setting the dtype also apply to XPU here too? All Intel GPUs that IPEX supports have bfloat16 support.
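
A sketch of that suggestion (a hypothetical helper, not code from the PR; the CPU/XPU split mirrors the diff above):

```python
import torch

def pick_ipex_dtype(on_cpu: bool, cpu_fast_bf16: bool) -> torch.dtype | None:
    # On CPU, use bf16 only when the CPU has fast bf16 instructions;
    # on XPU, every Intel GPU that IPEX supports has bfloat16 support,
    # so it could default to bf16 there as well.
    if on_cpu:
        return torch.bfloat16 if cpu_fast_bf16 else None
    return torch.bfloat16
```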

```diff
-            self.real_model = ipex.optimize(self.real_model.eval(), graph_mode=True, concat_linear=True)
+        if (is_cpu_with_ipex() or is_intel_xpu()) and not args.disable_ipex_optimize:
+            ipex_dtype = torch.bfloat16 if cpu_state == CPUState.CPU and cpu_has_fast_bf16() else None
+            self.real_model = ipex.optimize(self.real_model.eval(), dtype=ipex_dtype, graph_mode=True, concat_linear=True)
```

@simonlui (Contributor) commented Jun 14, 2024:

I think you want to consider turning on auto_kernel_selection here as an extra parameter to the optimize call. I never called optimize with it because it was listed as a CPU-only optimization, but it is one of the few flags that doesn't get turned on by default with the O1 optimization level in ipex.optimize. So you get the following for this line:

```python
self.real_model = ipex.optimize(self.real_model.eval(), dtype=ipex_dtype, graph_mode=True, auto_kernel_selection=True, concat_linear=True)
```

@AngryLoki (Author):

I checked auto_kernel_selection - it makes no difference, so I don't use it (the IPEX documentation suggests using the default flags when possible).

For some reason `torch.compile(model, backend="ipex")` also makes no difference on CPU (Intel/AMD) or XPU, even though the IPEX documentation recommends it.
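
For context, the torch.compile route being compared looks roughly like this (a minimal sketch with a toy model; the real code optimizes ComfyUI's UNET):

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex  # importing registers the "ipex" compile backend

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU()).eval()
model = ipex.optimize(model, dtype=torch.bfloat16)
# The torch.compile path the IPEX docs recommend; per the comment above,
# it made no measurable difference on top of ipex.optimize here:
compiled = torch.compile(model, backend="ipex")
with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    out = compiled(torch.randn(1, 64))
```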

@simonlui (Contributor):

Alright, I guess. I turned on concat_linear and graph mode, which weren't in O1 by default, because they helped when turned on. Also, torch.compile doesn't help because ipex.optimize with graph mode turned on effectively gives most of the speedups that torch.compile does here. I had written a patch for torch.compile a long time ago, but it never gave much speed improvement and wasn't put in the right places, so it wasn't worth comfy's time to merge.

```diff
-    if cpu_state == CPUState.GPU:
-        if xpu_available:
-            return True
+    return cpu_state == CPUState.GPU and xpu_available
```

@simonlui (Contributor):

I wonder if there should be an option to explicitly turn off the XPU here, because if someone wants to use the CPU for whatever reason, they can't if they are also using an Intel GPU.

@AngryLoki (Author):

I've updated the PR, so that now when the user specifies the --cpu flag, torch.xpu.is_available() is never run, so no XPU code is used anywhere.

Overall, from a design point of view, ComfyUI overuses boolean flags. A better approach would be --list-devices (which outputs the list of available devices) combined with --device=auto, which selects the best device and falls back to CPU if none is found. Not only would this relieve the user from writing --cpu every time, it would also allow specifying the exact GPU on multi-GPU/TPU/XPU/MPS/whatever systems.
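
A minimal sketch of what such a `--device=auto` fallback could look like (a hypothetical helper illustrating the idea, not proposed code):

```python
import torch

def pick_device(requested: str = "auto") -> torch.device:
    # An explicit request wins; "auto" probes backends in preference order
    # and falls back to CPU when nothing else is available.
    if requested != "auto":
        return torch.device(requested)
    if torch.cuda.is_available():
        return torch.device("cuda")
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```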

@simonlui (Contributor):

Looks good to me.

@AngryLoki (Author):

@simonlui, thank you for the review!

> This patch basically turns on CPU mode for IPEX, doesn't it?

Yes, exactly. By the way, I originally added code to detect whether MKL is affected by the "cripple AMD" function, but apparently it does not affect ComfyUI (when I tested earlier, it affected diffusers).

I checked with MKL_ENABLE_INSTRUCTIONS=AVX2 ONEDNN_MAX_CPU_ISA=AVX2 to simulate behavior on an AVX2-only CPU, and it looks like ipex.optimize helps there too.
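
Such a simulated run would look roughly like this (a sketch; everything besides the two environment variables is assumed):

```sh
MKL_ENABLE_INSTRUCTIONS=AVX2 ONEDNN_MAX_CPU_ISA=AVX2 python main.py --cpu
```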

On AMD Ryzen 9 7950X3D:

| | AVX2 (s/it) | AVX512, float32 (s/it) | AVX512, `--use-ipex-bf16=yes` (s/it) |
| --- | --- | --- | --- |
| `--use-ipex=no` | 2.01 | 1.67 | N/A |
| `--use-ipex=yes` | 2.00 | 1.95 | 0.80 |
| `--use-ipex=yes` + LD_PRELOAD | 1.69 | 1.67 | 0.80 |

I've described this LD_PRELOAD in the README, but it looks like it is not needed for the typical bf16 user.

> IPEX does have a minimum requirement of AVX2 needed on the CPU

Yes, I did not add IPEX to requirements.txt (in any case, it only works on Linux/WSL2, and users there will read the README). I've added a section about it.

I've also updated the pull request, so that if IPEX is found and bf16 is enabled, it now sets the default VAE and UNET dtype to bf16 (exactly the same way as it is done for XPU). So it basically mimics XPU, and where memory consumption was previously 12GB, it is now 7.9GB, and so on.

@mcmonkey4eva added the "Feature" and "Needs Testing" labels on Jun 28, 2024