
sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor #12734

Merged: 1 commit into ggml-org:master, Apr 7, 2025

Conversation

@zhouwg (Contributor) commented Apr 3, 2025

1. The logic here should be similar to the default ggml backend, or exactly the same as ggml-cann (copy data from host to device directly), so I think this memcpy is redundant and may introduce some overhead in this function. Of course, this might be my misunderstanding if there are special technical details in SYCL's internals.
2. I think we should avoid dynamic memory allocation/free in such a performance-sensitive scenario (a sketch of the pattern in question follows).
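
For illustration, a minimal sketch (simplified names, not the exact ggml-sycl.cpp code) of the staging pattern this PR removes versus the direct copy it keeps, assuming a SYCL 2020 USM queue:

    #include <sycl/sycl.hpp>
    #include <cstdlib>
    #include <cstring>

    // Before: stage the caller's data through a freshly malloc'ed host buffer.
    void set_tensor_staged(sycl::queue &q, void *dst, const void *data, size_t size) {
        void *host_buf = std::malloc(size);    // dynamic allocation on every call
        std::memcpy(host_buf, data, size);     // host-to-host copy (the redundant one)
        q.memcpy(dst, host_buf, size).wait();  // host-to-device copy
        std::free(host_buf);
    }

    // After: copy straight from the caller's pointer, as the default ggml
    // backend and ggml-cann do.
    void set_tensor_direct(sycl::queue &q, void *dst, const void *data, size_t size) {
        q.memcpy(dst, data, size).wait();
    }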

@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels on Apr 3, 2025
@zhouwg zhouwg changed the title [SYCL]:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor Apr 3, 2025
Comment on lines 373 to 374
SYCL_CHECK(
CHECK_TRY_ERROR(dpct::dev_mgr::instance().get_device(ctx->device).queues_wait_and_throw()));
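// Annotation (not part of the original diff): queues_wait_and_throw() blocks the
// host until all queues on ctx->device have finished and rethrows any asynchronous
// errors; this is the wait being questioned in the review comment below.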
Collaborator

That makes sense to me. I think this wait before copying should also not be needed. Looking at the other backends it seems to be llama's responsibility to ensure the data is available on the host before calling buffer_set_tensor. Do you think we could remove that as part of this PR too?

Collaborator

I think you are right, we discussed a similar issue in #12580 and the tensor memset API didn't need the wait. Memcpy should be equivalent.

Collaborator

Merging as is and we'll try to investigate the waits separately.

@NeoZhangJianyu (Collaborator)

Yes, the host-to-host copy is not needed.

I remember that some wait() calls are mandatory in this cpp file; removing them will lead to wrong results.
Before removing the wait(), please run unit tests to avoid unexpected impact on these functions.

@Rbiessy Rbiessy merged commit 518a014 into ggml-org:master Apr 7, 2025
48 checks passed
@slaren (Member) commented Apr 7, 2025

IIRC, this copy existed to work around an issue with Intel drivers reading data from a memory-mapped file.

@zhouwg (Contributor, Author) commented Apr 7, 2025

IIRC, this copy existed to work around an issue with Intel drivers reading data from a memory-mapped file.

Is this copy related to some cache?

@slaren (Member) commented Apr 7, 2025

No, when loading a model with mmap enabled, the pointer received by this function may point within the memory-mapped model file. Apparently the Intel driver was not able to copy data to VRAM from a memory-mapped file, so this copy was added as a workaround.
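
For context, a rough sketch of why the incoming pointer can live inside the mapped file when mmap loading is enabled (plain POSIX calls and a made-up helper name, not llama.cpp's actual loader; error handling omitted):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>

    const void *map_tensor_data(const char *model_path, size_t tensor_offset) {
        int fd = open(model_path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        // Map the whole model file read-only; no bytes are read from disk yet.
        void *base = mmap(nullptr, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  // the mapping stays valid after the fd is closed
        // The `data` argument later passed to buffer_set_tensor is just
        // base + offset: a pointer into the mapping, whose pages are only
        // faulted in from the file when something touches them.
        return (const char *) base + tensor_offset;
    }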

@zhouwg (Contributor, Author) commented Apr 7, 2025

No, when loading a model with mmap enabled, the pointer received by this function may point within the memory-mapped model file. Apparently the Intel driver was not able to copy data to VRAM from a memory-mapped file, so this copy was added as a workaround.

So this was my misunderstanding, and there are special technical details in SYCL's internals. Should we revert this PR accordingly?

If this PR is reverted, explanatory comments should be added to this function, because I really didn't understand why there was a host-to-host copy here before I submitted this PR.

@slaren (Member) commented Apr 7, 2025

This is where the copy was added: #6622
Maybe the Intel driver bug has been fixed and it is no longer necessary?

@zhouwg (Contributor, Author) commented Apr 7, 2025

This is where the copy was added: #6622 Maybe the Intel driver bug has been fixed and it is no longer necessary?

You are a high-speed, encyclopedic computer; I hope that bug has been fixed in the Intel driver.

By the way, could you take a moment to look at my PR about the Qualcomm NPU and provide some guidance on it?

@slaren (Member) commented Apr 7, 2025

I am sorry, I don't know anything about QNN or Qualcomm NPUs, and cannot give you any advice about that. From a quick look, the ggml-backend interface implementation looks ok.

@zhouwg (Contributor, Author) commented Apr 7, 2025

No, when loading a model with mmap enabled, the pointer received by this function may point within the memory-mapped model file. Apparently the Intel driver was not able to copy data to VRAM from a memory-mapped file, so this copy was added as a workaround.

Why do you say "apparently the Intel driver was not able to copy data to (GPU) VRAM from a memory-mapped file" when they are both in the same process address space? Could you provide a clearer explanation? Thanks so much!

@slaren (Member) commented Apr 7, 2025

I don't know the details. Memory-mapped files rely on page faults to allocate physical memory and fill it with the contents of the file on demand; my guess is that the driver is accessing the memory in a way that interferes with this mechanism.
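
If that guess is right, the staged memcpy would help because the CPU touches, and therefore faults in, every page before the driver ever sees the memory. A hedged sketch of the same effect done explicitly (the prefault helper is hypothetical, not llama.cpp code):

    #include <cstddef>
    #include <cstdint>

    // Touch one byte per page so the kernel faults the file contents into RAM;
    // after this, the region behaves like ordinary resident memory.
    void prefault_pages(const void *p, size_t n, size_t page_size = 4096) {
        const volatile uint8_t *bytes = (const volatile uint8_t *) p;
        uint8_t sink = 0;
        for (size_t off = 0; off < n; off += page_size) {
            sink ^= bytes[off];
        }
        (void) sink;
    }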

@zhouwg (Contributor, Author) commented Apr 8, 2025

An interesting problem: pointers from mmap and from malloc are both virtual addresses in the userspace address space, so I don't understand why the driver can handle a pointer from malloc but not a pointer from mmap.

cupy/cupy#3431

@NeoZhangJianyu (Collaborator)

Yes, this fix was created by me; I had forgotten about it.

The reason is that mmap() can't work well on the Intel PVC GPU, so we have to memcpy from the mmap region to a host buffer, and then from the host buffer to the device.

This PR should be reverted.

@NeoZhangJianyu (Collaborator)

@zhouwg
Sorry, I suggest reverting this PR.

What do you think?

@zhouwg (Contributor, Author) commented Apr 8, 2025

No problem, I also agree we should revert this commit.

Maybe we can add what you mentioned as a comment in this function:
"mmap() can't work well on the Intel PVC GPU, so we have to memcpy from the mmap region to a host buffer, and then from the host buffer to the device."

One more thing: could we use pinned memory (host memory intended for faster transfers between CPU system memory and device memory) here to avoid the dynamic malloc/free? (A sketch of the idea follows.)
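
As a hedged sketch of that idea (an illustration, not proposed code for the PR): SYCL 2020 exposes pinned host memory through host USM, so a context could keep one grow-only staging buffer instead of calling malloc/free per tensor. Whether the lifetime complexity is worth it is exactly what the reply below weighs.

    #include <sycl/sycl.hpp>
    #include <cstring>

    struct staging_buffer {
        sycl::queue &q;
        void  *ptr  = nullptr;
        size_t size = 0;

        explicit staging_buffer(sycl::queue &queue) : q(queue) {}
        ~staging_buffer() { if (ptr) sycl::free(ptr, q); }

        void *reserve(size_t n) {
            if (n > size) {                      // grow-only: reallocate when too small
                if (ptr) sycl::free(ptr, q);
                ptr  = sycl::malloc_host(n, q);  // pinned (page-locked) host memory
                size = n;
            }
            return ptr;
        }
    };

    void set_tensor_pinned(staging_buffer &stage, void *dst, const void *data, size_t n) {
        void *host_buf = stage.reserve(n);
        std::memcpy(host_buf, data, n);           // mmap/host -> pinned host
        stage.q.memcpy(dst, host_buf, n).wait();  // pinned host -> device
    }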

@NeoZhangJianyu (Collaborator) commented Apr 8, 2025

Yes, we need to add such a comment to record the reason.

This function is called while loading the model from disk. Replacing the dynamic allocation with a persistent buffer won't save much time, and it won't affect inference speed.

But it would increase risk: we don't know when to release the buffer, and it might never be released at all. And if a new model is loaded and the existing buffer is too small, it would need to be reallocated again; a little complex.

@zhouwg (Contributor, Author) commented Apr 8, 2025

Your explanation is very clear, thanks so much.

Now I clearly understand why a dynamic malloc/free was used here as a workaround.

@NeoZhangJianyu (Collaborator)

Thank you very much!
I created the revert PR and added a note explaining the reason: #12812

@zhouwg (Contributor, Author) commented Apr 8, 2025

Thank you very much! I created the revert PR and added a note explaining the reason: #12812

Thanks so much, and sorry for the trouble I caused you and your team.

Adding the following comments, which you mentioned, to this function would make it clearer to other developers (although they might be redundant for those who already know this deeply, such as slaren):

"This function is called while loading the model from disk. Replacing the dynamic allocation with a persistent buffer would not save much time and would bring a potential memory-leak risk."

@NeoZhangJianyu (Collaborator)

Yes, I will update the PR with the above comment.

@characharm (Contributor)

Actually, it seems like Intel has been tweaking something in their driver related to memory handling. On the second-to-last driver version, I was getting a lot of OOM events, PC freezes, and black screens with missing rendered elements. This happened with the same llama.cpp settings and models that used to work fine before, regardless of the backend. With the latest driver, the crashes seem to be gone. I'm only referring to Windows. I tested SYCL yesterday with the latest release — didn’t test extensively, but everything seemed to work.

@NeoZhangJianyu (Collaborator)

The driver comes in stable/LTS and rolling versions.
You could use the LTS version to avoid the abnormal cases above.

@zhouwg (Contributor, Author) commented Apr 10, 2025

I don't know the details. Memory-mapped files rely on page faults to allocate physical memory and fill it with the contents of the file on demand; my guess is that the driver is accessing the memory in a way that interferes with this mechanism.

I think you know the details. You are such a master of humor and a high-speed, encyclopedic computer, and I know you can influence Qualcomm's policy (what you said last year really helped many developers using Qualcomm's SDK).

tastelikefeet added a commit to tastelikefeet/llama.cpp that referenced this pull request Apr 10, 2025
* master: (123 commits)
  cuda : add f32 to bf16 copy op (ggml-org#12806)
  llava: improve clip_ctx destructor to not memleak load_image_size (ggml-org#12834)
  llama : fix FA when KV cache is not used (i.e. embeddings) (ggml-org#12825)
  server : fix thread.join() on exit (ggml-org#12831)
  llava: add more helper functions to check projector types in clip context (ggml-org#12824)
  arg : Including limits file on AIX (ggml-org#12822)
  server : webui : Improve Chat Input with Auto-Sizing Textarea (ggml-org#12785)
  Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (ggml-org#12812)
  gguf-py : support lazy tensor splitting (ggml-org#12809)
  llama : Support llama 4 text-only (ggml-org#12791)
  opencl: better identify Adreno GPU (ggml-org#12760)
  hellaswag: display estimated score confidence interval (ggml-org#12797)
  cuda : fix HIP and MUSA BF16 (#0)
  sync : ggml
  ggml : simplify Arm fp16 CPU logic (ggml/1177)
  CUDA: don't convert BF16 weights to FP32 (ggml/1174)
  cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
  sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (ggml-org#12734)
  ci : no curl on ggml-ci (ggml-org#12796)
  cmake : enable curl by default (ggml-org#12761)
  ...

# Conflicts:
#	common/arg.cpp
#	common/common.cpp
#	common/common.h