
sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor #12734

Merged: 1 commit into ggml-org:master, Apr 7, 2025

Conversation

@zhouwg (Contributor) commented Apr 3, 2025

1. The logic here should be similar to the default ggml backend, or exactly the same as ggml-cann (copy data from host to device directly), so I think this memcpy is redundant and may introduce some overhead in this function. Of course, this might be my misunderstanding if there are special technical details in SYCL's internals.
2. I think we should avoid dynamic memory allocation/free in such a performance-sensitive scenario (a sketch of the pattern in question follows).
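
For illustration, a minimal sketch (simplified names, not the exact ggml-sycl.cpp code) of the staging pattern this PR removes versus the direct copy it keeps, assuming a SYCL 2020 USM queue:

    #include <sycl/sycl.hpp>
    #include <cstdlib>
    #include <cstring>

    // Before: stage the caller's data through a freshly malloc'ed host buffer.
    void set_tensor_staged(sycl::queue &q, void *dst, const void *data, size_t size) {
        void *host_buf = std::malloc(size);    // dynamic allocation on every call
        std::memcpy(host_buf, data, size);     // host-to-host copy (the redundant one)
        q.memcpy(dst, host_buf, size).wait();  // host-to-device copy
        std::free(host_buf);
    }

    // After: copy straight from the caller's pointer, as the default ggml
    // backend and ggml-cann do.
    void set_tensor_direct(sycl::queue &q, void *dst, const void *data, size_t size) {
        q.memcpy(dst, data, size).wait();
    }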

@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels on Apr 3, 2025
@zhouwg zhouwg changed the title [SYCL]:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor Apr 3, 2025
Comment on lines 373 to 374
SYCL_CHECK(
CHECK_TRY_ERROR(dpct::dev_mgr::instance().get_device(ctx->device).queues_wait_and_throw()));
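// Annotation (not part of the original diff): queues_wait_and_throw() blocks the
// host until all queues on ctx->device have finished and rethrows any asynchronous
// errors; this is the wait being questioned in the review comment below.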
Collaborator

That makes sense to me. I think this wait before copying should also not be needed. Looking at the other backends it seems to be llama's responsibility to ensure the data is available on the host before calling buffer_set_tensor. Do you think we could remove that as part of this PR too?

Collaborator

I think you are right, we discussed a similar issue in #12580 and the tensor memset API didn't need the wait. Memcpy should be equivalent.

Collaborator

Merging as is and we'll try to investigate the waits separately.

@NeoZhangJianyu (Collaborator)

Yes, the host-to-host copy is not needed.

I remember that some wait() calls are mandatory in this cpp file; removing them will lead to wrong results.
Before removing the wait(), please run unit tests to avoid unexpected impact on these functions.

@Rbiessy Rbiessy merged commit 518a014 into ggml-org:master Apr 7, 2025
48 checks passed
@slaren (Member) commented Apr 7, 2025

IIRC, this copy existed to work around an issue with Intel drivers reading data from a memory-mapped file.

@zhouwg (Contributor, Author) commented Apr 7, 2025

IIRC, this copy existed to work around an issue with Intel drivers reading data from a memory-mapped file.

Is this copy related to some cache?

@slaren (Member) commented Apr 7, 2025

No, when loading a model with mmap enabled, the pointer received by this function may point within the memory-mapped model file. Apparently the Intel driver was not able to copy data to VRAM from a memory-mapped file, so this copy was added as a workaround.
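
For context, a rough sketch of why the incoming pointer can live inside the mapped file when mmap loading is enabled (plain POSIX calls and a made-up helper name, not llama.cpp's actual loader; error handling omitted):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>

    const void *map_tensor_data(const char *model_path, size_t tensor_offset) {
        int fd = open(model_path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        // Map the whole model file read-only; no bytes are read from disk yet.
        void *base = mmap(nullptr, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  // the mapping stays valid after the fd is closed
        // The `data` argument later passed to buffer_set_tensor is just
        // base + offset: a pointer into the mapping, whose pages are only
        // faulted in from the file when something touches them.
        return (const char *) base + tensor_offset;
    }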

@zhouwg (Contributor, Author) commented Apr 7, 2025

No, when loading a model with mmap enabled, the pointer received by this function may point within the memory-mapped model file. Apparently the Intel driver was not able to copy data to VRAM from a memory-mapped file, so this copy was added as a workaround.

So this was my misunderstanding, and there are special technical details in SYCL's internals. Should we revert this PR accordingly?

If this PR is reverted, explanatory comments should be added to this function, because I really didn't understand why there was a host-to-host copy here before I submitted this PR.

@slaren (Member) commented Apr 7, 2025

This is where the copy was added: #6622
Maybe the Intel driver bug has been fixed and it is no longer necessary?

@zhouwg (Contributor, Author) commented Apr 7, 2025

This is where the copy was added: #6622 Maybe the Intel driver bug has been fixed and it is no longer necessary?

You are a high-speed, encyclopedic computer; I hope that bug has been fixed in the Intel driver.

By the way, could you take a moment to look at my PR about the Qualcomm NPU and provide some guidance on it?

@slaren (Member) commented Apr 7, 2025

I am sorry, I don't know anything about QNN or Qualcomm NPUs, and cannot give you any advice about that. From a quick look, the ggml-backend interface implementation looks ok.

@zhouwg (Contributor, Author) commented Apr 7, 2025

No, when loading a model with mmap enabled, the pointer received by this function may point within the memory-mapped model file. Apparently the Intel driver was not able to copy data to VRAM from a memory-mapped file, so this copy was added as a workaround.

Why do you say "apparently the Intel driver was not able to copy data to (GPU) VRAM from a memory-mapped file" when they are both in the same process address space? Could you provide a clearer explanation? Thanks so much!

@slaren (Member) commented Apr 7, 2025

I don't know the details. Memory-mapped files rely on page faults to allocate physical memory and fill it with the contents of the file on demand; my guess is that the driver is accessing the memory in a way that interferes with this mechanism.
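
If that guess is right, the staged memcpy would help because the CPU touches, and therefore faults in, every page before the driver ever sees the memory. A hedged sketch of the same effect done explicitly (the prefault helper is hypothetical, not llama.cpp code):

    #include <cstddef>
    #include <cstdint>

    // Touch one byte per page so the kernel faults the file contents into RAM;
    // after this, the region behaves like ordinary resident memory.
    void prefault_pages(const void *p, size_t n, size_t page_size = 4096) {
        const volatile uint8_t *bytes = (const volatile uint8_t *) p;
        uint8_t sink = 0;
        for (size_t off = 0; off < n; off += page_size) {
            sink ^= bytes[off];
        }
        (void) sink;
    }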

@zhouwg (Contributor, Author) commented Apr 8, 2025

An interesting problem: pointers from mmap and from malloc are both virtual addresses in the userspace address space, so I don't understand why the driver can handle a pointer from malloc but not a pointer from mmap.

cupy/cupy#3431

@NeoZhangJianyu (Collaborator)

Yes, this fix was created by me; I had forgotten about it.

The reason is that mmap() can't work well on the Intel PVC GPU, so we have to memcpy from the mmap region to a host buffer, and then from the host buffer to the device.

This PR should be reverted.

@NeoZhangJianyu (Collaborator)

@zhouwg
Sorry, I suggest reverting this PR.

What do you think?

@zhouwg (Contributor, Author) commented Apr 8, 2025

No problem, I also agree we should revert this commit.

Maybe we can add what you mentioned as a comment in this function:
"mmap() can't work well on the Intel PVC GPU, so we have to memcpy from the mmap region to a host buffer, and then from the host buffer to the device."

One more thing: could we use pinned memory (host memory intended for faster transfers between CPU system memory and device memory) here to avoid the dynamic malloc/free? (A sketch of the idea follows.)
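
As a hedged sketch of that idea (an illustration, not proposed code for the PR): SYCL 2020 exposes pinned host memory through host USM, so a context could keep one grow-only staging buffer instead of calling malloc/free per tensor. Whether the lifetime complexity is worth it is exactly what the reply below weighs.

    #include <sycl/sycl.hpp>
    #include <cstring>

    struct staging_buffer {
        sycl::queue &q;
        void  *ptr  = nullptr;
        size_t size = 0;

        explicit staging_buffer(sycl::queue &queue) : q(queue) {}
        ~staging_buffer() { if (ptr) sycl::free(ptr, q); }

        void *reserve(size_t n) {
            if (n > size) {                      // grow-only: reallocate when too small
                if (ptr) sycl::free(ptr, q);
                ptr  = sycl::malloc_host(n, q);  // pinned (page-locked) host memory
                size = n;
            }
            return ptr;
        }
    };

    void set_tensor_pinned(staging_buffer &stage, void *dst, const void *data, size_t n) {
        void *host_buf = stage.reserve(n);
        std::memcpy(host_buf, data, n);           // mmap/host -> pinned host
        stage.q.memcpy(dst, host_buf, n).wait();  // pinned host -> device
    }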

@NeoZhangJianyu (Collaborator) commented Apr 8, 2025

Yes, we need to add such a comment to record the reason.

This function is called while loading the model from disk. Replacing the dynamic allocation with a persistent buffer won't save much time, and it won't affect inference speed.

But it would increase risk: we don't know when to release the buffer, and it might never be released at all. And if a new model is loaded and the existing buffer is too small, it would need to be reallocated again; a little complex.

@zhouwg (Contributor, Author) commented Apr 8, 2025

Your explanation is very clear, thanks so much.

Now I clearly understand why a dynamic malloc/free was used here as a workaround.

@NeoZhangJianyu (Collaborator)

Thank you very much!
I created the revert PR and added a note explaining the reason: #12812

@zhouwg (Contributor, Author) commented Apr 8, 2025

Thank you very much! I created the revert PR and added a note explaining the reason: #12812

Thanks so much, and sorry for the trouble I caused you and your team.

Adding the following comments, which you mentioned, to this function would make it clearer to other developers (although they might be redundant for those who already know this deeply, such as slaren):

"This function is called while loading the model from disk. Replacing the dynamic allocation with a persistent buffer would not save much time and would bring a potential memory-leak risk."

@NeoZhangJianyu (Collaborator)

Yes, I will update the PR with the above comment.

@characharm (Contributor)

Actually, it seems like Intel has been tweaking something in their driver related to memory handling. On the second-to-last driver version, I was getting a lot of OOM events, PC freezes, and black screens with missing rendered elements. This happened with the same llama.cpp settings and models that used to work fine before, regardless of the backend. With the latest driver, the crashes seem to be gone. I'm only referring to Windows. I tested SYCL yesterday with the latest release — didn’t test extensively, but everything seemed to work.

@NeoZhangJianyu (Collaborator)

The driver comes in stable/LTS and rolling versions.
You could use the LTS version to avoid the abnormal cases above.

@zhouwg (Contributor, Author) commented Apr 10, 2025

I don't know the details. Memory-mapped files rely on page faults to allocate physical memory and fill it with the contents of the file on demand; my guess is that the driver is accessing the memory in a way that interferes with this mechanism.

I think you know the details. You are such a master of humor and a high-speed, encyclopedic computer, and I know you can influence Qualcomm's policy (what you said last year really helped many developers using Qualcomm's SDK).

tastelikefeet added a commit to tastelikefeet/llama.cpp that referenced this pull request Apr 10, 2025
* master: (123 commits)
  cuda : add f32 to bf16 copy op (ggml-org#12806)
  llava: improve clip_ctx destructor to not memleak load_image_size (ggml-org#12834)
  llama : fix FA when KV cache is not used (i.e. embeddings) (ggml-org#12825)
  server : fix thread.join() on exit (ggml-org#12831)
  llava: add more helper functions to check projector types in clip context (ggml-org#12824)
  arg : Including limits file on AIX (ggml-org#12822)
  server : webui : Improve Chat Input with Auto-Sizing Textarea (ggml-org#12785)
  Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (ggml-org#12812)
  gguf-py : support lazy tensor splitting (ggml-org#12809)
  llama : Support llama 4 text-only (ggml-org#12791)
  opencl: better identify Adreno GPU (ggml-org#12760)
  hellaswag: display estimated score confidence interval (ggml-org#12797)
  cuda : fix HIP and MUSA BF16 (#0)
  sync : ggml
  ggml : simplify Arm fp16 CPU logic (ggml/1177)
  CUDA: don't convert BF16 weights to FP32 (ggml/1174)
  cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
  sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (ggml-org#12734)
  ci : no curl on ggml-ci (ggml-org#12796)
  cmake : enable curl by default (ggml-org#12761)
  ...

# Conflicts:
#	common/arg.cpp
#	common/common.cpp
#	common/common.h