sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor #12734
Conversation
SYCL_CHECK(
    CHECK_TRY_ERROR(dpct::dev_mgr::instance().get_device(ctx->device).queues_wait_and_throw()));
That makes sense to me. I think this wait before copying should also not be needed. Looking at the other backends, it seems to be llama's responsibility to ensure the data is available on the host before calling buffer_set_tensor. Do you think we could remove that as part of this PR too?
I think you are right; we discussed a similar issue in #12580 and the tensor memset API didn't need the wait. Memcpy should be equivalent.
Merging as is and we'll try to investigate the waits separately.
Yes, the host-to-host copy is not needed. I remember some wait() calls are mandatory in this cpp file; removing them leads to wrong results.
IIRC, this copy existed to work around an issue with Intel drivers reading data from a memory-mapped file.
Is this copy related to some cache?
No, when loading a model with mmap enabled, the pointer received by this function may be a pointer within the memory-mapped model file. Apparently the Intel driver was not able to copy data to VRAM from a memory-mapped file, so this copy was added as a workaround.
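The workaround described above can be sketched roughly as follows. This is an illustrative, plain C++ simplification, not the actual SYCL implementation: `fake_device_copy` is a hypothetical stand-in for a driver call such as a queue memcpy, and the function name is invented for the sketch.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the driver's host-to-device transfer,
// which reportedly failed when given a pointer into a memory-mapped file.
static void fake_device_copy(void *dst, const void *src, size_t n) {
    std::memcpy(dst, src, n);
}

// Workaround pattern from the discussion: stage the (possibly
// mmap-backed) source through a freshly malloc'ed host buffer, so the
// driver only ever sees ordinary heap memory.
static void set_tensor_with_staging(void *device_dst, const void *host_src, size_t n) {
    void *staging = std::malloc(n);
    std::memcpy(staging, host_src, n);        // the "redundant" host-to-host copy
    fake_device_copy(device_dst, staging, n); // host-to-device copy
    std::free(staging);
}
```

Without the driver issue, `set_tensor_with_staging` could simply call `fake_device_copy(device_dst, host_src, n)` directly, which is what this PR proposed.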
So this was my misunderstanding, and there are special technical details internal to the SYCL backend. Should we revert this PR accordingly? If so, internal comments should be added to this function, because I really didn't understand why there was a host-to-host copy here before I submitted this PR.
This is where the copy was added: #6622
You are impressively fast and knowledgeable; I hope that bug has been fixed in the Intel driver. By the way, could you take a moment to look at my PR about the Qualcomm NPU and provide some guidance on it?
I am sorry, I don't know anything about QNN or Qualcomm NPUs, and cannot give you any advice about that. From a quick look, the ggml-backend interface implementation looks ok.
Why did you say "apparently the Intel driver was not able to copy data to (GPU) VRAM from a memory mapped file" when both pointers are in the same process address space? Could you provide a clearer explanation? Thanks so much!
I don't know the details. Memory-mapped files rely on page faults to allocate physical memory and fill it with the contents of the file on demand; my guess is that the driver is accessing the memory in a way that interferes with this mechanism.
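To make the point above concrete: from user space, an mmap'ed pointer is an ordinary virtual address that works fine with a plain memcpy; the demand-paging machinery is invisible at this level, and a driver that pins pages for DMA operates below it. A minimal POSIX-only sketch (the helper name and temp-file path are illustrative):

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

// Map a file read-only and copy its first n bytes out with memcpy.
// The first touch of the mapping triggers a page fault that the kernel
// services by reading the file; ordinary user-space code never notices.
static std::string read_via_mmap(const char *path, size_t n) {
    int fd = open(path, O_RDONLY);
    assert(fd >= 0);
    void *p = mmap(nullptr, n, PROT_READ, MAP_PRIVATE, fd, 0);
    assert(p != MAP_FAILED);
    std::string out(n, '\0');
    std::memcpy(&out[0], p, n);  // plain host-to-host copy from the mapping
    munmap(p, n);
    close(fd);
    return out;
}
```

A driver-side problem would only appear in how the device stack handles file-backed pages (e.g. when pinning them for DMA), not in anything this user-space code can observe.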
Interesting problem: the pointers from mmap and malloc are both virtual addresses in the userspace address space. I don't understand why the driver can handle a pointer from malloc but not a pointer from mmap.
Yes, this fix was created by me. The reason is that mmap() can't work well on the Intel PVC GPU. This PR should be reverted.
@zhouwg What do you think?
No problem, I also agree we should revert this commit. Maybe we can add what you mentioned as a comment in this function. One more thing: could we use pinned memory (host CPU memory intended for faster transfers between CPU system memory and device memory) here to avoid the dynamic malloc/free?
Yes, we need to add such a comment to record the reason. This function is only called while loading the model from disk, so it won't impact inference speed. If a new model is loaded, the buffer size may be smaller and the buffer would need to be reallocated; that is a little complex.
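One common way to handle the reallocation concern above is a grow-only staging buffer that is reused across calls and only reallocated when a larger tensor arrives. A minimal sketch, assuming plain `std::realloc` as a stand-in for a pinned host allocation (e.g. `sycl::malloc_host`); the `StagingBuffer` type is invented for illustration and is not part of the backend:

```cpp
#include <cassert>
#include <cstdlib>

// Grow-only staging buffer reused across set_tensor calls.
// std::realloc stands in for a pinned allocation here; in a real
// backend this object would live in the backend/buffer context.
struct StagingBuffer {
    void  *ptr  = nullptr;
    size_t size = 0;

    // Return a buffer of at least n bytes, reallocating only on growth.
    void *reserve(size_t n) {
        if (n > size) {
            ptr  = std::realloc(ptr, n);
            size = n;
        }
        return ptr;
    }

    ~StagingBuffer() { std::free(ptr); }
};
```

With this pattern, loading a smaller model later simply reuses the existing larger buffer, sidestepping the shrink-and-realloc complexity mentioned above at the cost of holding the peak allocation.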
Your explanation is very clear, thanks so much. Now I clearly understand why a dynamic malloc/free was used here as a workaround.
Thank you very much!
Thanks so much, and sorry for the trouble I caused you and your team. Adding the comment you mentioned to this function would make it clearer to other developers (even though it might be redundant for those, like slaren, who know the code deeply): this function will be called during model load from disk.
Yes, please update the PR with the above comment.
Actually, it seems like Intel has been tweaking something in their driver related to memory handling. On the second-to-last driver version, I was getting a lot of OOM events, PC freezes, and black screens with missing rendered elements. This happened with the same llama.cpp settings and models that used to work fine before, regardless of the backend. With the latest driver, the crashes seem to be gone. I'm only referring to Windows. I tested SYCL yesterday with the latest release; I didn't test extensively, but everything seemed to work.
The driver comes in stable/LTS and rolling release versions.
I think you do know the details. You are fast, knowledgeable, and have a great sense of humor, and what you said last year about Qualcomm's SDK really helped many developers who use it; I hope you can help change the policy from Qualcomm.
* master: (123 commits)
  cuda : add f32 to bf16 copy op (ggml-org#12806)
  llava: improve clip_ctx destructor to not memleak load_image_size (ggml-org#12834)
  llama : fix FA when KV cache is not used (i.e. embeddings) (ggml-org#12825)
  server : fix thread.join() on exit (ggml-org#12831)
  llava: add more helper functions to check projector types in clip context (ggml-org#12824)
  arg : Including limits file on AIX (ggml-org#12822)
  server : webui : Improve Chat Input with Auto-Sizing Textarea (ggml-org#12785)
  Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (ggml-org#12812)
  gguf-py : support lazy tensor splitting (ggml-org#12809)
  llama : Support llama 4 text-only (ggml-org#12791)
  opencl: better identify Adreno GPU (ggml-org#12760)
  hellaswag: display estimated score confidence interval (ggml-org#12797)
  cuda : fix HIP and MUSA BF16 (#0)
  sync : ggml
  ggml : simplify Arm fp16 CPU logic (ggml/1177)
  CUDA: don't convert BF16 weights to FP32 (ggml/1174)
  cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
  sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (ggml-org#12734)
  ci : no curl on ggml-ci (ggml-org#12796)
  cmake : enable curl by default (ggml-org#12761)
  ...

# Conflicts:
#	common/arg.cpp
#	common/common.cpp
#	common/common.h
1. The logic here should be similar to the default ggml backend, or exactly the same as ggml-cann (copy data from host to device), so I think this memcopy is redundant and may add some overhead to this function. Of course, this could be my misunderstanding if there are special technical details internal to the SYCL backend.
2. I think we should avoid dynamic memory allocation/free in such a performance-sensitive scenario.