
Does MInference support CUDA 11.8? #56

Open
hensiesp32 opened this issue Jul 29, 2024 · 4 comments
Assignees
Labels
question Further information is requested

Comments

@hensiesp32

Describe the issue

I am wondering whether MInference supports CUDA 11.8. Our devices don't support CUDA 12.3.

@hensiesp32 hensiesp32 added the question Further information is requested label Jul 29, 2024
@iofu728 iofu728 self-assigned this Jul 30, 2024
@iofu728
Contributor

iofu728 commented Jul 30, 2024

Hi @hensiesp32, thanks for your interest in MInference.

It supports CUDA 11.8. We have released the wheel for CUDA 11.8 at this link. If you have any questions, feel free to leave a comment here.

@hensiesp32
Author

hensiesp32 commented Aug 1, 2024

Thanks for your reply. I want to run the needle-in-a-haystack experiment, but I only used one A100-80G, and when the context length reached 300K, an OOM error occurred. I then enabled kv_cache_cpu, but got this error:

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions

So I want to know how you tested needle-in-a-haystack with a 1M context length. Or can we run it on multiple GPUs?

@hensiesp32
Author

I ran experiments/benchmarks, but the results showed that MInference doesn't provide a speedup. I used 4 A100-80G GPUs; the results are shown below:
[screenshot: benchmark results]

@iofu728
Contributor

iofu728 commented Aug 5, 2024

Hi @hensiesp32,

  1. For the benchmark test, the results don't seem right, especially for StreamingLLM. Did you use vLLM for the measurements? Our experiments were conducted on a single A100 using HF or vLLM (details at https://github.com/microsoft/MInference/tree/main/experiments#minference-benchmark-experiments), and I've received feedback that the corresponding kernels aren't replaced in multi-GPU setups. Could you test it on a single A100 for now? We will support multi-GPU mode in the future.

  2. When testing Needle in a Haystack, I used kv_cache_cpu for contexts over 200K. However, this requires enough CPU memory on your machine: around 300 GB for 1M tokens.
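To see why contexts this long exhaust an 80 GB GPU and need hundreds of gigabytes of host memory, here is a back-of-the-envelope KV-cache sizing sketch. The config values below (32 layers, 8 KV heads via GQA, head dim 128, fp16) are illustrative assumptions for a LLaMA-3-8B-style model, not numbers taken from MInference itself:

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Estimate KV-cache size for a decoder-only transformer.

    The leading 2x accounts for the separate key and value tensors
    stored per layer; bytes_per_elem=2 assumes fp16/bf16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

for tokens in (300_000, 1_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:6.1f} GiB of KV cache")
```

Under these assumptions the cache alone is roughly 37 GiB at 300K tokens and 122 GiB at 1M tokens, on top of the model weights and activations. A model without GQA (e.g. 32 KV heads) would need about 4x more, so the ~300 GB figure above depends heavily on the model architecture and any extra buffers involved in offloading.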
