[RFC] Initial Support for CPUs #3654

Open
1 of 4 tasks
bigPYJ1151 opened this issue Mar 27, 2024 · 9 comments
Comments

@bigPYJ1151
Contributor

bigPYJ1151 commented Mar 27, 2024

Progress

Features

The CPU executor plans to support the following features (a minimal usage sketch follows the list):

  • Basic models of vLLM with FP16/BF16/FP32, except MoE models
  • Tensor-parallel model inference based on Ray
  • AWQ quantization, 8-bit KVCache Quantization
  • Others
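
For illustration, here is a minimal offline-inference sketch using vLLM's standard Python API. This is only a hypothetical usage example: the model name is a placeholder, and which dtypes and model families actually run on CPU follows the roadmap above.

# Hypothetical usage sketch, not an officially supported example: offline
# inference through vLLM's standard Python API, assuming a CPU build of vLLM.
from vllm import LLM, SamplingParams

# Placeholder model; BF16 ties in with the FP16/BF16/FP32 item above.
llm = LLM(model="facebook/opt-125m", dtype="bfloat16")
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)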

Design

Our target is to seamlessly port vLLM to CPU devices and to share most of vLLM's core components (e.g., scheduler, cache management, model definitions, Megatron-style model partitioning, ...).

The CPU executor will depend on PyTorch CPU and leverage optimized kernels and features from intel-extension-for-pytorch.
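
As a rough illustration of that dependency (not the actual vLLM integration code), intel-extension-for-pytorch is typically applied to a PyTorch CPU model like this; the model name is just a placeholder:

# Illustrative sketch only: how intel-extension-for-pytorch is commonly applied
# to a CPU model. This is not vLLM code; the CPU executor uses IPEX kernels
# internally rather than this user-facing API.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").eval()

# ipex.optimize applies CPU-oriented optimizations such as operator fusion
# and weight prepacking, optionally casting the module to BF16.
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    input_ids = torch.randint(0, 1000, (1, 16))
    logits = model(input_ids).logits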

The main changes to vLLM include:

Torch APIs Adaption

The CPU device is supported in PyTorch by default, which allows the CPU executor to share the same model definitions with the GPU executor. Thanks to recent code refactors, many hardcoded cuda device flags have been removed, and Torch APIs are dispatched based on the device flag from DeviceConfig. For the CPU executor, a new cpu device flag is added.

Sharing the same model definitions and Torch APIs also allows the CPU executor to easily support new models and features in vLLM (e.g., torch.compile).
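
A simplified sketch of that dispatch pattern follows; the class and helper names here are illustrative, not the exact vLLM DeviceConfig implementation.

# Simplified sketch of device-flag dispatch; names are illustrative only.
from dataclasses import dataclass
import torch

@dataclass
class DeviceConfig:
    device_type: str = "cuda"   # "cuda" or "cpu"

    @property
    def device(self) -> torch.device:
        return torch.device(self.device_type)

def make_kv_cache(num_blocks: int, block_size: int, num_heads: int,
                  head_size: int, device_config: DeviceConfig) -> torch.Tensor:
    # The same model/runner code works on GPU and CPU because the device
    # comes from the config instead of a hardcoded "cuda" literal.
    return torch.empty(num_blocks, block_size, num_heads, head_size,
                       dtype=torch.bfloat16, device=device_config.device)

cpu_cache = make_kv_cache(128, 16, 12, 64, DeviceConfig(device_type="cpu"))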

Custom Ops Adaption

vLLM implements many efficient CUDA kernels, packaged as the _C library via pybind. These kernels are ported to CPU using C++ and OpenMP, keeping the same function signatures so they can replace the CUDA kernels directly. The CPU custom kernel build procedure is integrated into vLLM's CMake build system as a CMake module.

Currently, all of the CPU kernels require AVX512 ISA support.
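
To illustrate why identical signatures matter, here is a pure-PyTorch reference with the in-place, out-parameter signature style shared by the CUDA kernels and their CPU/OpenMP ports. The op mirrors the semantics of vLLM's activation kernel, but treat the code as a hypothetical sketch rather than the real _C binding.

# Hypothetical sketch: a pure-PyTorch reference for an op with the same
# out-parameter signature as the compiled kernels. The real kernels live in
# csrc/ and are bound through the _C pybind module; names here are placeholders.
import torch

def silu_and_mul(out: torch.Tensor, x: torch.Tensor) -> None:
    """out = silu(x[..., :d]) * x[..., d:], where d = x.shape[-1] // 2."""
    d = x.shape[-1] // 2
    out.copy_(torch.nn.functional.silu(x[..., :d]) * x[..., d:])

# Because the CPU kernel keeps this exact signature, the Python call site is
# unchanged whether the _C extension was built for CUDA or for CPU (OpenMP).
x = torch.randn(4, 2 * 128)
out = torch.empty(4, 128)
silu_and_mul(out, x)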

Python APIs Adaption

A new CPUExecutor and CPUWorker are added to initialize the environment and the model runner. CPUModelRunner is derived from the GPU code path's ModelRunner because most of the code can be shared. Even though this carries some risk from changes in the GPU code path, CPUModelRunner can easily absorb them by rewriting configurations or overriding member functions.

Notably, unlike the GPU executor, which profiles the available KV cache memory, the cache memory in the CPU executor is specified by the swap_space parameter, because CPU memory management is more complex than GPU memory management (e.g., NUMA).
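
A back-of-the-envelope sketch (all names and numbers are hypothetical) of how the CPU block count can be derived directly from the swap_space budget instead of from a memory-profiling pass:

# Hypothetical sketch of sizing the CPU KV cache from a user-specified
# swap_space budget rather than profiling free device memory as the GPU
# executor does. All names and values here are illustrative.
def num_cpu_cache_blocks(swap_space_gib: float,
                         block_size: int,       # tokens per block
                         num_layers: int,
                         num_kv_heads: int,
                         head_size: int,
                         dtype_bytes: int = 2   # BF16/FP16
                         ) -> int:
    budget_bytes = int(swap_space_gib * (1 << 30))
    # Each block stores keys and values for every layer.
    bytes_per_block = 2 * num_layers * block_size * num_kv_heads * head_size * dtype_bytes
    return budget_bytes // bytes_per_block

# Example: a 4 GiB budget for an OPT-125m-sized configuration.
print(num_cpu_cache_blocks(swap_space_gib=4, block_size=16,
                           num_layers=12, num_kv_heads=12, head_size=64))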

@hyperbolic-c

Thanks for your excellent work! Looking forward to support for inference on ARM CPUs and, further, support for Ray distributed computing.

@hiGiraffe

Could you give me a CPU inference example? I tried:

# start
python3 -m vllm.entrypoints.openai.api_server \
--device cpu

# input
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "facebook/opt-125m",
        "messages": [
            {"role": "system", "content": "You are an intelligent British female writer and translator who is good at writing science fiction using multiple languages. You won a Nobel price in literature five years ago."},
            {"role": "user", "content": "Please detailedly tell a story about an exciting aerospace expedition for a Chinese boy Lam and his German dog. They are sent to aerospace by mistake and strive to wait for rescue from motherland with no water and food supply for over a month. They are almost caught by aliens disguised as his mother. Moreover, please translate the above story to Chinese, German, French, Portuguese and Japanese respectively."}
        ], "temperature": 0
    }'

But I got an error. Are there any engine arguments that need to be added here?

@mgiessing

@bigPYJ1151 Are you planning to support AVX/AVX2 to enable a broader range of Intel/x86 CPUs?

@bigPYJ1151
Contributor Author

Hi @mgiessing, it is not in our plan right now, but we may add it after the basic features are finished.

@kannon92

kannon92 commented Jun 4, 2024

Could you help with #4415?

I was trying to compile it with the Intel compiler, but I had some issues; I think I almost have it working.

@hmellor
Collaborator

hmellor commented Aug 6, 2024

Hi @bigPYJ1151,

I'd like to ask why the initial CPU support defines device-specific vector types in https://github.com/vllm-project/vllm/blob/main/csrc/cpu/cpu_types_x86.hpp?

PyTorch contains a vector type, Vectorized, that appears to serve the same purpose while also being architecture-agnostic. Could the custom ops for CPU switch to using this PyTorch type to make the CPU backend architecture-agnostic (i.e., support PowerPC, AArch64, etc.)?

@bigPYJ1151
Contributor Author

bigPYJ1151 commented Aug 7, 2024

Hi @hmellor

Yes, PyTorch contains such vector structures, and it is feasible to use them in the CPU backend. I wasn't aware of them before, so I defined the custom types 🤣. vLLM is adopting torch.compile and some custom ops will be generated by JIT, so the number of custom types will be very limited after we clean them up. Then we can try to replace them with the PyTorch vectors.

@hmellor
Collaborator

hmellor commented Aug 7, 2024

That's great to hear! Is it just #7110 that we're waiting for, or are there other PRs?

@bigPYJ1151
Contributor Author

Yes, after #7110 I think we can do some code refactoring.
