25 changes: 21 additions & 4 deletions README.md
@@ -78,17 +78,34 @@ print(outputs[0]["generated_text"][-1])

#### vLLM

vLLM recommends using [`uv`](https://docs.astral.sh/uv/) for Python dependency management. You can use vLLM to spin up an OpenAI-compatible web server; the commands below install the dependencies and then start the server, downloading the model automatically on first launch.

**If your container/environment ALREADY HAS CUDA libraries pre-installed**:

```bash
uv pip install vllm==0.11.0 huggingface_hub[hf_transfer]==0.35.0 flashinfer-python==0.3.1
```

No extra steps are required: vLLM will detect your CUDA setup and manage the correct torch version automatically.
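To confirm that the resolved torch wheel is actually CUDA-enabled before serving, a quick sanity check such as the following can help (a sketch; it only assumes `torch` is importable in the active environment):

```bash
# Verify that the installed torch build can see a GPU before starting vLLM.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```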
Comment on lines +82 to +88


**P1:** Installing vLLM without a CUDA wheel leaves torch CPU-only

The new "CUDA already installed" path tells users to run uv pip install vllm==0.11.0 ... with no extra index, claiming vLLM will “detect your CUDA setup and manage the correct torch version automatically”. PyPI only ships CPU-only torch wheels; without the download.pytorch.org index, pip installs the CPU build even when CUDA libraries are present. Launching vllm serve after this install fails with Torch not compiled with CUDA enabled or runs on CPU, defeating the purpose for most GPU containers. The docs should still instruct users to install a CUDA-enabled torch wheel (e.g. via the extra index or explicit torch==...+cuXXX).



**If your environment DOES NOT have CUDA libraries installed** (e.g., plain Ubuntu, minimal Python install, or a non-CUDA VM):

```bash
uv pip install vllm==0.11.0 \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match \
    huggingface_hub[hf_transfer]==0.35.0 \
    flashinfer-python==0.3.1
```

You may need to change `cu128` to match your system CUDA version (e.g., `cu121` or `cu118`).
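To see which CUDA version your system supports (and hence which `cuXXX` tag to pick), you can query the driver or toolkit directly; this assumes the NVIDIA driver (and optionally the CUDA toolkit) is installed:

```bash
# The "CUDA Version" field in the nvidia-smi header is the highest CUDA runtime the driver supports.
nvidia-smi
# If the CUDA toolkit is installed, nvcc reports the toolkit version.
nvcc --version
```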

**Serve the model:**

```bash
vllm serve openai/gpt-oss-20b
```
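Once the server is running, any OpenAI-compatible client can talk to it. A minimal `curl` sketch, assuming vLLM's default address `http://localhost:8000` and no API key:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```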

> **Tip:** For most cloud or Docker GPU setups, use the first install command (no extra index). If you encounter CUDA or torch import errors on a bare-metal system, use the second install command.

[Learn more about how to use gpt-oss with vLLM.](https://cookbook.openai.com/articles/gpt-oss/run-vllm)

Offline Serve Code: