Get the vLLM Performance Benchmark Suite running in 5 minutes.
First, verify the prerequisites:

```bash
# Check Python version (requires 3.10+)
python3 --version

# Check NVIDIA driver
nvidia-smi

# Check if vLLM server is accessible
curl http://localhost:8000/v1/models
```

Clone the repository:

```bash
git clone https://github.com/yourusername/vllm-benchmark-suite.git
cd vllm-benchmark-suite
```

Using uv (recommended - fastest):
```bash
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create venv and install dependencies
uv venv --python 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
```

Using pip:
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Start the vLLM server (if it is not already running):

```bash
# Example: Start Qwen3-Next-80B
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --port 8000 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.95
```

Run the benchmark:

```bash
python benchmark_qwen3_contexts_professional.py
```

Expected runtime: 30-45 minutes for the full benchmark (40 test configurations).
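Each test configuration boils down to a timed request against the server's OpenAI-compatible completions endpoint. The sketch below shows the idea for a single request; the endpoint path and payload fields follow the OpenAI API, the model name is taken from the serve command above, and both helper functions are illustrative, not part of the suite:

```python
import json
import time
import urllib.request

API_BASE_URL = "http://localhost:8000"
MODEL = "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8"  # assumption: model from the serve command

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generation throughput of one request."""
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0

def run_single_request(prompt: str, max_tokens: int = 500) -> dict:
    """Time one completion request and report latency and throughput."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        f"{API_BASE_URL}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=600) as resp:
        body = json.loads(resp.read())
    elapsed = time.perf_counter() - start
    # OpenAI-compatible responses report token counts under "usage"
    completion_tokens = body["usage"]["completion_tokens"]
    return {
        "latency_s": elapsed,
        "tokens_per_second": tokens_per_second(completion_tokens, elapsed),
    }

# Usage (requires a running server):
# print(run_single_request("Summarize the history of GPUs.", max_tokens=100))
```

The suite runs many of these concurrently per configuration and aggregates the results.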
Sample output:

```
[Progress: 1/40]
Testing: 1,000 token context | 1 concurrent users
===================================
Results:
  Total time: 12.45s
  Successful: 1/1

Latency Metrics:
  Average: 12.45s

Throughput Metrics:
  Tokens/second: 40.2

GPU Metrics:
  Avg utilization: 85.3%
  Avg memory used: 94215 MB (92.0 GB)
```
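The GPU numbers in the report come from sampling the GPU during the run. You can collect the same figures yourself with `nvidia-smi`'s query mode; here is a sketch of one sample and its parsing (the suite's actual collection method may differ):

```python
import subprocess

# Standard nvidia-smi query flags: one CSV line per GPU, no units
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]

def parse_gpu_sample(csv_text: str) -> list[dict]:
    """Parse one nvidia-smi query sample into per-GPU dicts."""
    samples = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        samples.append({
            "utilization_pct": float(util),
            "memory_used_mb": float(mem),
        })
    return samples

def sample_gpus() -> list[dict]:
    """Run nvidia-smi once and parse it (requires NVIDIA drivers)."""
    out = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_gpu_sample(out.stdout)

# Example: parse_gpu_sample("85, 94215\n") describes one GPU at
# 85% utilization with ~92 GB of memory in use.
```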
Three outputs are generated:

- `benchmark_results_TIMESTAMP.json`
  - Raw data for further analysis
  - Import into pandas/Excel for custom analysis
- `benchmark_TIMESTAMP.png`
  - 12+ performance visualization charts
  - Publication-ready at 300 DPI
- Console summary tables
  - Optimal configurations
  - Peak performance metrics
  - Context scaling analysis
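The JSON file is easy to post-process. A sketch of loading it and picking the highest-throughput run follows; the record field names are illustrative assumptions, so check the actual file for the exact schema:

```python
import json

def load_runs(path: str) -> list[dict]:
    """Load the raw benchmark records from the results JSON."""
    with open(path) as f:
        return json.load(f)

def best_config(runs: list[dict]) -> dict:
    """Pick the run with the highest tokens/second.

    Assumes each record carries a 'tokens_per_second' field
    (illustrative schema, not the suite's documented one).
    """
    return max(runs, key=lambda r: r["tokens_per_second"])

# Usage (hypothetical filename):
# runs = load_runs("benchmark_results_TIMESTAMP.json")
# print(best_config(runs))
```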
Edit `benchmark_qwen3_contexts_professional.py`:

```python
# Minimal test (4 configs, ~5 minutes)
context_lengths = [1000, 32000, 128000, 256000]
concurrent_users = [1]

# Balanced test (12 configs, ~15 minutes)
context_lengths = [1000, 32000, 64000, 128000, 192000, 256000]
concurrent_users = [1, 5]
```
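The configuration count is simply the cross product of the two lists, which is where the counts in the comments come from; a quick check:

```python
from itertools import product

def configurations(context_lengths, concurrent_users):
    """Every (context_length, users) pair the benchmark will run."""
    return list(product(context_lengths, concurrent_users))

minimal = configurations([1000, 32000, 128000, 256000], [1])
balanced = configurations(
    [1000, 32000, 64000, 128000, 192000, 256000], [1, 5]
)
print(len(minimal), len(balanced))  # 4 12
```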
Adjust the response length:

```python
# Default: 500 tokens
output_tokens = 500

# Short responses (faster)
output_tokens = 100

# Long responses (stress test)
output_tokens = 2000
```

Point the benchmark at a different server:

```python
API_BASE_URL = "http://localhost:8000"  # default
```
```python
# Remote server
API_BASE_URL = "http://192.168.1.100:8000"

# Different port
API_BASE_URL = "http://localhost:9000"
```

Troubleshooting:

Benchmark cannot reach the server? Solution: Start the vLLM server first, then verify it responds:

```bash
curl http://localhost:8000/v1/models
```

Python imports fail? Solution: Activate the virtual environment and install dependencies:

```bash
source venv/bin/activate
pip install -r requirements.txt
```

`nvidia-smi` not found? Solution: Install NVIDIA drivers or add the binary to PATH:

```bash
export PATH=$PATH:/usr/bin
```

Requests time out on very long contexts? Solution: Increase the timeout in the script or reduce the context length:

```python
REQUEST_TIMEOUT = 1800  # 30 minutes for very long contexts
```

Next steps:

- Analyze Results: Open the generated PNG to review performance
- Optimize Configuration: Adjust vLLM settings based on findings
- Compare Models: Run benchmarks with different models
- Share Results: Post to community forums or research papers
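For the Compare Models step, once you have two results files you can diff throughput per configuration. A sketch, assuming the same illustrative record fields (`context_length`, `concurrent_users`, `tokens_per_second`) as above rather than a documented schema:

```python
def throughput_diff(runs_a: list[dict], runs_b: list[dict]) -> dict:
    """Percent change in tokens/second from model A to model B,
    keyed by (context_length, concurrent_users).

    Record fields are illustrative assumptions about the
    results JSON, not the suite's documented schema.
    """
    def key(r):
        return (r["context_length"], r["concurrent_users"])

    a = {key(r): r["tokens_per_second"] for r in runs_a}
    b = {key(r): r["tokens_per_second"] for r in runs_b}
    # Only compare configurations present in both runs
    return {
        k: 100.0 * (b[k] - a[k]) / a[k]
        for k in a.keys() & b.keys()
        if a[k] > 0
    }
```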
Need help?

- Check the full README.md for detailed documentation
- Review Troubleshooting section
- Open an issue on GitHub
- Join the vLLM Discord community
Common commands:

Full benchmark:

```bash
python benchmark_qwen3_contexts_professional.py
```

Single-user latency run:

```bash
# Edit script to set: concurrent_users = [1]
python benchmark_qwen3_contexts_professional.py
```

High-concurrency stress run:

```bash
# Edit script to set: concurrent_users = [10, 20, 50]
python benchmark_qwen3_contexts_professional.py
```

Monitor the GPU while benchmarking:

```bash
# Terminal 1: Run benchmark
python benchmark_qwen3_contexts_professional.py

# Terminal 2: Watch GPU
watch -n 1 nvidia-smi
```

Enjoy benchmarking!