ATOM (AiTer Optimized Model) is a lightweight vLLM-like inference engine focused on integration and optimization on top of aiter.
- ROCm Optimized: Built on AMD's ROCm platform with torch.compile support
- Model Support: Compatible with Deepseek, Qwen, Llama, and Mixtral.
- Easy Integration: Simple API for quick deployment

Prerequisites:
- AMD GPU with ROCm support
- Docker
Pull and start the ROCm PyTorch container:

```bash
docker pull rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

docker run -it --network=host \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v $HOME:/home/$USER \
    -v /mnt:/mnt \
    -v /data:/data \
    --shm-size=16G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0
```

Inside the container, install aiter:

```bash
pip install aiter -i https://mkmartifactory.amd.com/artifactory/api/pypi/hw-orc3pypi-prod-local/simple
```
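Optionally, before building ATOM, confirm that the container's ROCm build of PyTorch can see the GPU. This is a minimal sanity-check sketch using only standard PyTorch APIs (nothing ATOM-specific):

```python
# Minimal sanity check inside the ROCm container (standard PyTorch APIs only).
import torch

# On ROCm builds of PyTorch, the torch.cuda.* API is backed by HIP,
# and torch.version.hip reports the HIP version (it is None on CUDA builds).
print("HIP version:", torch.version.hip)
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```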
Clone and install ATOM:

```bash
git clone https://github.com/ROCm/ATOM.git
cd ./ATOM
pip install .
```

The default optimization level is 3 (running with torch.compile). Supported models include Deepseek, Qwen, Llama, and Mixtral.
```bash
python -m atom.examples.simple_inference --model meta-llama/Meta-Llama-3-8B
```

Note: First-time execution may take approximately 10 minutes for model compilation.
Profile offline inference:

```bash
python -m atom.examples.profile_offline --model Qwen/Qwen3-0.6B
```

Or profile offline with a custom input length:

```bash
python -m atom.examples.profile_offline --model Qwen/Qwen3-0.6B --random-input --input-length 1024 --output-length 32
```

Profile online inference (after starting the server):

```bash
python -m atom.examples.profile_online
```

Or profile online with a custom input length:

```bash
python -m atom.examples.profile_online --model Qwen/Qwen3-0.6B --random-input --input-length 1024 --output-length 32
```

Or send the start-profile and stop-profile requests directly:
```bash
curl -s -S -X POST http://127.0.0.1:8000/start_profile
curl -s -S -X POST http://127.0.0.1:8000/stop_profile
```
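The same profiling endpoints can also be driven from a script. Below is a minimal sketch, assuming an ATOM server is already listening on 127.0.0.1:8000 and serving Qwen/Qwen3-0.6B (see the server startup commands below); the /v1/completions payload is illustrative:

```python
# Sketch: wrap a client request between start_profile / stop_profile calls.
# Assumes an ATOM OpenAI-compatible server is running on 127.0.0.1:8000 and
# serving Qwen/Qwen3-0.6B; adjust the model name and prompt to your setup.
import requests

BASE = "http://127.0.0.1:8000"

# Start server-side profiling.
requests.post(f"{BASE}/start_profile").raise_for_status()

# Issue the workload to be profiled (illustrative OpenAI-style completion request).
resp = requests.post(
    f"{BASE}/v1/completions",
    json={"model": "Qwen/Qwen3-0.6B", "prompt": "Hello, my name is", "max_tokens": 32},
)
resp.raise_for_status()

# Stop profiling.
requests.post(f"{BASE}/stop_profile").raise_for_status()
```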
Run the online throughput benchmark. First, start the server:

```bash
python -m atom.entrypoints.openai_server --model Qwen/Qwen3-0.6B

# or, for DeepSeek-R1:
python -m atom.entrypoints.openai_server --model deepseek-ai/DeepSeek-R1 -tp 8 --block-size 1
```
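Before benchmarking, you can verify that the server answers completion requests. A minimal sketch using the openai Python client (assumed to be installed separately); the api_key value is a placeholder, since the local server is not expected to enforce authentication:

```python
# Sketch: send one completion request to the local server to check it is serving.
# Assumes the `openai` Python client is installed and the server above is running
# on localhost:8000; the api_key is a placeholder, not a real credential.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="Qwen/Qwen3-0.6B",  # use the model name the server was started with
    prompt="The capital of France is",
    max_tokens=16,
)
print(completion.choices[0].text)
```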
Then run the benchmark:

```bash
MODEL=deepseek-ai/DeepSeek-R1
ISL=1024
OSL=1024
CONC=128
PORT=8000
RESULT_FILENAME=Deepseek-R1-result

python benchmark_serving.py \
    --model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
    --dataset-name=random \
    --random-input-len=$ISL --random-output-len=$OSL \
    --random-range-ratio 0.8 \
    --num-prompts=$(( $CONC * 10 )) \
    --max-concurrency=$CONC \
    --request-rate=inf --ignore-eos \
    --save-result --percentile-metrics="ttft,tpot,itl,e2el" \
    --result-dir=./ --result-filename=$RESULT_FILENAME.json
```
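The benchmark saves its metrics to the file given by --result-filename. A minimal sketch for inspecting that JSON without assuming its exact schema:

```python
# Sketch: dump the saved benchmark result; no assumptions about its exact schema.
# Assumes the benchmark above was run with --result-filename=Deepseek-R1-result.
import json
from pprint import pprint

with open("Deepseek-R1-result.json") as f:
    result = json.load(f)

# Print every metric the benchmark recorded.
pprint(result)
```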
ATOM demonstrates significant performance improvements over vLLM:

| Model | Framework | Tokens | Time | Throughput |
|---|---|---|---|---|
| Qwen3-0.6B | ATOM | 4096 | 0.25s | 16,643.74 tok/s |
| Qwen3-0.6B | vLLM | 4096 | 0.63s | 6,543.06 tok/s |
| Llama-3.1-8B-Instruct-FP8-KV | ATOM | 4096 | 0.68s | 5,983.37 tok/s |
| Llama-3.1-8B-Instruct-FP8-KV | vLLM | 4096 | 1.68s | 2,432.62 tok/s |
Deepseek-V3:
| Concurrency | ISL/OSL | Num Prompts | vLLM Throughput | ATOM Throughput |
|---|---|---|---|---|
| 16 | 1024/1024 | 128 | 423.68 tok/s | 922.03 tok/s |
| 32 | 1024/1024 | 128 | 629.06 tok/s | 1488.52 tok/s |
| 64 | 1024/1024 | 128 | 760.22 tok/s | 2221.25 tok/s |
| 128 | 1024/1024 | 128 | 1107.93 tok/s | 2254.88 tok/s |
First, install lm-eval to test model accuracy:
```bash
pip install "lm-eval[api]"
```

Next, start an OpenAI-compatible server using openai_server.py:

```bash
python -m atom.entrypoints.openai_server --model meta-llama/Meta-Llama-3-8B
```

Finally, run the evaluation on your chosen tasks:
```bash
lm_eval --model local-completions \
    --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=8,max_retries=3,tokenized_requests=False \
    --tasks gsm8k \
    --num_fewshot 3
```

This project was adapted from nano-vllm (https://github.com/GeeeekExplorer/nano-vllm).
