This document summarizes the successful deployment and testing of AMD Inference Microservice (AIM) on an AMD Instinct MI300X GPU system using Docker, following the walkthrough from the AMD ROCm blog.
This repository provides deployment guides for both Docker and Kubernetes:
- Docker Deployment - Single-node Docker deployment (this guide)
- Kubernetes Deployment - Production-ready Kubernetes deployment with KServe (Kubernetes Deployment Guide)
- GPU: AMD Developer Cloud AMD Instinct MI300X VF
- GPU ID: 0x74b5
- VRAM: 196GB total
- GFX Version: gfx942
- OS: Ubuntu (Linux 6.8.0-87-generic)
- ROCm: Installed and functional
- Docker: Version 29.0.2
- Container Image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.4
Quick Start: For automated validation, use the provided script:
chmod +x validate-aim-prerequisites.sh
./validate-aim-prerequisites.sh
This script automates Steps 1.1-1.9 and provides fix instructions for any failed checks.
Manual Validation: This section provides comprehensive step-by-step validation to ensure your CSP node is ready for AIM deployment. Perform each check in order and verify the expected outputs before proceeding.
Command:
uname -a
Expected Output:
Linux <hostname> <kernel-version> #<build> <distro> <date> <arch> x86_64 x86_64 x86_64 GNU/Linux
What to Check:
- System is Linux-based (Ubuntu, RHEL, or similar)
- Architecture is x86_64
- Kernel version is recent (5.15+ recommended for MI300X)
Example Output:
Linux 7 6.8.0-87-generic #88-Ubuntu SMP PREEMPT_DYNAMIC Sat Oct 11 09:28:41 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Troubleshooting:
- If architecture is not x86_64, verify you're on the correct node type
- If kernel is very old, consider updating (may require CSP support)
Command:
docker --version
Expected Output:
Docker version <version>, build <build-id>
What to Check:
- Docker is installed
- Version is 20.10+ (recommended: 24.0+)
Example Output:
Docker version 29.0.2, build 8108357
Additional Docker Checks:
Test Docker daemon:
docker ps
Expected Output:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
(May be empty if no containers running - this is fine)
Verify Docker can run containers:
docker run --rm hello-world
Expected Output:
Hello from Docker!
This message shows that your installation appears to be working correctly.
...
Troubleshooting:
- If docker: command not found, install Docker:
  # Ubuntu/Debian
  sudo apt-get update
  sudo apt-get install docker.io
  sudo systemctl start docker
  sudo systemctl enable docker
- If permission denied, add the user to the docker group or use sudo
- If the daemon is not running: sudo systemctl start docker
Command:
rocm-smi --version
Expected Output:
ROCM-SMI version: <version>
ROCM-SMI-LIB version: <version>
What to Check:
- ROCm is installed
- Version is 6.0+ (required for MI300X)
Example Output:
ROCM-SMI version: 4.0.0+4179531dcd
ROCM-SMI-LIB version: 7.8.0
Verify ROCm can detect GPUs:
rocm-smi
Expected Output:
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Junction) (Socket) (Mem, Compute, ID)
==========================================================================================================================
0 1 0x<device-id>, <temp>°C <power>W <partitions> <freq> <freq> <fan>% auto <cap>W <vram>% <gpu>%
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================
What to Check:
- At least one GPU device is listed
- Device ID is present (e.g., 0x74b5 for MI300X)
- Temperature and power readings are reasonable
- VRAM% shows available memory
Get Detailed GPU Information:
rocm-smi --showproductname
Expected Output:
============================ ROCm System Management Interface ============================
====================================== Product Info ======================================
GPU[0] : Card Series: AMD Instinct MI300X VF
GPU[0] : Card Model: 0x74b5
GPU[0] : Card Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: M3000100
GPU[0] : Subsystem ID: 0x74a1
GPU[0] : Device Rev: 0x00
GPU[0] : Node ID: 1
GPU[0] : GUID: <guid>
GPU[0] : GFX Version: gfx942
==========================================================================================
================================== End of ROCm SMI Log ===================================
What to Check:
- Card Series shows "AMD Instinct MI300X" (or compatible model)
- GFX Version is gfx942 or compatible (gfx940, gfx941, gfx942 for MI300X)
- Device ID matches expected values
Troubleshooting:
- If rocm-smi: command not found, ROCm is not installed. Contact CSP support or install ROCm
- If no GPUs are detected, verify:
  - GPU is properly installed
  - ROCm drivers are loaded: lsmod | grep amdgpu
  - Check kernel messages: dmesg | grep -i amd
- If GPU shows 0% VRAM or errors, GPU may be in low-power state (this is often normal when idle)
Command:
ls -la /dev/kfd /dev/dri/
Expected Output:
crw-rw---- 1 root render 238, 0 <date> /dev/kfd
/dev/dri/:
total 0
drwxr-xr-x 3 root root <size> <date> .
drwxr-xr-x 18 root root <size> <date> ..
drwxr-xr-x 2 root root <size> <date> by-path
crw-rw---- 1 root video 226, 0 <date> card0
crw-rw---- 1 root video 226, 1 <date> card1
...
crw-rw---- 1 root render 226, 128 <date> renderD128
crw-rw---- 1 root render 226, 129 <date> renderD129
...
What to Check:
- /dev/kfd exists and is a character device (starts with c)
- /dev/dri/ directory exists
- Multiple card* devices exist (one per GPU partition)
- Multiple renderD* devices exist (one per render node)
- Permissions show root render for /dev/kfd and root video for card*
Verify Device Accessibility:
test -r /dev/kfd && echo "✓ /dev/kfd is readable" || echo "✗ /dev/kfd is NOT readable"
test -r /dev/dri/card0 && echo "✓ /dev/dri/card0 is readable" || echo "✗ /dev/dri/card0 is NOT readable"
Expected Output:
✓ /dev/kfd is readable
✓ /dev/dri/card0 is readable
Troubleshooting:
- If /dev/kfd doesn't exist:
  - ROCm may not be properly installed
  - Kernel module may not be loaded: sudo modprobe kfd
- If devices exist but aren't readable:
  - Check permissions: ls -l /dev/kfd /dev/dri/card*
  - Add user to render and video groups: sudo usermod -aG render,video $USER
  - Log out and back in, or use newgrp render and newgrp video
Command:
id
groups
Expected Output:
uid=<uid>(<username>) gid=<gid>(<group>) groups=<gid>(<group>),<gid>(render),<gid>(video),...
What to Check:
- User is in the render group (for /dev/kfd access)
- User is in the video group (for /dev/dri/card* access)
- If running as root, these checks may not apply (root already has access)
Example Output (non-root user):
uid=1000(user) gid=1000(user) groups=1000(user),27(sudo),107(render),44(video)
If Groups Are Missing:
# Add user to required groups
sudo usermod -aG render,video $USER
# Verify (requires new login session)
groups
Troubleshooting:
- If not in required groups, add them and start a new shell session
- If running as root, permissions should be fine, but verify device access
Check Available Memory:
free -h
Expected Output:
total used free shared buff/cache available
Mem: <size>G <used>G <free>G <shared>G <cache>G <avail>G
Swap: <size>G <used>G <free>G
What to Check:
- At least 64GB RAM available (128GB+ recommended for 32B models)
- Sufficient free memory for model loading
Example Output:
total used free shared buff/cache available
Mem: 235Gi 15Gi 110Gi 4.6Mi 112Gi 220Gi
Swap: 0B 0B 0B
Check Disk Space:
df -h /
Expected Output:
Filesystem Size Used Avail Use% Mounted on
<fs> <size>G <used>G <avail>G <use>% /
What to Check:
- At least 100GB free space (200GB+ recommended)
- Model weights can be 60GB+ for large models
Example Output:
/dev/vda1 697G 181G 516G 26% /
Troubleshooting:
- If memory is insufficient, consider smaller models or increase node size
- If disk space is low, clean up or request larger storage
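If you want to script these two resource checks, the following is a minimal Python sketch (not part of the repository; the 64 GB / 100 GB thresholds are simply the recommendations above):
import os
import shutil

# Thresholds from this guide: 64GB+ RAM and 100GB+ free disk recommended.
MIN_RAM_GB = 64
MIN_DISK_GB = 100

page_size = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page
phys_pages = os.sysconf("SC_PHYS_PAGES")  # total pages of physical RAM
total_ram_gb = page_size * phys_pages / 1024**3

disk_free_gb = shutil.disk_usage("/").free / 1024**3

print(f"RAM:  {total_ram_gb:.0f} GB total (recommended >= {MIN_RAM_GB} GB)")
print(f"Disk: {disk_free_gb:.0f} GB free on / (recommended >= {MIN_DISK_GB} GB)")
if total_ram_gb < MIN_RAM_GB or disk_free_gb < MIN_DISK_GB:
    print("Resources are below the recommended minimums for a 32B model.")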
Command:
ping -c 3 8.8.8.8
Expected Output:
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=... time=... ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=... time=... ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=... time=... ms
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time ...
What to Check:
- Internet connectivity is working
- Required for downloading model weights and container images
Test Docker Hub Access:
curl -I https://hub.docker.com
Expected Output:
HTTP/2 200
...
Troubleshooting:
- If ping fails, check network configuration
- If Docker Hub is blocked, configure registry mirrors or use alternative registries
- Some CSPs require proxy configuration
Command:
netstat -tuln | grep 8000 || ss -tuln | grep 8000
Expected Output:
(No output - port is free)
OR if port is in use:
tcp 0 0 0.0.0.0:8000 0.0.0.0:* LISTEN
What to Check:
- Port 8000 is available (or choose a different port)
- If the port is in use, either stop the service or use -p <different-port>:8000
Troubleshooting:
- If the port is in use, find the process: sudo lsof -i :8000 or sudo netstat -tulpn | grep 8000
- Stop the conflicting service or use a different port
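As an alternative to netstat/ss, here is a quick Python check of the port using only the standard library (a sketch; adjust the port if you plan to map a different one):
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

if __name__ == "__main__":
    print("Port 8000 in use:", port_in_use(8000))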
Before proceeding to deployment, verify all items:
- Operating system is Linux x86_64
- Docker is installed and working (docker --version succeeds)
- Docker daemon is running (docker ps works)
- ROCm is installed (rocm-smi --version succeeds)
- GPU is detected (rocm-smi shows at least one GPU)
- GPU model is compatible (MI300X, MI325X, or similar)
- /dev/kfd exists and is readable
- /dev/dri/card* devices exist and are readable
- User has render and video group membership (or is running as root)
- Sufficient RAM available (64GB+ recommended)
- Sufficient disk space (100GB+ free)
- Network connectivity works
- Port 8000 is available (or alternative port chosen)
If all checks pass, proceed to Step 2: AIM Container Deployment.
Command:
cd ~
git clone https://github.com/amd-enterprise-ai/aim-deploy.git
cd aim-deploy
Expected Output:
Cloning into 'aim-deploy'...
remote: Enumerating objects: <n>, done.
remote: Counting objects: 100% (<n>/<n>), done.
remote: Compressing objects: 100% (<n>/<n>), done.
remote: Total <n> (delta <n>), reused <n> (delta <n>), pack-reused <n>
Receiving objects: 100% (<n>/<n>), <size> KiB | <speed> MiB/s, done.
Resolving deltas: 100% (<n>/<n>), done.
What to Check:
- Repository cloned successfully
- No error messages
Verify Repository Contents:
ls -la ~/aim-deploy/
Expected Output:
total <size>
drwxr-xr-x <n> root root <size> <date> .
drwxr-xr-x <n> root root <size> <date> ..
drwxr-xr-x <n> root root <size> <date> .git
-rw-r--r-- <n> root root <size> <date> .gitignore
-rw-r--r-- <n> root root <size> <date> LICENSE
-rw-r--r-- <n> root root <size> <date> README.md
drwxr-xr-x <n> root root <size> <date> k8s
drwxr-xr-x <n> root root <size> <date> kserve
Troubleshooting:
- If git: command not found, install git: sudo apt-get install git (Ubuntu/Debian)
- If the clone fails, check network connectivity and GitHub access
Command:
docker pull amdenterpriseai/aim-qwen-qwen3-32b:0.8.4
Expected Output:
0.8.4: Pulling from amdenterpriseai/aim-qwen-qwen3-32b
<layer-id>: Pulling fs layer
<layer-id>: Pull complete
...
Digest: sha256:<hash>
Status: Downloaded newer image for amdenterpriseai/aim-qwen-qwen3-32b:0.8.4
docker.io/amdenterpriseai/aim-qwen-qwen3-32b:0.8.4
What to Check:
- Image pull completes without errors
- Final status shows "Downloaded newer image" or "Image is up to date"
- Digest is shown (for verification)
Verify Image is Available:
docker images | grep aim-qwen
Expected Output:
amdenterpriseai/aim-qwen-qwen3-32b 0.8.4 <image-id> <size> <time-ago>
What to Check:
- Image is listed with correct tag
- Image size is reasonable (several GB)
Troubleshooting:
- If pull fails with "unauthorized", check Docker Hub access
- If pull is slow, check network bandwidth
- If pull fails with "no space", free up disk space: docker system prune
Before deploying, test that AIM can detect your hardware:
Command:
docker run --rm --device=/dev/kfd --device=/dev/dri \
--security-opt seccomp=unconfined \
--group-add video \
--ipc=host \
--shm-size=8g \
  amdenterpriseai/aim-qwen-qwen3-32b:0.8.4 list-profiles
Expected Output:
2025-<date> - aim_runtime.gpu_detector - INFO - Detected <n> AMD GPU(s)
2025-<date> - aim_runtime.gpu_detector - INFO - GPU <device-id>: {
"device_id": "<id>",
"model": "MI300X",
"vram_total": <size>,
"vram_used": <size>,
"vram_free": <size>,
...
}
...
AIM Profile Compatibility Report
====================================================================================================
...
| Profile | GPU | Precision | Engine | TP | Metric | Type | Priority | Manual Only | Compatibility |
|-----------------------------------------|--------|-----------|--------|----|------------|-------------|----------|-------------|-----------------|
| vllm-mi300x-fp16-tp1-latency | MI300X | fp16 | vllm | 1 | latency | optimized | 2 | No | compatible |
...
What to Check:
- GPU is detected ("Detected AMD GPU(s)")
- GPU model is identified correctly (MI300X, MI325X, etc.)
- VRAM information is shown
- At least one profile shows "compatible" status
- No critical errors in output
Troubleshooting:
- If "Detected 0 AMD GPU(s)", verify GPU device access in container
- If GPU model is not recognized, check if it's a supported model
- If profiles show "gpu_mismatch", verify GPU model compatibility
Command:
docker run -d --name aim-qwen3-32b \
-e PYTHONUNBUFFERED=1 \
--device=/dev/kfd \
--device=/dev/dri \
--security-opt seccomp=unconfined \
--group-add video \
--ipc=host \
--shm-size=8g \
-p 8000:8000 \
  amdenterpriseai/aim-qwen-qwen3-32b:0.8.4 serve
Expected Output:
<container-id>
What to Check:
- Container ID is returned (long hexadecimal string)
- No error messages
Verify Container is Running:
docker ps | grep aim-qwen3-32b
Expected Output:
<container-id> amdenterpriseai/aim-qwen-qwen3-32b:0.8.4 "./entrypoint.py ser…" <time> ago Up <time> 0.0.0.0:8000->8000/tcp, [::]:8000->8000/tcp aim-qwen3-32b
What to Check:
- Container is listed
- Status shows "Up"
- Port mapping shows 0.0.0.0:8000->8000/tcp
Troubleshooting:
- If the container exits immediately, check logs: docker logs aim-qwen3-32b
- If port binding fails, check whether port 8000 is already in use
- If container fails to start, verify all device paths exist
Check Container Logs:
docker logs -f aim-qwen3-32b
Expected Output (Initial):
2025-<date> - aim_runtime.gpu_detector - INFO - Detected 1 AMD GPU(s)
2025-<date> - aim_runtime.profile_selector - INFO - Selected profile: .../vllm-mi300x-fp16-tp1-latency.yaml
2025-<date> - aim_runtime.aim_runtime - INFO - --- Setting Environment Variables ---
...
INFO 11-26 <time> [api_server.py:1885] vLLM API server version <version>
INFO 11-26 <time> [__init__.py:742] Resolved architecture: Qwen3ForCausalLM
INFO 11-26 <time> [gpu_model_runner.py:1932] Starting to load model Qwen/Qwen3-32B...
What to Check:
- GPU detection succeeds
- Profile is selected
- vLLM API server starts
- Model loading begins
Expected Output (During Model Loading):
Loading safetensors checkpoint shards: 0% Completed | 0/17 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 6% Completed | 1/17 [00:02<00:33, 2.12s/it]
Loading safetensors checkpoint shards: 12% Completed | 2/17 [00:04<00:34, 2.29s/it]
...
What to Check:
- Model shards are loading (progress increases)
- No errors during loading
Expected Output (When Ready):
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
What to Check:
- "Application startup complete" message appears
- Server is ready to accept requests
Troubleshooting:
- If model download fails, check network connectivity
- If loading is very slow, verify GPU memory is sufficient
- If errors occur, check the full logs: docker logs aim-qwen3-32b 2>&1 | tail -50
Monitor GPU Usage During Loading:
watch -n 2 'rocm-smi --showmemuse | grep -A 2 "GPU\[0\]"'
Expected Behavior:
- VRAM usage increases as model loads
- Eventually reaches 85-95% for 32B model on single GPU
Check Container Status:
docker ps -a | grep aim-qwen3-32b
Expected Output:
<container-id> ... Up <time> ... aim-qwen3-32b
What to Check:
- Status is "Up" (not "Exited" or "Restarting")
- Uptime is reasonable
Check Resource Usage:
docker stats aim-qwen3-32b --no-stream
Expected Output:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
<id> aim-qwen3-32b <cpu>% <mem>GiB / <limit>GiB <pct>% <net> <block> <n>
What to Check:
- Memory usage is reasonable (several GB)
- CPU usage may be high during model loading, then lower when idle
- Container is not using excessive resources
Troubleshooting:
- If the container is "Exited", check the exit code: docker inspect aim-qwen3-32b | grep -A 5 State
- If the container is "Restarting", check the logs for errors
- If memory usage is very high, verify model size matches available resources
- AIM automatically detected the MI300X GPU
- Selected optimal profile: vllm-mi300x-fp16-tp1-latency (optimized, FP16 precision)
- Found 3 compatible profiles for the hardware configuration
- Total of 24 profiles analyzed (16 for MI300X, 8 for MI325X)
- Automatically detected 1 AMD GPU
- Identified GPU model: MI300X
- Detected VRAM: 196GB total, 196GB free at startup
- Configured for gfx942 architecture
AIM automatically set the following optimization variables:
- GPU_ARCHS=gfx942
- HSA_NO_SCRATCH_RECLAIM=1
- VLLM_USE_AITER_TRITON_ROPE=1
- VLLM_ROCM_USE_AITER=1
- VLLM_ROCM_USE_AITER_RMSNORM=1
- Model: Qwen/Qwen3-32B (32B parameters)
- Download time: ~14.7 seconds
- Loading: 17 checkpoint shards loaded successfully
- GPU memory usage: 91% VRAM allocated during operation
- Total startup time: ~2.5 minutes
Quick Start: For automated validation, use the provided script:
chmod +x validate-aim-inference.sh
./validate-aim-inference.sh
This script automates Steps 5.1-5.8 and provides fix instructions for any failed checks.
Manual Validation: This section provides step-by-step validation of inference functionality. Perform each check in order and verify the expected outputs before proceeding.
Wait for "Application startup complete" in logs:
docker logs aim-qwen3-32b 2>&1 | grep -i "application startup complete"
Expected Output:
INFO: Application startup complete.
What to Check:
- Message appears in logs
- No errors after this message
Alternative: Check if server responds:
timeout 5 curl -s http://localhost:8000/health || echo "Server not ready yet"
Expected Output (when ready):
(May return empty or JSON response)
OR (if not ready):
Server not ready yet
Troubleshooting:
- If server doesn't become ready after 5-10 minutes, check logs for errors
- If the port is not accessible, verify the port mapping: docker ps | grep 8000
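If you prefer to script the wait instead of tailing logs, the sketch below polls the OpenAI-compatible /v1/models endpoint until it answers. It assumes the requests package and the default port mapping used above; it is an illustration, not part of the provided validation scripts:
import time
import requests

ENDPOINT = "http://localhost:8000"  # adjust if you mapped a different host port

def wait_until_ready(timeout_s=900, poll_s=10):
    """Poll /v1/models until the vLLM API server responds with HTTP 200."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            r = requests.get(f"{ENDPOINT}/v1/models", timeout=5)
            if r.status_code == 200:
                return True
        except requests.exceptions.RequestException:
            pass  # server is still loading the model
        time.sleep(poll_s)
    return False

if __name__ == "__main__":
    if wait_until_ready():
        print("Server is ready")
    else:
        print("Server did not become ready; check: docker logs aim-qwen3-32b")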
Command:
curl -v http://localhost:8000/health
Expected Output:
* Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000
< HTTP/1.1 200 OK
< ...
What to Check:
- HTTP status is 200 OK
- Connection succeeds
Troubleshooting:
- If connection refused, verify container is running and port is mapped
- If 404, the endpoint may not be available (this is okay; try /v1/models instead)
Command:
curl -s http://localhost:8000/v1/models | python3 -m json.tool
Expected Output:
{
"object": "list",
"data": [
{
"id": "Qwen/Qwen3-32B",
"object": "model",
"created": <timestamp>,
"owned_by": "vllm",
"root": "Qwen/Qwen3-32B",
"parent": null,
"max_model_len": 32768,
"permission": [...]
}
]
}
What to Check:
- JSON response is valid
- Model ID matches expected model (Qwen/Qwen3-32B)
- max_model_len is shown (32768 for Qwen3-32B)
Alternative (without python):
curl -s http://localhost:8000/v1/models
Expected Output:
{"object":"list","data":[{"id":"Qwen/Qwen3-32B",...}]}
Troubleshooting:
- If the connection fails, check the container status: docker ps | grep aim
- If the JSON is malformed, the server may still be starting
- If model ID doesn't match, verify correct container image was used
Important Note for Qwen3: Qwen3-32B uses a reasoning/thinking process before generating responses. This can make it appear slow or unresponsive. Use streaming to see progress, and set higher max_tokens to allow complete thinking + response.
Command (with streaming - recommended token allocation):
curl -s -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-32B",
"messages": [
{"role": "user", "content": "What are the key advantages of using GPUs for AI inference, and how do they compare to CPUs?"}
],
"max_tokens": 2048,
"stream": true,
"temperature": 0.7
}'
Note: The -s flag suppresses curl progress output; max_tokens: 2048 ensures thinking completes and a response is generated.
Expected Output (Raw Streaming - use -s flag to suppress curl progress):
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":...,"model":"Qwen/Qwen3-32B","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":...,"model":"Qwen/Qwen3-32B","choices":[{"index":0,"delta":{"reasoning_content":"\n"},"finish_reason":null}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":...,"model":"Qwen/Qwen3-32B","choices":[{"index":0,"delta":{"reasoning_content":"Okay"},"finish_reason":null}]}
... (thinking process continues with reasoning_content) ...
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":...,"model":"Qwen/Qwen3-32B","choices":[{"index":0,"delta":{"content":"\n\n"},"finish_reason":null}]}
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":...,"model":"Qwen/Qwen3-32B","choices":[{"index":0,"delta":{"content":"GPUs"},"finish_reason":null}]}
... (response continues with content) ...
data: [DONE]
Important: Qwen3 uses two fields:
- reasoning_content: the thinking/reasoning process (appears first)
- content: the final response (appears after thinking)
Note: Without the -s flag, curl shows progress output that can interfere with parsing. Always use curl -s for streaming responses.
What to Check:
- Stream starts immediately (shows progress)
- Thinking process is visible (may include <thinking> tags or reasoning text)
- Final response follows after thinking
- Stream ends with [DONE]
Process Streaming Response (Python example - shows both thinking and response):
curl -s -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-32B",
"messages": [{"role": "user", "content": "What are the key advantages of using GPUs for AI inference, and how do they compare to CPUs?"}],
"max_tokens": 2048,
"stream": true
}' | python3 -c "
import sys, json
for line in sys.stdin:
if line.startswith('data: '):
data = line[6:].strip()
if data == '[DONE]':
break
try:
chunk = json.loads(data)
if 'choices' in chunk and len(chunk['choices']) > 0:
delta = chunk['choices'][0].get('delta', {})
# Qwen3 uses reasoning_content for thinking, content for response
reasoning = delta.get('reasoning_content', '')
content = delta.get('content', '')
if reasoning:
print(reasoning, end='', flush=True)
if content:
print(content, end='', flush=True)
except:
pass
print() # Newline at end
"
Process Streaming Response (shows only final response, filters thinking):
curl -s -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-32B",
"messages": [{"role": "user", "content": "What are the key advantages of using GPUs for AI inference, and how do they compare to CPUs?"}],
"max_tokens": 2048,
"stream": true
}' | python3 -c "
import sys, json
for line in sys.stdin:
if line.startswith('data: '):
data = line[6:].strip()
if data == '[DONE]':
break
try:
chunk = json.loads(data)
if 'choices' in chunk and len(chunk['choices']) > 0:
delta = chunk['choices'][0].get('delta', {})
# Only show content (final response), skip reasoning_content
content = delta.get('content', '')
if content:
print(content, end='', flush=True)
except:
pass
print() # Newline at end
"
Note: The -s flag in curl suppresses progress output. Without it, you'll see curl's progress statistics mixed with the stream.
Expected Behavior:
- Text appears incrementally as it's generated
- Thinking process may be visible first
- Final response follows
- User sees progress, reducing perception of slowness
Command (non-streaming with higher token limit):
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-32B",
"messages": [
{"role": "user", "content": "What are the key advantages of using GPUs for AI inference, and how do they compare to CPUs?"}
],
"max_tokens": 2048,
"temperature": 0.7
}'
Note: Using max_tokens: 2048 ensures thinking completes before the response is generated. For very complex questions, use 4096.
Expected Output:
{
"id": "chatcmpl-<id>",
"object": "chat.completion",
"created": <timestamp>,
"model": "Qwen/Qwen3-32B",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "<thinking process>...<final response>"
},
"finish_reason": "stop" or "length"
}
],
"usage": {
"prompt_tokens": <n>,
"completion_tokens": <n>,
"total_tokens": <n>
}
}
What to Check:
- Response is valid JSON
- model field shows "Qwen/Qwen3-32B"
- choices[0].message.content contains text (may include thinking + response)
- usage.completion_tokens shows total tokens used (thinking + response)
- finish_reason is "stop" (complete) or "length" (truncated - increase max_tokens)
Token Allocation Guidelines:
Important: Qwen3's max_tokens applies to total output (thinking + response). If thinking uses all tokens, no response is generated. Allocate generously:
- Short responses: max_tokens: 512 (200-300 thinking + 200-300 response)
- Medium responses: max_tokens: 1024 (400-500 thinking + 500-600 response) - recommended minimum
- Long responses: max_tokens: 2048 (800-1000 thinking + 1000-1200 response) - recommended for complex questions
- Very long: max_tokens: 4096 (1500-2000 thinking + 2000-2500 response)
Rule of thumb: Allocate 2-3x more tokens than you think you need. Qwen3's thinking can be extensive, especially for complex questions.
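As a concrete illustration of the rule of thumb, this Python sketch (an example using the requests package, not part of the repository) checks finish_reason and retries once with a larger budget when thinking consumed the whole allocation:
import requests

ENDPOINT = "http://localhost:8000"  # or your remote endpoint

def chat(prompt, max_tokens=1024):
    r = requests.post(
        f"{ENDPOINT}/v1/chat/completions",
        json={
            "model": "Qwen/Qwen3-32B",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        timeout=600,
    )
    r.raise_for_status()
    return r.json()["choices"][0]

choice = chat("Explain tensor parallelism in two paragraphs.")
# If the budget was exhausted by thinking, retry with a larger allocation.
if choice["finish_reason"] == "length":
    choice = chat("Explain tensor parallelism in two paragraphs.", max_tokens=4096)
print(choice["message"]["content"])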
Extract just the final response (filter thinking):
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-32B",
"messages": [{"role": "user", "content": "Say hello in one sentence."}],
"max_tokens": 256
}' | python3 -c "
import sys, json
data = json.load(sys.stdin)
content = data['choices'][0]['message']['content']
# Remove thinking tags if present
content = content.replace('<thinking>', '').replace('</thinking>', '')
# Or extract text after thinking markers
print(content.strip())
"
Expected Output:
Hello! How can I assist you today?
Qwen3 uses a reasoning process that:
- Thinks first (internal reasoning, may be visible in output)
- Then responds (final answer to user)
This is why:
- Responses may seem slow initially (model is thinking)
- Higher max_tokens is needed (thinking + response)
- Streaming is recommended (shows progress)
Troubleshooting:
- If you see curl progress output but no content: use the curl -s flag to suppress progress output. The script needs clean JSON lines starting with data:.
- If the request appears to hang: this is normal - Qwen3 is thinking. Use streaming to see progress (you'll see reasoning_content first).
- If no content appears in the stream: check that the script handles both reasoning_content and content fields (see the examples above).
- If only thinking and no response (finish_reason: "length"): this is the most common issue. Increase max_tokens to 2048 or 4096; thinking used all the tokens, leaving none for the response.
- If the response is cut off: increase max_tokens further - both thinking and response need tokens.
- If thinking seems incomplete: increase max_tokens - Qwen3 needs enough tokens to complete its reasoning.
- If the request times out: check that max_tokens isn't too high, or increase the timeout.
- If a 500 error occurs: check the container logs: docker logs aim-qwen3-32b | tail -20
- If "model not found": verify the model loaded correctly in the logs.
Key Fix for Incomplete Output: If you see thinking but no response, or thinking cuts off mid-sentence:
- Check finish_reason in the response - if it's "length", tokens ran out
- Increase max_tokens to at least 2048 (4096 for complex questions)
- Remember: max_tokens = thinking tokens + response tokens
- Qwen3's thinking can be 500-1000+ tokens for complex questions
Recommended Settings for Qwen3:
{
"model": "Qwen/Qwen3-32B",
"messages": [...],
"max_tokens": 2048,
"stream": true,
"temperature": 0.7,
"top_p": 0.9
}
Why 2048 tokens?
- Qwen3's thinking process can use 500-1000+ tokens for complex questions
- Response typically needs 500-1500 tokens
- 2048 ensures thinking completes AND response is generated
- If you see finish_reason: "length" with only thinking, increase to 4096
Command:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "The AMD Inference Microservice (AIM) is",
"max_tokens": 50
}'
Expected Output:
{
"id": "cmpl-<id>",
"object": "text_completion",
"created": <timestamp>,
"model": "Qwen/Qwen3-32B",
"choices": [
{
"text": "<generated text>",
"index": 0,
"finish_reason": "length" or "stop",
"logprobs": null
}
],
"usage": {
"prompt_tokens": <n>,
"completion_tokens": <n>,
"total_tokens": <n>
}
}
What to Check:
- Response is valid JSON
- choices[0].text contains the generated continuation
- Token usage is shown
Troubleshooting:
- Similar to chat completions troubleshooting
- Some models may prefer chat format over completions
While running inference, monitor GPU:
rocm-smi --showmemuse --showuse
Expected Output:
=================================== % time GPU is busy ===================================
GPU[0] : GPU use (%): <percentage>
GPU[0] : GFX Activity: <value>
==========================================================================================
=================================== Current Memory Use ===================================
GPU[0] : GPU Memory Allocated (VRAM%): <percentage>
GPU[0] : GPU Memory Read/Write Activity (%): <percentage>
What to Check:
- GPU use increases during inference (may reach 50-100%)
- VRAM usage is high (85-95% for 32B model)
- Memory activity increases during processing
Expected Behavior:
- GPU use: 0-10% when idle, 50-100% during inference
- VRAM: 85-95% allocated for loaded model
- Memory activity: Increases during token generation
Troubleshooting:
- If GPU use stays at 0% during inference, GPU may not be utilized (check logs)
- If VRAM is very low, model may not have loaded correctly
- If memory activity is always 0%, check if inference is actually running
Test Response Time:
time curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 10}' \
-o /dev/null -s
Expected Output:
real 0m<seconds>.<ms>s
user 0m0.00Xs
sys 0m0.00Xs
What to Check:
- First request may take 10-60 seconds (cold start)
- Subsequent requests should be faster (5-30 seconds depending on length)
- Response time is reasonable for model size
Test Multiple Requests:
for i in {1..3}; do
echo "Request $i:"
time curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Count to 5."}], "max_tokens": 20}' \
-o /dev/null -s
echo ""
done
Expected Behavior:
- First request: Slower (may include initialization)
- Subsequent requests: Faster and more consistent
Troubleshooting:
- If all requests are very slow, check GPU utilization
- If requests fail intermittently, check container resource limits
- If response times are inconsistent, check for resource contention
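To measure latency from a client instead of with time + curl, here is a small Python sketch (assumes the requests package and the local port mapping) that records time-to-first-chunk and total time for a streaming request:
import time
import requests

ENDPOINT = "http://localhost:8000"  # assumed local port mapping

payload = {
    "model": "Qwen/Qwen3-32B",
    "messages": [{"role": "user", "content": "Count to 5."}],
    "max_tokens": 64,
    "stream": True,
}

start = time.time()
time_to_first_chunk = None
with requests.post(f"{ENDPOINT}/v1/chat/completions",
                   json=payload, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        if line == b"data: [DONE]":
            break
        if time_to_first_chunk is None:
            time_to_first_chunk = time.time() - start  # latency to first streamed chunk
total = time.time() - start
print(f"first chunk after {time_to_first_chunk:.2f}s, full response in {total:.2f}s")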
Before considering deployment complete, verify:
- Container is running: docker ps | grep aim-qwen3-32b shows "Up"
- API server is ready: logs show "Application startup complete"
- Health endpoint responds (if available): curl http://localhost:8000/health
- Models endpoint works: curl http://localhost:8000/v1/models returns JSON
- Chat completions work: requests return a valid response with generated text
- GPU is utilized: rocm-smi shows GPU activity during inference
- VRAM is allocated: rocm-smi --showmemuse shows high VRAM usage
- Response times are acceptable: requests complete in reasonable time
- No errors in logs: docker logs aim-qwen3-32b shows no critical errors
If all checks pass, your AIM deployment is fully operational!
- Automatically matches hardware capabilities with optimal performance profiles
- Supports multiple precision formats (FP8, FP16)
- Optimizes for latency or throughput based on configuration
- Detects GPU model and architecture
- Configures vLLM with optimal parameters for MI300X
- Enables ROCm-specific optimizations (Aiter, Triton kernels)
- Full OpenAI API compatibility
- Supports chat completions, completions, embeddings
- Standard REST endpoints for easy integration
- Model: Qwen3-32B with reasoning capabilities
- Max Context Length: 32,768 tokens
- Precision: FP16 (float16)
- Tensor Parallelism: 1 (single GPU)
- Max Sequences: 512 concurrent
- Reasoning Parser: Qwen3
docker run -d --name aim-qwen3-32b \
-e PYTHONUNBUFFERED=1 \
--device=/dev/kfd \
--device=/dev/dri \
--security-opt seccomp=unconfined \
--group-add video \
--ipc=host \
--shm-size=8g \
-p 8000:8000 \
  amdenterpriseai/aim-qwen-qwen3-32b:0.8.4 serve
- --device=/dev/kfd --device=/dev/dri: GPU device access
- --security-opt seccomp=unconfined: Required for ROCm
- --group-add video: GPU access permissions
- --ipc=host: Shared memory for multi-process communication
- --shm-size=8g: Shared memory for model weights
- -p 8000:8000: API server port mapping
- list-profiles: List all available performance profiles
- dry-run: Preview the selected profile and generated command
- serve: Start the inference server
- download-to-cache: Pre-download models to cache
- Model Loading: ~2.5 minutes for 32B model
- GPU Memory: 91% utilization during inference
- API Response: Sub-minute response times for typical queries
- Concurrent Requests: Supports up to 512 concurrent sequences
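To exercise concurrency from the client side, here is a minimal Python sketch (an illustration, assuming the requests package; vLLM batches the in-flight requests on the GPU):
import concurrent.futures
import requests

ENDPOINT = "http://localhost:8000"  # assumed local port mapping

def ask(prompt):
    r = requests.post(
        f"{ENDPOINT}/v1/chat/completions",
        json={"model": "Qwen/Qwen3-32B",
              "messages": [{"role": "user", "content": prompt}],
              "max_tokens": 256},
        timeout=600,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["finish_reason"]

prompts = [f"Give me one fact about GPUs (fact #{i})." for i in range(4)]
# The server handles these requests concurrently; each returns independently.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for reason in pool.map(ask, prompts):
        print("finish_reason:", reason)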
Once your AIM inference service is running, you can connect to it from client applications on your laptop or other machines. This section covers different connection methods and provides code examples.
If your AIM service is running on a remote MI300X node accessible via SSH:
Step 1: Set up SSH port forwarding from your laptop:
ssh -L 8000:localhost:8000 user@remote-mi300x-host
Step 2: If AIM is running in Kubernetes on the remote node, first SSH in and set up port forwarding:
# On remote node
kubectl port-forward service/aim-service-predictor 8000:80
Step 3: Then from your laptop, forward to that port:
ssh -L 8000:localhost:8000 user@remote-mi300x-host
Step 4: Your client can now connect to http://localhost:8000
If the endpoint is directly accessible (public IP or VPN):
For Docker deployment:
- Find the remote node's IP: hostname -I or curl ifconfig.me
- Connect to: http://<remote-ip>:8000
For Kubernetes deployment:
- Use NodePort: kubectl get svc to find the NodePort
- Use LoadBalancer: kubectl get svc to find the LoadBalancer IP
- Or configure Ingress for domain-based access
Install dependencies:
pip install requests
Basic usage:
import requests
# Initialize client (use localhost if using SSH port forwarding)
endpoint = "http://localhost:8000" # or http://<remote-ip>:8000
# List available models
response = requests.get(f"{endpoint}/v1/models")
models = response.json()
print(models)
# Send a chat completion
response = requests.post(
f"{endpoint}/v1/chat/completions",
json={
"messages": [
{"role": "user", "content": "What is AIM?"}
],
"max_tokens": 2048,
"temperature": 0.7
}
)
result = response.json()
print(result['choices'][0]['message']['content'])
Streaming example:
import requests
import json
response = requests.post(
f"{endpoint}/v1/chat/completions",
json={
"messages": [{"role": "user", "content": "Explain quantum computing"}],
"max_tokens": 2048,
"stream": True
},
stream=True
)
for line in response.iter_lines():
if not line:
continue
line = line.decode('utf-8')
if line.startswith('data: '):
data = line[6:].strip()
if data == '[DONE]':
break
try:
chunk = json.loads(data)
if 'choices' in chunk:
delta = chunk['choices'][0].get('delta', {})
# Qwen3 uses reasoning_content for thinking, content for response
reasoning = delta.get('reasoning_content', '')
content = delta.get('content', '')
if reasoning:
print(reasoning, end='', flush=True)
if content:
print(content, end='', flush=True)
except json.JSONDecodeError:
            continue
const axios = require('axios');
const endpoint = 'http://localhost:8000'; // or remote IP
// Chat completion
async function chatCompletion(prompt) {
const response = await axios.post(`${endpoint}/v1/chat/completions`, {
messages: [{ role: 'user', content: prompt }],
max_tokens: 2048,
temperature: 0.7
});
return response.data.choices[0].message.content;
}
// Usage
chatCompletion('What is AIM?').then(console.log);
Basic request:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 2048
}'
Streaming request:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 2048,
"stream": true
}' \
--no-buffer -N -s | \
while IFS= read -r line; do
line=${line#data: }
[[ "$line" == "[DONE]" ]] && break
echo "$line" | jq -r '.choices[0].delta.content // empty' 2>/dev/null
done
Qwen3 models use a reasoning process before generating responses:
- Token Allocation: The max_tokens parameter applies to total output (thinking + response). Allocate generously:
  - Short responses: max_tokens: 512
  - Medium responses: max_tokens: 1024 (recommended minimum)
  - Long responses: max_tokens: 2048 (recommended for complex questions)
  - Very long: max_tokens: 4096
- Streaming Fields: When streaming, Qwen3 provides:
  - reasoning_content: the thinking/reasoning process (appears first)
  - content: the final response (appears after thinking)
- Response Time: Initial responses may seem slow as the model thinks first, then responds. Use streaming to see progress.
Environment Variables:
export AIM_ENDPOINT="http://localhost:8000"
export AIM_MODEL="Qwen/Qwen3-32B"
Timeout Settings: For long-running requests, increase timeouts:
import requests
response = requests.post(url, json=data, timeout=600)  # 10 minutes
Error Handling:
try:
response = requests.post(url, json=data, timeout=300)
response.raise_for_status()
except requests.exceptions.Timeout:
print("Request timed out - try increasing max_tokens or timeout")
except requests.exceptions.ConnectionError:
print("Cannot connect - check SSH tunnel or network")
except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e}")
- Connection Refused: Ensure the SSH tunnel is active or the endpoint is accessible
- Timeout Errors: Increase timeout values or reduce max_tokens
- Model Not Found: List available models: curl http://localhost:8000/v1/models
- Incomplete Responses: Increase max_tokens - Qwen3's thinking process uses tokens too
For more details on API endpoints and parameters, see the OpenAI API documentation (AIM is fully compatible with OpenAI's API format).
For Kubernetes deployment with KServe, see the comprehensive guides in the k8s/ directory:
- Kubernetes Deployment Guide - Overview and quick start
- Detailed Kubernetes Walkthrough - Complete step-by-step instructions
- Kubernetes Quick Reference - Quick command reference
The Kubernetes deployment includes:
- KServe integration for Kubernetes-native model serving
- Autoscaling with KEDA based on custom metrics
- Observability with OpenTelemetry LGTM stack (Loki, Grafana, Tempo, Mimir)
- Production-ready configuration with health checks and resource management
Other AIM container images available:
- amdenterpriseai/aim-meta-llama-llama-3-1-8b-instruct:0.8.4
- amdenterpriseai/aim-base:0.8
- Blog Post: https://rocm.blogs.amd.com/artificial-intelligence/enterprise-ai-aims/README.html
- Deployment Repository: https://github.com/amd-enterprise-ai/aim-deploy
- AIM Catalog: https://enterprise-ai.docs.amd.com/en/latest/aims/catalog/models.html
Symptoms:
- Container status shows "Exited" shortly after starting
- docker ps -a shows the container with an exit code
Diagnosis:
docker logs aim-qwen3-32b
docker inspect aim-qwen3-32b | grep -A 10 State
Common Causes:
- GPU device access denied
  - Solution: Verify device permissions: ls -l /dev/kfd /dev/dri/card*
  - Add user to groups: sudo usermod -aG render,video $USER
  - Or run the container as root (if appropriate)
- ROCm not properly installed
  - Solution: Verify ROCm: rocm-smi --version
  - Check kernel modules: lsmod | grep amdgpu
- Insufficient shared memory
  - Solution: Increase shm-size: --shm-size=16g (instead of 8g)
- Port already in use
  - Solution: Use a different port: -p 8001:8000, or find and stop the conflicting service
Symptoms:
- Logs show "Detected 0 AMD GPU(s)"
- Profile selection fails
Diagnosis:
docker run --rm --device=/dev/kfd --device=/dev/dri \
--security-opt seccomp=unconfined --group-add video \
--ipc=host --shm-size=8g \
  amdenterpriseai/aim-qwen-qwen3-32b:0.8.4 list-profiles
Solutions:
- Verify device mapping
  # Check devices exist on host
  ls -la /dev/kfd /dev/dri/card0
  # Test device access in container
  docker run --rm --device=/dev/kfd --device=/dev/dri \
    --security-opt seccomp=unconfined --group-add video \
    --ipc=host \
    amdenterpriseai/aim-qwen-qwen3-32b:0.8.4 \
    bash -c "ls -la /dev/kfd /dev/dri/"
- Check ROCm in container
  docker run --rm --device=/dev/kfd --device=/dev/dri \
    --security-opt seccomp=unconfined --group-add video \
    --ipc=host \
    amdenterpriseai/aim-qwen-qwen3-32b:0.8.4 \
    bash -c "rocm-smi"
- Verify seccomp settings
  - Ensure --security-opt seccomp=unconfined is included
  - Some systems may require additional security options
Symptoms:
- Model download fails
- Loading progress stalls
- Container runs out of memory
Diagnosis:
docker logs aim-qwen3-32b | grep -i "error\|fail\|timeout"
docker stats aim-qwen3-32b --no-stream
rocm-smi --showmemuse
Solutions:
- Network issues during download
  - Check connectivity: ping -c 3 8.8.8.8
  - Verify Docker Hub access: curl -I https://hub.docker.com
  - Some CSPs require proxy configuration
- Insufficient disk space
  - Check available space: df -h /
  - Model weights can be 60GB+ for large models
  - Clean up: docker system prune
- Insufficient GPU memory
  - Check VRAM: rocm-smi --showmemuse
  - For the 32B model, ~180GB+ VRAM is needed
  - Consider a smaller model or a multi-GPU setup
- Memory allocation errors
  - Increase shared memory: --shm-size=16g or --shm-size=32g
  - Check system RAM: free -h
Symptoms:
- Container is running but API doesn't respond
- Connection refused or timeout errors
Diagnosis:
docker logs aim-qwen3-32b | tail -50
docker ps | grep aim-qwen3-32b
netstat -tuln | grep 8000
curl -v http://localhost:8000/health
Solutions:
- Server still starting
  - Wait for "Application startup complete" in the logs
  - Model loading can take 2-5 minutes for the 32B model
  - Monitor logs: docker logs -f aim-qwen3-32b
- Port not mapped correctly
  - Verify mapping: docker ps | grep 8000
  - Check if the port is in use: netstat -tuln | grep 8000
  - Try a different port: -p 8001:8000
- Firewall blocking
  - Check firewall rules: sudo iptables -L -n | grep 8000
  - Some CSPs have security groups that need configuration
- Container networking issues
  - Test from inside the container: docker exec aim-qwen3-32b curl http://localhost:8000/health
  - If it works inside but not outside, check the port mapping
Symptoms:
- API returns 500 errors
- Requests timeout
- No response generated
Diagnosis:
docker logs aim-qwen3-32b | tail -100
rocm-smi --showuse --showmemuse
docker stats aim-qwen3-32b --no-stream
Solutions:
- Model not fully loaded
  - Wait for "Application startup complete"
  - Check model loading progress in the logs
- GPU out of memory
  - Check VRAM usage: rocm-smi --showmemuse
  - Reduce max_tokens in requests
  - Reduce max-num-seqs in the profile (requires a custom profile)
- Request format incorrect
  - Verify JSON format: echo '{"messages":[...]}' | python3 -m json.tool
  - Check that required fields are present
  - Ensure the Content-Type header: -H "Content-Type: application/json"
- Resource exhaustion
  - Check CPU/memory: docker stats aim-qwen3-32b
  - Check system resources: top, free -h
  - Restart the container if needed: docker restart aim-qwen3-32b
Symptoms:
- Very slow inference (minutes per request)
- Low GPU utilization
- High latency
Diagnosis:
rocm-smi --showuse
docker stats aim-qwen3-32b --no-stream
docker logs aim-qwen3-32b | grep -i "profile\|optimization"
Solutions:
- Wrong profile selected
  - Check the selected profile in the logs
  - Verify it matches your GPU model
  - Try manual profile selection (advanced)
- GPU in low-power state
  - Check power state: rocm-smi --showpower
  - Some GPUs need a workload to wake up
  - The first request may be slower
- System resource contention
  - Check other processes using the GPU: rocm-smi
  - Check CPU usage: top
  - Ensure sufficient system resources
- Network latency (if accessing remotely)
  - Test locally first: curl http://localhost:8000/...
  - If remote access is needed, consider port forwarding or a load balancer
Symptoms:
- Container status shows "Restarting"
- Exit code is non-zero
- Logs show repeated startup attempts
Diagnosis:
docker ps -a | grep aim-qwen3-32b
docker inspect aim-qwen3-32b | grep -A 10 RestartPolicy
docker logs aim-qwen3-32b | tail -100
Solutions:
- Check restart policy
  - The default may be "always", causing restarts
  - Remove the container and recreate it without a restart policy
  - Or set it to "no": --restart=no
- Identify root cause
  - Check the exit code: docker inspect aim-qwen3-32b | grep ExitCode
  - Review the logs for an error pattern
  - Common causes: OOM, device access, configuration errors
- Fix underlying issue
  - Address the root cause (see the other troubleshooting sections)
  - Once fixed, the container should stay running
If issues persist after trying the above solutions:
- Collect Diagnostic Information:
  # System information
  uname -a
  docker --version
  rocm-smi --version
  # Container status
  docker ps -a | grep aim
  docker inspect aim-qwen3-32b
  # Recent logs
  docker logs aim-qwen3-32b 2>&1 | tail -200
  # GPU status
  rocm-smi
  rocm-smi --showmemuse --showuse
  # System resources
  free -h
  df -h /
- Check AIM Documentation:
- GitHub Repository: https://github.com/amd-enterprise-ai/aim-deploy
- Open issues for known problems
- Check release notes for version-specific issues
- Contact Support:
- For CSP-specific issues, contact your cloud provider support
- For ROCm issues, check AMD ROCm documentation
- For AIM-specific issues, open GitHub issue with diagnostic information
This section covers best practices for managing AIM containers, cleaning up resources, and maintaining a clean Docker environment.
Stop a specific container:
docker stop aim-qwen3-32b
Stop all running AIM containers:
docker ps --filter "name=aim" --format "{{.Names}}" | xargs -r docker stop
Stop all containers (use with caution):
docker stop $(docker ps -q)
Remove a stopped container:
docker rm aim-qwen3-32b
Remove container even if running (force):
docker rm -f aim-qwen3-32b
Remove all stopped AIM containers:
docker ps -a --filter "name=aim" --format "{{.Names}}" | xargs -r docker rm
Remove all stopped containers:
docker container prune -f
Remove unused containers, networks, and images:
docker system prune
Remove all unused resources including volumes (more aggressive):
docker system prune -a --volumes
Remove only unused images:
docker image prune -a
Remove only unused volumes:
docker volume prune
Remove only unused networks:
docker network prune
Stop and remove all AIM containers:
#!/bin/bash
# Stop all AIM containers
docker ps --filter "name=aim" --format "{{.Names}}" | xargs -r docker stop
# Remove all AIM containers
docker ps -a --filter "name=aim" --format "{{.Names}}" | xargs -r docker rm
# Optional: Remove AIM images (will need to pull again)
# docker rmi amdenterpriseai/aim-qwen-qwen3-32b:0.8.4
echo "AIM containers cleaned up"
List all containers (running and stopped):
docker ps -a
List only running containers:
docker ps
List containers by name pattern:
docker ps -a --filter "name=aim"
Check container resource usage:
docker stats aim-qwen3-32b
Check all container resource usage:
docker stats
View recent logs:
docker logs aim-qwen3-32b
Follow logs in real-time:
docker logs -f aim-qwen3-32b
View last N lines:
docker logs --tail 50 aim-qwen3-32b
View logs with timestamps:
docker logs -t aim-qwen3-32b
Restart a container:
docker restart aim-qwen3-32b
Start a stopped container:
docker start aim-qwen3-32b
Stop a running container:
docker stop aim-qwen3-32b
Check Docker disk usage:
docker system df
Detailed breakdown:
docker system df -v
Check specific container size:
docker ps -s --filter "name=aim-qwen3-32b"
- Regular Cleanup:
  # Weekly cleanup of unused resources
  docker system prune -f
- Before Redeployment:
  # Stop and remove the old container before deploying a new one
  docker stop aim-qwen3-32b
  docker rm aim-qwen3-32b
- Monitor Resource Usage:
  # Keep an eye on disk space
  docker system df
  df -h /
- Preserve Important Containers:
  # Tag important images before cleanup
  docker tag amdenterpriseai/aim-qwen-qwen3-32b:0.8.4 my-aim-backup:0.8.4
- Clean Up After Testing:
  # Remove test containers and images
  docker ps -a --filter "name=test" --format "{{.Names}}" | xargs -r docker rm -f
  docker images --filter "dangling=true" -q | xargs -r docker rmi
Container won't stop:
# Force stop
docker kill aim-qwen3-32b
# Then remove
docker rm aim-qwen3-32b
Container keeps restarting:
# Check restart policy
docker inspect aim-qwen3-32b | grep -A 5 RestartPolicy
# Remove restart policy
docker update --restart=no aim-qwen3-32b
Port already in use:
# Find what's using the port
sudo lsof -i :8000
# Or
sudo netstat -tulpn | grep 8000
# Stop the conflicting container
docker ps | grep 8000
docker stop <container-id>
Out of disk space:
# Check usage
docker system df
# Clean up
docker system prune -a --volumes
# Check system disk
df -h /
Container logs too large:
# Truncate logs (requires container restart)
truncate -s 0 $(docker inspect --format='{{.LogPath}}' aim-qwen3-32b)
# Or configure log rotation in the Docker daemon
# Stop AIM container
docker stop aim-qwen3-32b
# Remove AIM container
docker rm aim-qwen3-32b
# Stop and remove in one command
docker rm -f aim-qwen3-32b
# View container status
docker ps -a | grep aim
# View container logs
docker logs aim-qwen3-32b
# Check resource usage
docker stats aim-qwen3-32b
# Clean up unused resources
docker system prune -f
# Check disk usage
docker system df
The AIM framework provides a streamlined way to deploy AI models on AMD Instinct GPUs with minimal configuration overhead.
This comprehensive validation guide ensures that anyone with access to a similar CSP node can verify each step of the deployment process and troubleshoot issues as they arise.