Commit 4a88431

Jordi Fierro authored and committed

Post ai rig 2nd part

1 parent e94cfff

File tree: 6 files changed (+327, -3 lines)

`_posts/2025-06-15-ai-rig-from-scratch-1.md` (5 additions, 3 deletions)

```diff
@@ -149,16 +149,18 @@ Tadaaaaa!
 ![AI rig from scratch](/assets/images/rig_finished_1.jpg)
 
 All that was left was to connect the main power supply, attach the WiFi antennas,
-and plug in a temporary screen and keyboard to install Ubuntu Server.
+and plug in a temporary screen and keyboard to
+start [software installation](https://jordifierro.dev/ai-rig-from-scratch-2).
 
 ### Bonus: An extra fan for peace of mind
 
-After installing the OS and running some thermal tests (more on that in the next post!),
+After installing the OS and running some
+[thermal tests](https://jordifierro.dev/ai-rig-from-scratch-2)
 I noticed that one of the SSD sensors was reporting high temperatures.
 To improve airflow, I decided to add another slim fan to the bottom of the case.
 
 The magnetic dust filter on the bottom made this incredibly easy.
-The Arctic P12 Slim fan even came with a Y-splitter cable, making the connection straightforward.
+The **Arctic P12 Slim** fan even came with a Y-splitter cable, making the connection straightforward.
 I did have to briefly remove the GPU to access the fan header, but it was no big deal.
 
 ![Second bottom fan](/assets/images/rig_second_fan_1.jpg)
```
New file (322 additions, 0 deletions):
---
layout: post
title: "AI rig from scratch II: OS, drivers and stress testing"
date: 2025-06-16 10:00:00 +0100
categories: development
comments: true
---

# AI rig from scratch II: OS, drivers and stress testing

![AI rig back panel](/assets/images/rig_back_detail.png)

## Introduction: Bringing the beast to life

In the [first part of this series](https://jordifierro.dev/ai-rig-from-scratch-1),
we carefully selected our components and assembled the hardware for my new AI rig.
Now, with the physical build complete, it's time for the crucial next phase:
installing the operating system, ensuring all components are correctly recognized,
setting up the necessary drivers, and, most importantly, verifying that our
cooling system can handle intense AI workloads.

Let's get this machine ready to crunch some numbers!

## Step 1: Operating system and initial BIOS configuration

Choosing an OS for a headless AI server is a key decision. I chose **Ubuntu Server**
for several reasons: it's stable, has extensive community support, and is
widely used in the AI/ML world. Its command-line interface is perfect
for a server that will be accessed remotely.

To start, I downloaded the latest ISO from the
[official website](https://ubuntu.com/download/server) and used their
[step-by-step tutorial](https://ubuntu.com/tutorials/install-ubuntu-server#1-overview)
to create a bootable USB drive.

With the USB stick ready, I connected it to the rig along with a monitor,
keyboard, and an Ethernet cable, and hit the power button for the first time.
The installation process was straightforward. I mostly followed the defaults,
with a few key selections:

* I attempted to install third-party drivers, but none were found at this stage.
* I included **Docker** and **OpenSSH** in the initial setup, as I knew I would
need them later.

Once the installation finished, I removed the USB drive and rebooted.
The system came alive with a fresh OS. The first commands are always the same:

```bash
sudo apt update && sudo apt upgrade
```
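Since the rig will run headless, it's also worth confirming from a script exactly which release ended up installed. A small sketch (the file path is a parameter only so the helper is easy to test; on the rig you'd point it at `/etc/os-release`):

```shell
#!/usr/bin/env bash
# Print the human-readable distro name from an os-release style file.
os_pretty_name() {
  sed -n 's/^PRETTY_NAME="\(.*\)"$/\1/p' "$1"
}

# On the server: os_pretty_name /etc/os-release
```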
Before diving deeper into the software, I rebooted and pressed the DEL key
to enter the BIOS. There were two critical settings to adjust:

**RAM profile**: I enabled the AMD EXPO I profile to ensure my
Patriot Viper Venom RAM was running at its rated speed of 6000 MT/s.

![AMD EXPO ram profile](/assets/images/rig_bios_2.jpg)

**Fan curve**: I switched the fan settings from "Silent" to "Standard"
to prioritize cooling over absolute silence, a sensible
trade-off for a high-performance machine.

![Fan settings](/assets/images/rig_bios_1.jpg)

After saving the changes and exiting the BIOS, the foundational setup was complete.

![Save changes and exit](/assets/images/rig_bios_3.jpg)

## Step 2: Establishing connectivity (Wi-Fi and remote access)

My plan is to place the rig in a convenient spot, which means I'll be relying
on Wi-Fi instead of an Ethernet cable. On a server, setting up Wi-Fi
requires a few manual steps.

First, I confirmed the Wi-Fi driver was loaded correctly by the kernel.

```bash
# First, ensure core network tools are present
sudo apt install wireless-tools

# Check for a wireless interface (e.g., wlan0 or, in my case, wl...)
ip link
lspci -nnk | grep -iA3 network
dmesg | grep -i wifi
```

The output confirmed the `mt7921e` driver for my motherboard's Wi-Fi chip
was active. With the driver in place, I just needed to connect to my network
using `network-manager`.

```bash
# Install network-manager
sudo apt install network-manager

# Scan for available networks
nmcli device wifi list

# Connect to my home network (replace with your SSID and password)
nmcli device wifi connect "Your_SSID" password "your_password"

# Test the connection
ping -c 4 google.com

# Set the connection to start automatically on boot
nmcli connection modify "Your_SSID" connection.autoconnect yes
```
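nmcli's terse output mode (`-t`) makes the scan step scriptable. A sketch, assuming `SSID:SIGNAL` lines such as `nmcli -t -f SSID,SIGNAL device wifi list` produces (the network names below are made up):

```shell
#!/usr/bin/env bash
# Given "SSID:SIGNAL" lines on stdin, print the SSID with the best signal.
strongest_ssid() {
  sort -t: -k2,2nr | head -n1 | cut -d: -f1
}

# Example with a captured scan (hypothetical networks):
printf 'Neighbors5G:54\nHomeWifi:87\nCafeGuest:33\n' | strongest_ssid
# prints "HomeWifi"
```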
With the rig now on my local network, I enabled SSH to allow remote connections.

```bash
sudo systemctl enable ssh
sudo systemctl start ssh
```

Now I could disconnect the monitor and keyboard and access the rig from my laptop!
To take remote access a step further, I installed [Tailscale](https://tailscale.com),
a fantastic tool that creates a secure private network (a VPN) between your devices.
After signing up and following the simple instructions to add my rig and laptop,
I could SSH into my machine from anywhere, not just my local network.
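On the laptop side, a `~/.ssh/config` entry keeps the command short; a sketch in which the host alias, address, and username are all placeholders for your own values:

```
Host rig
    HostName 100.x.y.z   # LAN IP or Tailscale address of the rig
    User youruser
```

With that in place, `ssh rig` works from anywhere Tailscale reaches.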
## Step 3: Verifying hardware and thermals

With the OS running, it was time to confirm that all our expensive components
were recognized and running correctly. The BIOS gives a good overview,
but we can double-check from the command line.

```bash
# Check CPU info
lscpu

# Check RAM size
free -h

# List all PCI devices (including the GPU)
lspci -v
```
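Those checks can also be scripted, so a cron job or login banner flags problems early. A sketch of one such assertion, a minimum-RAM check against `/proc/meminfo` (input comes from stdin only so the helper is testable; the 60 GiB threshold is just an example):

```shell
#!/usr/bin/env bash
# Succeed iff a /proc/meminfo-style stream reports at least the given GiB of RAM.
has_min_ram_gib() {
  awk -v min="$1" '
    /^MemTotal:/ { found = 1; exit !($2 >= min * 1024 * 1024) }  # $2 is in kB
    END          { if (!found) exit 1 }
  '
}

# On the server: has_min_ram_gib 60 < /proc/meminfo && echo "RAM OK"
```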
Everything looked good. Next, I checked the component temperatures at idle
using `lm-sensors`.

```bash
sudo apt install lm-sensors
sensors
```

This revealed an issue. While most temps were fine, one of the SSD sensors
was running hot.

Initial idle temps (before adding the extra fan):

```
amdgpu-pci-0d00
Adapter: PCI adapter
vddgfx:      719.00 mV
vddnb:         1.01 V
edge:        +48.0°C
PPT:          20.10 W

nvme-pci-0200
Adapter: PCI adapter
Composite: +51.9°C  (low = -273.1°C, high = +74.8°C) (crit = +79.8°C)
Sensor 1:  +70.8°C  (low = -273.1°C, high = +65261.8°C)  <-- This is too high for idle!
Sensor 2:  +51.9°C  (low = -273.1°C, high = +65261.8°C)
Sensor 3:  +51.9°C  (low = -273.1°C, high = +65261.8°C)

mt7921_phy0-pci-0800
Adapter: PCI adapter
temp1:     +44.0°C

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:      +50.4°C
Tccd1:     +42.4°C
```
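Rather than eyeballing the whole `sensors` dump each time, the interesting value can be grepped out, which makes the before/after fan comparison a one-liner. A sketch matching the output format above:

```shell
#!/usr/bin/env bash
# Print the first temperature reported on the named sensor line, e.g. "Sensor 1".
sensor_temp() {
  grep "^$1:" | grep -oE '[0-9]+\.[0-9]+' | head -n1
}

# Against the captured idle reading:
printf 'Sensor 1:  +70.8°C  (low = -273.1°C, high = +65261.8°C)\n' | sensor_temp 'Sensor 1'
# prints "70.8"
```

On the rig itself you'd pipe the live output in: `sensors | sensor_temp 'Sensor 1'`.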
This is why we test! As mentioned in [part I](https://jordifierro.dev/ai-rig-from-scratch-1),
I installed an extra Arctic P12 Slim fan at the bottom of the case to improve airflow over
the motherboard. The results were immediate and significant.

```
nvme-pci-0200
Adapter: PCI adapter
Composite: +41.9°C  (low = -273.1°C, high = +74.8°C) (crit = +79.8°C)
Sensor 1:  +60.9°C  (low = -273.1°C, high = +65261.8°C)
Sensor 2:  +53.9°C  (low = -273.1°C, high = +65261.8°C)
Sensor 3:  +41.9°C  (low = -273.1°C, high = +65261.8°C)
```

Problem solved. The extra 10€ fan was well worth it for the peace of mind.
## Step 4: Installing the NVIDIA driver

The most critical driver for an AI rig is the NVIDIA driver. I used the
`ppa:graphics-drivers/ppa` repository to get the latest versions.

```bash
sudo add-apt-repository ppa:graphics-drivers/ppa
ubuntu-drivers devices

== /sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0 ==
modalias : pci:v000010DEd00002D04sv00001043sd00008A11bc03sc00i00
vendor   : NVIDIA Corporation
driver   : nvidia-driver-570 - third-party non-free recommended
driver   : nvidia-driver-570-open - third-party non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
```

The tool recommended the proprietary driver, but I found that the open-source
kernel module variant (`-open`) was the one that worked for my setup.

To install it and prevent conflicts with the default `nouveau` driver,
I ran the following:

```bash
# Install the open-source variant of the driver
sudo apt install nvidia-driver-570-open

# Blacklist the default nouveau driver
sudo bash -c 'echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist-nouveau.conf'

# Update the initial RAM filesystem and reboot
sudo update-initramfs -u
sudo reboot
```

After the reboot, running `nvidia-smi` confirmed the driver was loaded
and the GPU was ready!
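`nvidia-smi` also has a CSV query mode that comes in handy for scripted monitoring later on (`--query-gpu` and the field names are standard nvidia-smi options; the sample line below is made up):

```shell
#!/usr/bin/env bash
# Query just temperature, utilization and memory use as plain CSV:
#   nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,memory.used --format=csv,noheader,nounits

# Helper to pull the temperature (first field) out of such a line:
gpu_temp_from_csv() {
  cut -d, -f1 | tr -d ' '
}

printf '45, 0, 312\n' | gpu_temp_from_csv
# prints "45"
```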
## Step 5: Putting the rig to the test (stress testing)

With everything installed, it was time for the moment of truth. Can the system
remain stable and cool under heavy, sustained load? I conducted three separate
stress tests, monitoring temperatures in a separate SSH window using
`watch sensors` and `watch nvidia-smi`.

### CPU stress test

First, I used `stress-ng` to max out all 8 CPU cores for 5 minutes.

```bash
sudo apt install stress-ng
stress-ng --cpu 8 --timeout 300s
```

**Result**: The CPU temperature peaked at **73.4°C**. This is a great result,
showing the AIO cooler is more than capable of handling the Ryzen 7 7700
at full tilt.

```
k10temp-pci-00c3
Adapter: PCI adapter
Tctl:      +73.4°C
```

### SSD stress test

Next, I used `fio` to simulate a heavy random write workload on the NVMe SSD
for 1 minute.

```bash
sudo apt install fio
fio --name=nvme_stress_test --ioengine=libaio --rw=randwrite --bs=4k --size=1G --numjobs=4 --time_based --runtime=60 --group_reporting
```

**Result**: The notorious "Sensor 1" heated up to **89.8°C**. While high,
this is a worst-case scenario, and the drive throttles on the `Composite`
reading, which remained at a healthy **56.9°C**, well under its **79.8°C**
critical limit. For my use case, this is perfectly acceptable.

```
nvme-pci-0200
Adapter: PCI adapter
Composite: +56.9°C  (crit = +79.8°C)
Sensor 1:  +89.8°C
```

### GPU stress test

Finally, the main event. I used [gpu-burn](https://github.com/wilicc/gpu-burn)
inside a Docker container to push the RTX 5060 Ti to its absolute limit.
First, I had to set up the NVIDIA Container Toolkit.

```bash
# Set up the NVIDIA Container Toolkit
distribution=ubuntu22.04 # Workaround for 24.04
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-docker2
sudo systemctl restart docker
```

With Docker ready, I cloned the `gpu-burn` repository and ran the test.

```bash
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
docker build -t gpu_burn .
docker run --rm --gpus all gpu_burn
```

**Result**: Success! The GPU temperature climbed steadily but stabilized
at a maximum of **72°C** while running at 100% load, processing nearly
5000 Gflop/s. The test completed with zero errors.

```
100.0%  proc'd: 260 (4880 Gflop/s)   errors: 0   temps: 72 C
...
Tested 1 GPUs:
	GPU 0: OK
```
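For unattended burn-in runs, it's useful to pull the final error count out of the log instead of reading it by eye. A sketch against the gpu-burn output format shown above:

```shell
#!/usr/bin/env bash
# Print the last "errors: N" count from a gpu-burn log on stdin.
burn_errors() {
  grep -oE 'errors: [0-9]+' | tail -n1 | awk '{ print $2 }'
}

# Against the run above:
printf '%s\n' "100.0% proc'd: 260 (4880 Gflop/s) errors: 0 temps: 72 C" | burn_errors
# prints "0"
```

Piped from `docker run --rm --gpus all gpu_burn | burn_errors`, a non-zero result is an immediate red flag.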
## Conclusion: We are ready for AI!

The rig is alive, stable, and cool. We've successfully installed and configured
the operating system, established remote connectivity, verified all our hardware,
and pushed every core component to its limit to ensure it can handle the heat.

The system passed all tests with flying colors, proving that our component choices
and cooling setup were effective. Now that we have a solid and reliable foundation,
the real fun can begin. In the next post, we'll finally start using this machine
for its intended purpose: **running and training AI models**. Stay tuned!

assets/images/rig_back_detail.png (697 KB)

assets/images/rig_bios_1.jpg (343 KB)

assets/images/rig_bios_2.jpg (349 KB)

assets/images/rig_bios_3.jpg (396 KB)
