Stress-testing NVMe hangs for hours due to invoking offline CPU nodes #463

Asalle · 2024-12-18T15:00:02Z

Environment:

Machine: Jetson AGX Orin
Architecture: arm64
Operating system: Ubuntu 22.04.4 LTS (jammy)
stress-ng version: stress-ng, version 0.18.07 (gcc 11.4.0, aarch64 Linux 5.15.148-tegra) 💻🔥

Steps to reproduce:

Install stress-ng from the PPA: ppa:colin-king/stress-ng
Create temp workdir /mnt/nvme0n1p1/bg-temp/ and mount the drive: sudo mkdir -p /mnt/nvme0n1p1/ && sudo mount /dev/nvme0n1p1 /mnt/nvme0n1p1 && mkdir /mnt/nvme0n1p1/bg-temp/
Run test:
sudo stress-ng --aggressive --verify --timeout 10 --temp-path /mnt/nvme0n1p1/bg-temp/ --hdd-opts dsync --readahead-bytes 16M -k --chmod 0

Expected result:
The test finishes within the 10-second timeout.

Actual result:
Test hangs indefinitely.

$ sudo stress-ng --aggressive --verify --timeout 10 --temp-path /mnt/nvme0n1p1/bg-temp/ --hdd-opts dsync --readahead-bytes 16M -k --chmod 0
stress-ng: info:  [2130] setting to a 10 secs run per stressor
stress-ng: info:  [2130] dispatching hogs: 12 chmod
^C^Z

Probable root cause:

The command probably hangs because it uses CPU nodes that are offline. The log from the failing run shows 12 chmod instances, whereas according to jetson_clocks only 8 CPUs (0-7) are currently online:

$ sudo jetson_clocks --show
SOC family:tegra234  Machine:NVIDIA Jetson AGX Orin Developer Kit
Online CPUs: 0-7

Making all 12 CPU nodes available with sudo nvpmodel -m 3 makes the above command work as expected. (nvpmodel is an NVIDIA utility for switching between power modes. By default we're in power-saving mode, which explains why some CPUs are offline.)
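To make the mismatch above concrete, here is a minimal sketch that expands a CPU list such as the "0-7" reported by jetson_clocks and compares it with the 12 dispatched chmod instances. The helper names parse_cpu_list and read_online_cpus are hypothetical and not part of stress-ng or the Jetson tools; the sketch also assumes that /sys/devices/system/cpu/online is available on the target kernel and uses the same range format.

```python
def parse_cpu_list(text):
    """Expand a Linux-style CPU list such as "0-7" or "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty input or a trailing comma
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_online_cpus(path="/sys/devices/system/cpu/online"):
    """Return the set of online CPU ids, or None if the sysfs file is unavailable.

    Assumption: the kernel exposes the online CPU list at this sysfs path.
    """
    try:
        with open(path) as f:
            return parse_cpu_list(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    online = parse_cpu_list("0-7")  # value reported by jetson_clocks in this report
    workers = 12                    # from "dispatching hogs: 12 chmod"
    print(f"{workers} workers for {len(online)} online CPUs")
```

On the affected machine, calling read_online_cpus() before and after sudo nvpmodel -m 3 should show whether the set of online CPUs actually changes with the power mode.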


ColinIanKing · 2024-12-18T17:23:22Z

I suggest re-running this using:

sudo stress-ng --aggressive --verify --timeout 10 --temp-path /mnt/nvme0n1p1/bg-temp/ --hdd-opts dsync --readahead-bytes 16M -k --chmod -1 --vmstat 1 --klog-check -v

Using --chmod -1 will select just the number of online CPUs rather than the total number of CPUs configured in the system. Using --vmstat 1 will show you the system activity, so you can see if it's still doing I/O after the 10 seconds. Using --klog-check will dump out any kernel errors found in the kernel log. The -v option will show the stress-ng activity in verbose mode.

The manual states:

"One can specify the number of processes to invoke per type of stress test; specifying a zero value will select the number of processors available as defined by sysconf(_SC_NPROCESSORS_CONF), if that can't be determined then the number of online CPUs is used. If the value is less than zero then the number of online CPUs is used."

Asalle · 2024-12-18T22:58:52Z

Hi Colin, thanks for the quick reply!
vmstat keeps printing lines after the initial 10s timeout until I press Ctrl-Z, but they show no I/O, and I don't see any kernel errors in the log either. Using -1 does restrict the run to the CPUs that are online. Unfortunately, the command still hangs in "power-saving" mode. Once I switch to the "all 12 CPUs available" mode, it succeeds.

stress-ng-success.log
stress-ng-failure.log
