Stress-testing NVMe hangs for hours due to invoking offline CPU nodes #463

Open
Asalle opened this issue Dec 18, 2024 · 2 comments

@Asalle

Asalle commented Dec 18, 2024

Environment:

  • Machine: Jetson AGX Orin
  • Architecture: arm64
  • Operating system: Ubuntu 22.04.4 LTS (jammy)
  • stress-ng version: stress-ng, version 0.18.07 (gcc 11.4.0, aarch64 Linux 5.15.148-tegra) 💻🔥

Steps to reproduce:

  1. Install stress-ng from the PPA: ppa:colin-king/stress-ng
  2. Mount the drive and create the temporary working directory /mnt/nvme0n1p1/bg-temp/: sudo mkdir -p /mnt/nvme0n1p1/ && sudo mount /dev/nvme0n1p1 /mnt/nvme0n1p1 && mkdir /mnt/nvme0n1p1/bg-temp/
  3. Run test:
    sudo stress-ng --aggressive --verify --timeout 10 --temp-path /mnt/nvme0n1p1/bg-temp/ --hdd-opts dsync --readahead-bytes 16M -k --chmod 0

Expected result:
The test finishes within the 10-second timeout.

Actual result:
The test hangs indefinitely.

$ sudo stress-ng --aggressive --verify --timeout 10 --temp-path /mnt/nvme0n1p1/bg-temp/ --hdd-opts dsync --readahead-bytes 16M -k --chmod 0
stress-ng: info:  [2130] setting to a 10 secs run per stressor
stress-ng: info:  [2130] dispatching hogs: 12 chmod
^C^Z
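
When reproducing a hang like the one above, it can help to bound the wait from the outside instead of interrupting by hand. The following is a minimal sketch, assuming GNU coreutils `timeout` is installed; `run_with_deadline` is a hypothetical helper name, not part of stress-ng. It only limits how long the caller waits and would not help if the processes are stuck in an uninterruptible state.

```shell
# run_with_deadline SECONDS COMMAND [ARGS...]
# Hypothetical helper: run COMMAND, send SIGTERM once SECONDS have elapsed,
# and follow up with SIGKILL 5 seconds later if it is still running.
run_with_deadline() {
    deadline=$1
    shift
    rc=0
    timeout -k 5 "$deadline" "$@" || rc=$?
    # GNU timeout exits with status 124 when the command timed out.
    if [ "$rc" -eq 124 ]; then
        echo "command exceeded ${deadline}s and was terminated" >&2
    fi
    return "$rc"
}

# Possible usage with the command from this report (not run here):
# run_with_deadline 60 sudo stress-ng --aggressive --verify --timeout 10 \
#     --temp-path /mnt/nvme0n1p1/bg-temp/ --hdd-opts dsync \
#     --readahead-bytes 16M -k --chmod 0
```

The 60-second deadline in the usage comment is an arbitrary choice, picked only to be comfortably longer than the 10-second stress-ng timeout.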

Probable root cause:

The command probably hangs because it uses CPU nodes that are offline. The failure log shows 12 chmod stressors being dispatched, whereas according to jetson_clocks only 8 CPU nodes are currently online:

$ sudo jetson_clocks --show
SOC family:tegra234  Machine:NVIDIA Jetson AGX Orin Developer Kit
Online CPUs: 0-7

Making all 12 CPU nodes available with sudo nvpmodel -m 3 makes the above command work as expected. (nvpmodel is an NVIDIA utility for switching between power modes; by default we're in power-saving mode, which explains why some CPUs are offline.)
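
The mismatch between configured and online CPUs can also be checked without NVIDIA tooling. This is a rough sketch, assuming a Linux system where `getconf` supports the `_NPROCESSORS_CONF` and `_NPROCESSORS_ONLN` variables (as glibc does); the variable names `conf` and `onln` are illustrative only.

```shell
# Number of CPUs configured on the system (may include offline ones).
conf=$(getconf _NPROCESSORS_CONF)
# Number of CPUs currently online.
onln=$(getconf _NPROCESSORS_ONLN)
echo "configured=$conf online=$onln"

# The kernel's list of online CPUs, when sysfs exposes it.
if [ -r /sys/devices/system/cpu/online ]; then
    echo "online range: $(cat /sys/devices/system/cpu/online)"
fi

# Flag the situation described in this report.
if [ "$onln" -lt "$conf" ]; then
    echo "some CPUs are offline"
fi
```

On the machine described here, one would expect the two counts to differ in power-saving mode and to match after switching power modes, though that has not been verified as part of this sketch.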

@ColinIanKing
Owner

I suggest re-running this using:

sudo stress-ng --aggressive --verify --timeout 10 --temp-path /mnt/nvme0n1p1/bg-temp/ --hdd-opts dsync --readahead-bytes 16M -k --chmod -1 --vmstat 1 --klog-check -v

Using --chmod -1 will select just the number of online CPUs rather than the total number of CPUs configured on the system. Using --vmstat 1 will show you the system activity, so you can see if it's still doing I/O after the 10 seconds. Using --klog-check will dump out any kernel errors found in the kernel log. The -v option will show the stress-ng activity in verbose mode.

The manual states:

"One can specify the number of processes to invoke per type of stress test; specifying a zero value will select the number of processors available as defined by sysconf(_SC_NPROCESSORS_CONF), if that can't be determined then the number of online CPUs is used. If the value is less than zero then the number of online CPUs is used."
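
The rule quoted from the manual can be sketched as a small shell function. This only illustrates the quoted wording and is not stress-ng's actual implementation; `resolve_instances` is a hypothetical name, and `getconf` is assumed to support the `_NPROCESSORS_CONF` and `_NPROCESSORS_ONLN` variables.

```shell
# resolve_instances N
# Hypothetical helper: print the instance count that a requested value N
# would map to under the rule quoted from the manual.
resolve_instances() {
    n=$1
    if [ "$n" -eq 0 ]; then
        # Zero: configured CPUs, falling back to online CPUs.
        getconf _NPROCESSORS_CONF 2>/dev/null || getconf _NPROCESSORS_ONLN
    elif [ "$n" -lt 0 ]; then
        # Negative: online CPUs only.
        getconf _NPROCESSORS_ONLN
    else
        # Positive: use the requested count as given.
        echo "$n"
    fi
}

echo "--chmod 0  -> $(resolve_instances 0) instances"
echo "--chmod -1 -> $(resolve_instances -1) instances"
```

Under this reading, --chmod 0 on a system with offline CPUs asks for more instances than there are online CPUs, which matches the "dispatching hogs: 12 chmod" line in the report.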

@Asalle
Author

Asalle commented Dec 18, 2024

Hi Colin, thanks for the quick reply!
The vmstat lines keep being printed after the initial 10 s timeout until I press Ctrl-Z, but they show no I/O, and I don't see any kernel errors in the log either. Using -1 does restrict the run to the CPUs that are online. Unfortunately, the command still hangs in "power-saving" mode. Once I switch to the "all 12 CPUs available" mode, it succeeds.

stress-ng-success.log
stress-ng-failure.log
