scripts/perftune.py: detect corrupted NUMA topology information #2956

syuu1228 · 2025-09-01T13:35:07Z

Some environments expose corrupted NUMA topology information, for example certain AWS EC2 instance types running specific Linux kernel images.
In such environments hwloc cannot report hardware information correctly, so perftune.py cannot proceed with its tuning.
In that case perftune.py should print a readable error message and exit with a non-zero exit code.
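
For illustration only, a sanity check on the sysfs NUMA data could look roughly like the sketch below. This is not the implementation of check_sysfs_numa_topology_is_valid() from the patch, whose body is not shown in this thread. The function names, the `sysfs_root` parameter, and the heuristic itself (every online CPU must belong to exactly one NUMA node) are assumptions, and the sketch presumes a NUMA-enabled kernel that exposes /sys/devices/system/node.

```python
import glob
import os


def parse_cpulist(text):
    """Parse a sysfs CPU list such as "0-3,8,10-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_topology_looks_valid(sysfs_root="/sys/devices/system"):
    """Return True if every online CPU belongs to exactly one NUMA node.

    Hypothetical heuristic; unreadable or unparsable files count as invalid.
    """
    try:
        with open(os.path.join(sysfs_root, "cpu", "online")) as f:
            online = parse_cpulist(f.read())

        seen = set()
        pattern = os.path.join(sysfs_root, "node", "node[0-9]*", "cpulist")
        for path in glob.glob(pattern):
            with open(path) as f:
                node_cpus = parse_cpulist(f.read())
            if node_cpus & seen:
                return False  # the same CPU is claimed by two nodes
            seen |= node_cpus
    except (OSError, ValueError):
        return False

    # An empty online set, or an online CPU that no node claims, is suspicious.
    return bool(online) and online <= seen
```

The `sysfs_root` argument exists only so the check can be pointed at a fake directory tree in tests.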

Fixes #2925

avikivity · 2025-09-02T11:54:52Z

scripts/perftune.py

        else:
+            if not check_sysfs_numa_topology_is_valid():
+                raise Exception("NUMA topology information is corrupted")
            self.irqs_cpu_mask = auto_detect_irq_mask(self.cpu_mask, self.cores_per_irq_core)


In ScyllaDB, this will cause starting the service to fail, no?

Maybe it's better to do what we can and return without failure (logging an error), otherwise we leave the machine useless.

I understand your concern, but I think that should be implemented in the scylla_prepare script in the scylla-core repo, not in perftune.py itself.
I think exiting with a non-zero status is the correct behavior for the perftune.py script itself, since there is no way to continue the tuning.

Also, I found that when we try to enable perftune.py with scylla_sysconfig_setup --setup-nic-and-disks --nic eth0, scylla_sysconfig_setup runs perftune.py to get the cpumask, and that fails when the NUMA topology information is corrupted.
So scylla_sysconfig_setup cannot continue, and it never sets SET_NIC_AND_DISKS=yes in the sysconfig file.
So users most likely cannot enable perftune.py; they get an error when running the setup script.

But there is an exception: if the user runs scylla_sysconfig_setup --set-clocksource, it enables perftune.py without any error, since in that case the script does not run perftune.py to get the cpumask.
So the user may hit the error when starting scylla-server.service.

I guess we should call check_sysfs_numa_topology_is_valid() in scylla_sysconfig_setup too, just as we do in this patch, to avoid enabling perftune.py when the NUMA topology is not valid.

So I think I need to open another PR in the scylla-core repo that handles the perftune.py error in scylla_prepare and checks the NUMA topology in scylla_sysconfig_setup.
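
A minimal sketch of the setup-time guard described above, assuming every option that can enable perftune.py (including the --set-clocksource path) is routed through one decision function. The function name, the logger name, and the `topology_is_valid` callable are hypothetical; the real scylla_sysconfig_setup code is not shown in this thread.

```python
import logging

log = logging.getLogger("sysconfig_setup_sketch")  # hypothetical logger name


def resolve_set_nic_and_disks(requested, topology_is_valid):
    """Decide which SET_NIC_AND_DISKS value to write to the sysconfig file.

    requested:         True if any command-line option asked to enable perftune.py.
    topology_is_valid: zero-argument callable returning True when the NUMA
                       topology can be trusted (an assumption of this sketch).
    """
    if not requested:
        return "no"
    if not topology_is_valid():
        # Refuse to enable perftune.py, so the failure shows up at setup time
        # instead of when scylla-server.service starts.
        log.error("NUMA topology information is corrupted; "
                  "leaving perftune.py disabled")
        return "no"
    return "yes"
```

Passing the check in as a callable keeps the decision testable without touching sysfs, and the check only runs when enabling was actually requested.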

Let's have a special error code for this, so the caller can understand that this is an expected failure.
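
As a rough sketch of that idea: the script maps the expected failure to a dedicated exit status, and the caller treats that status as "tuning skipped" rather than as a crash. The value 42, the exception class, and all function names below are assumptions made for illustration; the actual code chosen in the patch is not shown in this thread.

```python
import subprocess
import sys

# Hypothetical value; the exit status actually used by the patch is not shown here.
EXIT_NUMA_TOPOLOGY_CORRUPTED = 42


class NumaTopologyError(Exception):
    """Raised when the NUMA topology information cannot be trusted."""


def run_tuning(tune):
    """Script side: run tune() and translate the expected failure to an exit status.

    Any other exception propagates and ends up as an ordinary traceback.
    """
    try:
        tune()
    except NumaTopologyError as e:
        print(f"ERROR: {e}", file=sys.stderr)
        return EXIT_NUMA_TOPOLOGY_CORRUPTED
    return 0


def classify_tuning_result(returncode):
    """Caller side: tell the expected failure apart from real errors."""
    if returncode == 0:
        return "tuned"
    if returncode == EXIT_NUMA_TOPOLOGY_CORRUPTED:
        return "skipped"
    return "failed"


def call_tuner(argv):
    """Caller side: run the tuning command and classify its exit status."""
    result = subprocess.run(argv, check=False)
    return classify_tuning_result(result.returncode)
```

With this split, a caller such as scylla_prepare could log an error and continue on "skipped" while still failing on "failed".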

Implemented the special error code, and also implemented the scylla-core part of the patch:
scylladb/scylladb#26344

Some environments expose corrupted NUMA topology information, for example certain AWS EC2 instance types running specific Linux kernel images.
In such environments hwloc cannot report hardware information correctly, so perftune.py cannot proceed with its tuning.
In that case perftune.py should print a readable error message and exit with a non-zero exit code.

Fixes scylladb#2925

Signed-off-by: Takuya ASADA <[email protected]>

syuu1228 mentioned this pull request Sep 1, 2025

perftune.py: fail on im4gn.8xlarge and im4gn.16xlarge #2925

Open

avikivity reviewed Sep 2, 2025

syuu1228 force-pushed the perftune_detect_corrupted_numa_topology branch from 8f40926 to c2b2b55 Compare September 30, 2025 18:04

syuu1228 mentioned this pull request Sep 30, 2025

dist: detect corrupted NUMA topology information scylladb/scylladb#26344

Open

syuu1228 requested a review from yaronkaikov October 1, 2025 03:15

syuu1228 mentioned this pull request Oct 1, 2025

scylla_image_setup: avoid script error when perftune.py failed scylladb/scylla-machine-image#787

Open

syuu1228 commented Sep 1, 2025

avikivity Sep 2, 2025

syuu1228 Sep 9, 2025

avikivity Sep 9, 2025

syuu1228 Sep 30, 2025
