Contributor

@syuu1228 syuu1228 commented Sep 1, 2025

Some environments have corrupted NUMA topology information, such as certain AWS EC2 instance types running specific Linux kernel images.
In such environments, hwloc cannot report the hardware information correctly, so perftune cannot proceed with its optimization.
perftune.py should print a readable error message and exit with a non-zero exit code.

Fixes #2925
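
For illustration, here is a minimal sketch of what a sysfs-based topology sanity check could look like. It is not the implementation of check_sysfs_numa_topology_is_valid() from this patch; the function names, the `sysfs_root` parameter, and the specific rule (every online CPU is covered by some node, and no CPU is listed under two nodes) are assumptions made for the example.

```python
import os


def parse_cpulist(text):
    """Expand a sysfs list such as "0-3,8" into a set of integer ids."""
    ids = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            ids.update(range(int(lo), int(hi) + 1))
        else:
            ids.add(int(part))
    return ids


def numa_topology_looks_valid(sysfs_root="/sys/devices/system"):
    """Hypothetical check: each online CPU is in a node, and in only one."""

    def read(rel):
        with open(os.path.join(sysfs_root, rel)) as f:
            return f.read()

    try:
        online_cpus = parse_cpulist(read("cpu/online"))
        seen = set()
        for node in parse_cpulist(read("node/online")):
            node_cpus = parse_cpulist(read(f"node/node{node}/cpulist"))
            if node_cpus & seen:
                return False  # a CPU is listed under two nodes
            seen |= node_cpus
    except (OSError, ValueError):
        return False  # missing or unparsable files count as invalid
    return online_cpus <= seen
```

Taking the sysfs root as a parameter is only a convenience so the check can be exercised against a fake directory tree.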

else:
    if not check_sysfs_numa_topology_is_valid():
        raise Exception("NUMA topology information is corrupted")
    self.irqs_cpu_mask = auto_detect_irq_mask(self.cpu_mask, self.cores_per_irq_core)
Member

In ScyllaDB, this will cause starting the service to fail, no?

Maybe it's better to do what we can and return without failure (logging an error), otherwise we leave the machine useless.

Contributor Author

> In ScyllaDB, this will cause starting the service to fail, no?
>
> Maybe it's better to do what we can and return without failure (logging an error), otherwise we leave the machine useless.

I understand your concern, but I think that should be implemented in the scylla_prepare script in the scylla-core repo, not in perftune.py itself.
I think exiting with a non-zero status is the correct behavior for the perftune.py script itself, since there is no way to continue the tune-up.

Also, I found that when we try to enable perftune.py with scylla_sysconfig_setup --setup-nic-and-disks --nic eth0, scylla_sysconfig_setup runs perftune.py to get the cpumask, and that fails when the NUMA topology information is corrupted.
So scylla_sysconfig_setup cannot continue, and it never sets SET_NIC_AND_DISKS=yes in the sysconfig file.
So users most likely cannot enable perftune.py; they get an error when they run the setup script.

But there is an exception: if the user runs scylla_sysconfig_setup --set-clocksource, it just enables perftune.py without any error, since that path does not run perftune.py to get the cpumask.
So the user may hit an error when they start scylla-server.service.

I guess we should also call check_sysfs_numa_topology_is_valid() in scylla_sysconfig_setup, just as we do in this patch, to avoid enabling perftune when the NUMA topology is not valid.

So I think I need to open another PR on the scylla-core repo that handles the perftune.py error in scylla_prepare and checks the NUMA topology in scylla_sysconfig_setup.
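
Roughly, the setup-side guard could look like the sketch below. This is not the scylla_sysconfig_setup code; `nic_and_disks_setting` and its `topology_is_valid` callback are hypothetical names used only to show the idea of refusing to enable perftune when the topology check fails.

```python
def nic_and_disks_setting(requested, topology_is_valid):
    """Return the SET_NIC_AND_DISKS value a setup script could write.

    requested: whether the user asked to enable perftune.
    topology_is_valid: zero-argument callable (hypothetical), standing in
    for a check such as check_sysfs_numa_topology_is_valid().
    """
    if not requested:
        return "no"
    if not topology_is_valid():
        # Do not enable perftune on a machine where it is known to fail.
        print("NUMA topology information is corrupted; leaving perftune disabled")
        return "no"
    return "yes"
```

With a guard like this, every code path that enables perftune would go through the same check, including the one that does not run perftune.py itself.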

Member

Let's have a special error code for this, so the caller can understand that this is an expected failure.

Contributor Author

Implemented a special error code, and also implemented the scylla-core repository part of the patch:
scylladb/scylladb#26344
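
As a rough sketch of how a caller might treat a dedicated exit code as an expected failure: the constant's value and the `run_tuning` helper below are assumptions for illustration, not the values or code used in either patch.

```python
import subprocess
import sys

# Hypothetical value; the exit code actually chosen by the patch is not shown here.
EXIT_NUMA_TOPOLOGY_INVALID = 42


def run_tuning(cmd):
    """Run a tuning command.

    Returns True if tuning was applied and False if it was skipped because
    of the expected "corrupted NUMA topology" failure. Any other non-zero
    exit status is raised as an error.
    """
    result = subprocess.run(cmd)
    if result.returncode == 0:
        return True
    if result.returncode == EXIT_NUMA_TOPOLOGY_INVALID:
        print("tuning skipped: NUMA topology information is corrupted",
              file=sys.stderr)
        return False
    raise subprocess.CalledProcessError(result.returncode, cmd)
```

This keeps the service start from failing on the known condition while still surfacing unexpected failures.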

Some environments have corrupted NUMA topology information, such as
certain AWS EC2 instance types running specific Linux kernel images.
In such environments, hwloc cannot report the hardware information
correctly, so perftune cannot proceed with its optimization.
perftune.py should print a readable error message and exit with a
non-zero exit code.

Fixes scylladb#2925

Signed-off-by: Takuya ASADA <[email protected]>
Successfully merging this pull request may close these issues.

perftune.py: fail on im4gn.8xlarge and im4gn.16xlarge