scripts/perftune.py: detect corrupted NUMA topology information #2956
else:
    if not check_sysfs_numa_topology_is_valid():
        raise Exception("NUMA topology information is corrupted")
    self.irqs_cpu_mask = auto_detect_irq_mask(self.cpu_mask, self.cores_per_irq_core)
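For illustration only, here is a rough sketch of what a sysfs-based validity check could look like. The body of check_sysfs_numa_topology_is_valid() is not shown in this thread, so the names `parse_cpulist` and `numa_topology_looks_valid`, the `sysfs_root` parameter, and the rule "every online CPU must belong to exactly one NUMA node" are all assumptions, not the patch's actual logic.

```python
import glob
import os


def parse_cpulist(text):
    """Parse a sysfs cpulist string such as '0-3,8' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_topology_looks_valid(sysfs_root="/sys/devices/system"):
    """Hypothetical check: every online CPU is listed by exactly one node.

    Assumption: a system with no node*/cpulist entries is treated as invalid.
    """
    with open(os.path.join(sysfs_root, "cpu", "online")) as f:
        online = parse_cpulist(f.read())

    seen = set()
    pattern = os.path.join(sysfs_root, "node", "node[0-9]*", "cpulist")
    for path in glob.glob(pattern):
        with open(path) as f:
            cpus = parse_cpulist(f.read())
        if cpus & seen:
            return False  # the same CPU is claimed by two nodes
        seen |= cpus

    # Valid only if there is at least one online CPU and none is orphaned.
    return bool(online) and online <= seen
```

Passing `sysfs_root` as a parameter is only there so the check can be exercised against a fake directory tree instead of the real `/sys`.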
In ScyllaDB, this will cause starting the service to fail, no?
Maybe it's better to do what we can and return without failure (logging an error), otherwise we leave the machine useless.
> In ScyllaDB, this will cause starting the service to fail, no?
> Maybe it's better to do what we can and return without failure (logging an error), otherwise we leave the machine useless.
I do understand your concern, but I think this should be handled in the scylla_prepare script in the scylla-core repo, not in perftune.py itself.
I think exiting with a non-zero status is the correct behavior for perftune.py itself, since there is no way to continue the tune-up.
Also, I found that when we try to enable perftune.py with `scylla_sysconfig_setup --setup-nic-and-disks --nic eth0`, scylla_sysconfig_setup runs perftune.py to get the cpumask, and that fails when the NUMA topology information is corrupted.
So scylla_sysconfig_setup cannot continue, and it never writes SET_NIC_AND_DISKS=yes to the sysconfig file.
As a result, users most likely cannot enable perftune.py at all; they get an error when they run the setup script.
There is one exception: if the user runs `scylla_sysconfig_setup --set-clocksource`, it enables perftune.py without any error, because that path does not run perftune.py to get the cpumask.
In that case the user may hit the error when they start scylla-server.service.
I guess we should call check_sysfs_numa_topology_is_valid() in scylla_sysconfig_setup too, as this patch does, to avoid enabling perftune when the NUMA topology is not valid.
So I think I need to open another PR on the scylla-core repo that handles the perftune.py error in scylla_prepare and checks the NUMA topology in scylla_sysconfig_setup.
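As a rough sketch of the guard proposed above for scylla_sysconfig_setup (not the actual scylla-core code): the function name `build_sysconfig`, its parameters, and the fallback value `no` are hypothetical. Only the `SET_NIC_AND_DISKS=yes` setting comes from this discussion.

```python
from typing import Callable, Dict


def build_sysconfig(setup_nic_and_disks: bool,
                    topology_is_valid: Callable[[], bool]) -> Dict[str, str]:
    """Decide the sysconfig entry, refusing to enable perftune on bad topology.

    `topology_is_valid` stands in for a check such as
    check_sysfs_numa_topology_is_valid(); it is injected so that every
    code path that enables perftune goes through the same validation.
    """
    enable = setup_nic_and_disks and topology_is_valid()
    # "no" as the disabled value is an assumption made for this sketch.
    return {"SET_NIC_AND_DISKS": "yes" if enable else "no"}
```

The point of the sketch is that the validity check sits in front of the decision itself, so a path that never runs perftune.py to get the cpumask still cannot enable it on a machine with broken NUMA information.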
Let's have a special error code for this, so the caller can understand that this is an expected failure.
Implemented a special error code, and also implemented the scylla-core repository part of the patch:
scylladb/scylladb#26344
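A minimal sketch of the special-error-code idea, under stated assumptions: the exit code value `42`, the exception class `NumaTopologyCorrupted`, and the helpers `run_tuning` and `classify_exit` are hypothetical and are not taken from either patch.

```python
import sys

# Hypothetical value; the code actually chosen by the patch is not shown here.
EXIT_NUMA_TOPOLOGY_CORRUPTED = 42


class NumaTopologyCorrupted(Exception):
    """Raised when the NUMA topology information cannot be trusted."""


def run_tuning(tune):
    """Run `tune()` and map the expected failure to a dedicated exit code."""
    try:
        tune()
    except NumaTopologyCorrupted as e:
        # Readable message on stderr instead of a raw traceback.
        print(f"perftune: {e}", file=sys.stderr)
        return EXIT_NUMA_TOPOLOGY_CORRUPTED
    return 0


def classify_exit(returncode):
    """Caller side: tell an expected failure apart from a real bug."""
    if returncode == 0:
        return "ok"
    if returncode == EXIT_NUMA_TOPOLOGY_CORRUPTED:
        return "expected-failure"  # log an error and keep starting the service
    return "unexpected-failure"
```

With a dedicated code, a caller such as scylla_prepare could log the problem and continue without tuning, while still treating any other non-zero status as a genuine failure.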
Some environments have corrupted NUMA topology information, for example some instance types on AWS EC2 with specific Linux kernel images. In such environments we cannot get correct hardware information from hwloc, so perftune cannot proceed with its optimization. perftune.py should print a readable error message and exit with a non-zero exit code.

Fixes scylladb#2925

Signed-off-by: Takuya ASADA <[email protected]>
Force-pushed from 8f40926 to c2b2b55
Some environments have corrupted NUMA topology information, for example some instance types on AWS EC2 with specific Linux kernel images.
In such environments we cannot get correct hardware information from hwloc, so perftune cannot proceed with its optimization.
perftune.py should print a readable error message and exit with a non-zero exit code.
Fixes #2925