@syuu1228
Contributor

@syuu1228 syuu1228 commented Sep 1, 2025

On EC2, some environments cause perftune.py to fail due to corrupted
NUMA topology information.
Even in such an environment, scylla_image_setup should not fail with a script error.
It should handle the error, print a warning message, and then continue the startup process.

Related scylladb/seastar#2925


This is the scylla-machine-image part of the fix for scylladb/seastar#2925.
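
The "handle the error, warn, and continue" behavior described above could look roughly like the following. This is only a minimal sketch, not the code in this PR: `run_tuning_step` is a hypothetical helper name, and the failing command is a stand-in for a tuning script rather than the real perftune.py.

```python
import logging
import subprocess
import sys

LOGGER = logging.getLogger(__name__)


def run_tuning_step(cmd):
    """Run cmd; on a non-zero exit, log a warning and return False instead of raising.

    Hypothetical helper for illustration only.
    """
    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as e:
        LOGGER.warning('%s exited with code %d, continuing without it',
                       cmd[0], e.returncode)
        return False
    return True


# Stand-in for a failing tuning script (assumption, not the real perftune.py).
failing_cmd = [sys.executable, '-c', 'import sys; sys.exit(1)']
ok = run_tuning_step(failing_cmd)
```

With this shape, the caller decides what to do when the step reports failure, and startup is not interrupted by a traceback.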

@yaronkaikov
Collaborator

@syuu1228 please rebase to latest next


LOGGER = logging.getLogger(__name__)

def disable_perftune():
Collaborator

@syuu1228 don't we need perftune running? What is the impact of disabling it?

@yaronkaikov
Collaborator

@syuu1228 please rebase, it seems you have 2 additional commits that aren't supposed to be here

@syuu1228 syuu1228 force-pushed the scylla_image_setup_avoid_traceback_on_perftune_error branch from 7b3c918 to c780e71 Compare September 8, 2025 16:34
@syuu1228
Contributor Author

syuu1228 commented Sep 8, 2025

Rebased onto latest next

@syuu1228 syuu1228 force-pushed the scylla_image_setup_avoid_traceback_on_perftune_error branch 2 times, most recently from 78ebc60 to 7dba512 Compare September 9, 2025 07:07
@syuu1228 syuu1228 force-pushed the scylla_image_setup_avoid_traceback_on_perftune_error branch 2 times, most recently from adbb50a to ab18dcf Compare September 30, 2025 18:31
@syuu1228
Contributor Author

Added code to handle the special exit code, which was added to scylla_sysconfig_setup in scylladb/scylladb#26344

@syuu1228
Contributor Author

syuu1228 commented Oct 1, 2025

@avikivity Please review this as well, this is the scylla-machine-image part of scylladb/seastar#2956

@yaronkaikov
Collaborator

@avikivity Please review this as well, this is scylla-machine-image part of scylladb/seastar#2956

@avikivity ping review

except subprocess.CalledProcessError as e:
    if e.returncode == 3:
        disable_perftune()
        LOGGER.warning('Failed to enable perftune.py, continue without using it')
Contributor

@yaronkaikov one problem this change may cause is that our tests will miss these failures.
We may need to introduce a specific check that this warning isn't there.
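
A check like the one suggested here might be sketched as follows. It is only an illustration: the warning text is taken from the diff above, but `find_perftune_warnings` is a hypothetical name, and how the test framework collects the scylla_image_setup log lines is an assumption.

```python
# Warning text as it appears in the diff under review.
PERFTUNE_WARNING = 'Failed to enable perftune.py, continue without using it'


def find_perftune_warnings(log_lines):
    """Return every log line that contains the perftune fallback warning."""
    return [line for line in log_lines if PERFTUNE_WARNING in line]


# Hypothetical sample log; a real test would read the image setup log instead.
sample_log = [
    'INFO scylla_image_setup started',
    'WARNING Failed to enable perftune.py, continue without using it',
    'INFO scylla_image_setup finished',
]
hits = find_perftune_warnings(sample_log)
```

A test could then fail whenever `hits` is non-empty on an instance type where perftune.py is expected to work.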

Collaborator

I have talked with @syuu1228 about this. We should probably raise this message only for the known instance types with the issues we know of; otherwise, when we hit a regression, we would never find it

Contributor Author

Implemented the limitation. Now it avoids the backtrace only on im4gn.8xlarge and im4gn.16xlarge, which are the targets of this patch. On other instance types or on other IaaS platforms, it still produces the backtrace.
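
The limitation could be expressed roughly as below. This is a sketch under stated assumptions, not the merged code: the exit code 3 and the two instance types come from this conversation, while `should_tolerate_perftune_failure`, the constant names, and the `'aws'` cloud label are hypothetical.

```python
# Exit code checked in the diff under review.
PERFTUNE_FAILED_EXIT_CODE = 3

# Instance types named in this PR as the targets of the workaround.
KNOWN_AFFECTED_INSTANCE_TYPES = frozenset({'im4gn.8xlarge', 'im4gn.16xlarge'})


def should_tolerate_perftune_failure(returncode, cloud, instance_type):
    """Return True only for the known-bad EC2 instance types and the special exit code.

    Hypothetical helper; the 'aws' label for the cloud is an assumption.
    """
    return (
        returncode == PERFTUNE_FAILED_EXIT_CODE
        and cloud == 'aws'
        and instance_type in KNOWN_AFFECTED_INSTANCE_TYPES
    )
```

In every other case the caller would re-raise the exception, so a regression on another instance type or cloud still surfaces as a backtrace.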

@syuu1228 syuu1228 force-pushed the scylla_image_setup_avoid_traceback_on_perftune_error branch 2 times, most recently from ec7b1dc to 14dd0ae Compare October 20, 2025 16:38
@syuu1228 syuu1228 force-pushed the scylla_image_setup_avoid_traceback_on_perftune_error branch 3 times, most recently from 0feeab4 to 43b1037 Compare October 22, 2025 14:57
@syuu1228
Contributor Author

@gmizrahi Please test this PR, here are the AMI ID and version tag:
us-east-1-x86_64: ami-0a14fe7e33f0e4694
2026.1.0~dev-0.20251022.ab488fbb3f95

@syuu1228
Contributor Author

@yaronkaikov please review updated patch

yaronkaikov previously approved these changes Oct 23, 2025
@yaronkaikov yaronkaikov force-pushed the scylla_image_setup_avoid_traceback_on_perftune_error branch from 43b1037 to 498c324 Compare October 23, 2025 07:45
@yaronkaikov yaronkaikov dismissed their stale review October 23, 2025 09:33

The merge-base changed after approval.

Switch from print() to logging API on scylla_image_setup.
On EC2, some environments cause perftune.py to fail due to corrupted
NUMA topology information.
Even in such an environment, scylla_image_setup should not fail with a script error.
It should handle the error, print a warning message, and then continue the startup process.

Related scylladb/seastar#2925
@syuu1228 syuu1228 force-pushed the scylla_image_setup_avoid_traceback_on_perftune_error branch from 498c324 to ba96671 Compare October 23, 2025 10:10
@syuu1228
Contributor Author

@avikivity Please review this as well, this is the scylla-machine-image part of scylladb/seastar#2956

@avikivity ping

@syuu1228
Contributor Author

The longevity test failed, but the error is a known issue, so we can ignore it (described at #712 (comment)): https://jenkins.scylladb.com/view/master/job/scylla-master/job/releng-testing/job/longevity/job/longevity-100gb-4h-test/110/

@syuu1228
Contributor Author

@yaronkaikov all tests passed, I think we can merge this

@yaronkaikov yaronkaikov merged commit 646ec56 into scylladb:next Oct 23, 2025
1 check passed
5 participants