Starting scylla server and populating keyspace causes "Storage I/O error: 28: No space left on device" and shutdown #22020
Comments
@yarongilor - I'm not sure what the expected behavior was, without increasing the cluster size?
First, @yarongilor, I think the title is not good enough.
@mykaul, the scenario description is rephrased to: run a setup of 90% disk utilization, where the massive write load is stopped once it reaches 90%. So the bottom-line question is why starting the scylla service triggers something like 64GB of additional disk space usage.
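One quick way to see which directory the extra space lands in (a sketch assuming the default ScyllaDB layout under /var/lib/scylla; adjust if data_file_directories or commitlog_directory are customized):

$ du -sh /var/lib/scylla/data       # sstable data; grows while compaction rewrites sstables
$ du -sh /var/lib/scylla/commitlog  # commitlog segments; ~64GB here would point at the commitlog
$ df -h /var/lib/scylla             # overall device utilization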
It's all about populating keyspace1, and specifically the compaction splitting sstables: compaction writes its output sstables before the input sstables are deleted, so disk usage temporarily spikes while it runs.
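To confirm compaction is what is consuming the space after startup, the standard nodetool views should suffice (a sketch; keyspace1 is the keyspace from this run):

$ nodetool compactionstats      # pending/active compactions and bytes compacted vs. total
$ nodetool cfstats keyspace1    # per-table sstable counts and space used (live vs. total)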
A different (but possibly related?) problem was found on another run, during a restart_then_repair nemesis (also a 3-node i4i.large cluster).
Based on the scenario you are describing, it should be reasonably easy to create a non-SCT reproducer for this issue, right?
Not sure it's that easy/available.
What would be the problematic part, limiting storage space? I think we could demonstrate the issue with lower utilization but a visible spike in storage space after restart; see the sketch below.
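A minimal non-SCT reproducer sketch along those lines (assumptions: a local node, cassandra-stress available, and illustrative numbers rather than the exact load from this run):

$ cassandra-stress write n=50000000 -node 127.0.0.1   # fill the disk toward the target utilization
$ df -h /var/lib/scylla                               # record baseline utilization
$ sudo systemctl stop scylla-server; sleep 300        # mimic disrupt_stop_wait_start_scylla_server
$ sudo systemctl start scylla-server
$ watch -n 30 df -h /var/lib/scylla                   # look for the post-start utilization spike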
Could be commitlog replay. Did you shut down the server cleanly? @elcallio, a clean shutdown marks all commitlog segments as free, does it not?
64GB is exactly the amount of space allocated to commitlog on i4i.4xlarge. However, half of it should be free under normal operation (and all of it after a clean shutdown). |
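That should be easy to verify on the affected node (a sketch assuming the default commitlog_directory and config path):

$ du -sh /var/lib/scylla/commitlog                          # actual on-disk commitlog usage
$ ls /var/lib/scylla/commitlog | wc -l                      # number of segment files
$ grep commitlog_total_space_in_mb /etc/scylla/scylla.yaml  # configured commitlog space cap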
The shutdown was basically clean.
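For reference, the stop/start sequence the nemesis performs amounts to the following (a sketch assuming the standard scylla-server systemd unit):

$ sudo systemctl stop scylla-server     # nemesis stops scylla...
$ sleep 300                             # ...waits 5 minutes...
$ sudo systemctl start scylla-server    # ...then starts it back
$ journalctl -u scylla-server -n 50     # check that the shutdown lines look clean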
The issue is reproduced on another "rolling restart" nemesis run (again on a 3-node i4i.large cluster).
Packages
Scylla version: 2024.3.0~dev-20241215.811c9ccb7f91 with build-id cf31bbad95480fbbafa9cb498cf0a54cd58c7485
Kernel Version: 6.8.0-1020-aws
Issue description
Run a setup that drives disk utilization to 90%; the massive write load is stopped once it reaches 90%.
Ran an SCT nemesis of disrupt_stop_wait_start_scylla_server, which stops scylla for 5 minutes and then starts it back. There was no significant load on the cluster at that time: mainly a read load and perhaps 3 writes per second.
So no disk utilization growth is expected at this time. Yet once scylla started, the utilization unexpectedly started increasing as well.
The scylla service is started and the node's disk utilization gradually increases until it reaches 100%, and the start command fails after 10 minutes.
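To tell whether the growth comes from commitlog replay or from compaction output, sampling both directories during the climb would help (a sketch assuming default paths):

$ while true; do date; du -sm /var/lib/scylla/commitlog /var/lib/scylla/data; sleep 30; done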
Log of the node-3 event:
node-3 failure:
Grafana shows the node's disk utilization climbing up to 100%:
SCT failure error event:
Impact
The node is down.
How frequently does it reproduce?
Installation details
Cluster size: 3 nodes (i4i.large)
Scylla Nodes used in this run:
OS / Image: ami-0a3508c8059b5dc39 (aws: undefined_region)
Test: byo-longevity-test-yg2
Test id: 2c23b329-4757-4f06-b60c-fc222590dcf4
Test name: scylla-staging/yarongilor/byo-longevity-test-yg2
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 2c23b329-4757-4f06-b60c-fc222590dcf4
$ hydra investigate show-logs 2c23b329-4757-4f06-b60c-fc222590dcf4
Logs:
Jenkins job URL
Argus