[Tablets] Requests served imbalance after adding nodes to cluster #19107
Comments
Seems like a driver-side load balancing issue. What's the driver's load balancing policy?
Which metric(s) are imbalanced?
Coordinator work is unbalanced. Replica work is balanced. There are writes only. CL doesn't matter, since it's writes only. The test starts with 3 nodes (7 shards on each node, 18 tablets replicated on each shard). Initially, work is perfectly balanced across coordinators. Then, 3 nodes are bootstrapped in parallel. As soon as the bootstrap starts, the balance shatters: one of the 3 original nodes starts handling 80% of requests, while the other two handle 10% each. (I'm only showing the per-instance graph here, because shards within each instance are mostly symmetric, so the per-shard view isn't very interesting in this case.) The coordinators never return to balance after that, even after replica work is eventually balanced perfectly.
Hypothesis:
I don't know the answer to this; the cassandra-stress default is used.
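For context on what the "cassandra-stress default" likely resolves to: in the 3.x Java driver series, the default load balancing policy is a token-aware policy wrapping datacenter-aware round-robin, assuming nothing overrides it. A minimal sketch of constructing that default explicitly (the contact point is borrowed from the commands later in this thread):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class DefaultPolicyExample {
    public static void main(String[] args) {
        // Equivalent of the 3.x driver's default policy: token-aware routing
        // on top of DC-aware round-robin.
        Cluster cluster = Cluster.builder()
                .addContactPoint("172.31.16.15")
                .withLoadBalancingPolicy(
                        new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                .build();
        // ... run the workload ...
        cluster.close();
    }
}
```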
Summoning @piodul
I have no idea about the java driver's implementation of support for tablets. I might be wrong, but AFAIK @Bouncheck implemented it and @Lorak-mmk was reviewing it, so they might have some ideas.
I'm reviewing the implementation in Java Driver 4.x (btw, why isn't c-s using 4.x?); I didn't really look at the 3.x implementation.
cc: @avelanarius, are there other people from your team who could take a look at it?
@soyacz would it be hard to run a similar scenario using cql-stress's c-s mode? The current master of cql-stress uses Rust Driver 0.13, which has tablet awareness.
I can try; if throttling is supported, it should work without much effort (possibly only the prepare stage needs adjusting so as not to overload the cluster). But first it would be good if we updated cql-stress to the newest release with the newest drivers; see scylladb/scylla-cluster-tests#7582
I see in that issue that @fruch managed to build the new version. Is there anything blocking the update?
Trying it.
The test unfortunately failed before growing the cluster, due to an error parsing the cql-stress result. Issue created: scylladb/cql-stress#95
In this case the problem is different. As you can see, this time coordinator work is directly proportional to replica work, which means that this time load balancing works. The imbalance doesn't come from bad load balancing on the client side, but from bad balancing of tablets on the server. cql-stress apparently creates two tables instead of one.
Note that there is still a high rate of cross-shard ops. I said earlier that "shard awareness seems to break down", but I've just realized that this isn't true; it's a server-side matter. Shards only communicate with their siblings on other nodes. With vnodes, a replica set for a given token range is always replicated on a set of sibling shards, so shards can send replica requests to their siblings directly, and there are no cross-shard ops. With tablets, there is no such property: on different nodes, the same tablet will be replicated on shards with different numbers, so cross-shard ops are unavoidable.
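To make the sibling-shard point concrete, here is a toy Java sketch (my own illustration with made-up numbers and a simplified modulo sharding function, not Scylla code): with vnodes the owning shard is a pure function of the token, so it has the same index on every replica node, while with tablets each replica is an explicit (node, shard) pair chosen by the balancer:

```java
import java.util.List;
import java.util.Map;

public class ShardMappingSketch {
    static final int SHARDS_PER_NODE = 7;

    // With vnodes, the owning shard for a token depends only on the token and
    // the shard count (simplified here to a modulo), so it is the SAME shard
    // index on every replica node: shard 3 on node A talks to shard 3 on B and C.
    static int vnodeShardForToken(long token, int shardCount) {
        return (int) Long.remainderUnsigned(token, shardCount);
    }

    // With tablets, each replica is an explicit (node, shard) pair chosen by
    // the balancer, so the same tablet may live on shard 2 on one node and
    // shard 5 on another; the coordinating shard must then forward the
    // request to a different shard, i.e. perform a cross-shard op.
    record Replica(String node, int shard) {}

    public static void main(String[] args) {
        Map<Integer, List<Replica>> tabletReplicas = Map.of(
                42, List.of(new Replica("A", 2), new Replica("B", 5), new Replica("C", 0)));
        System.out.println("tablet 42 replicas: " + tabletReplicas.get(42));
        System.out.println("vnode shard for token 123456: "
                + vnodeShardForToken(123456L, SHARDS_PER_NODE));
    }
}
```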
So, to sum up: with cql-stress, the results look (to me) as expected. So it would appear that the problem is with the java driver. (And the problem is probably just with load balancing; I was wrong earlier about requests being sent to non-replica shards.)
In that case @Bouncheck will be the best person to investigate this
@michoecho regarding the cql-stress and counter table, something doesn't add up.
Correct. What about this doesn't add up? The tablet load balancer doesn't care about traffic, only about the number of tablets.
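As a hedged illustration of what "only the number of tablets" means in practice, here is a toy count-based rebalancing loop (my own sketch, not the real load balancer): it equalizes per-node tablet counts and never consults per-tablet traffic, so a hot table with few tablets contributes little to the balancing decision:

```java
import java.util.HashMap;
import java.util.Map;

public class CountBasedBalanceSketch {
    public static void main(String[] args) {
        // Hypothetical state right after bootstrapping three empty nodes.
        Map<String, Integer> tabletsPerNode = new HashMap<>(Map.of(
                "node1", 40, "node2", 40, "node3", 40,
                "node4", 2, "node5", 2, "node6", 2));

        // Migrate one tablet at a time from the node with the most tablets to
        // the node with the fewest, until counts are (nearly) equal. Traffic
        // per tablet is never consulted.
        while (true) {
            String busiest = tabletsPerNode.entrySet().stream()
                    .max(Map.Entry.comparingByValue()).get().getKey();
            String idlest = tabletsPerNode.entrySet().stream()
                    .min(Map.Entry.comparingByValue()).get().getKey();
            if (tabletsPerNode.get(busiest) - tabletsPerNode.get(idlest) <= 1) break;
            tabletsPerNode.merge(busiest, -1, Integer::sum);
            tabletsPerNode.merge(idlest, 1, Integer::sum);
        }
        System.out.println(tabletsPerNode); // roughly 21 tablets everywhere
    }
}
```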
Then apparently the codebase didn't get the memo, because they aren't rejected.
Issue number?
Are you asking whether there is an existing ticket for this, or are you asking me to create one? Opened #19449.
Either...
Thanks!
I ran the following load (starting on a small 3-node i4i.xlarge cluster):

(1) Populate:

```
cargo run --release --bin cql-stress-cassandra-stress -- write n=100M cl=local_quorum keysize=100 -col n=5 size='FIXED(200)' -mode cql3 -rate throttle=120000/s threads=8 -pop seq=1..100M -node 172.31.16.15
```

(2) Mixed read/write:

```
cargo run --release --bin cql-stress-cassandra-stress -- mixed duration=6h cl=local_quorum keysize=100 'ratio(read=8,write=2)' -col n=5 size='FIXED(200)' -mode cql3 -rate throttle=120000/s threads=32 -pop seq=1..1M -node 172.31.16.15
```

(3) Double the cluster size with the following playbook:
```yaml
---
- name: Double cluster-size
  hosts: double_cluster
  become: true
  tasks:
    - name: Start ScyllaDB Service
      ansible.builtin.systemd_service:
        name: scylla-server.service
        state: started
    - name: Waiting for CQL port readiness
      ansible.builtin.wait_for:
        port: 9042
        host: 127.0.0.1
        connect_timeout: 3
        delay: 3
        sleep: 10
        timeout: 1200
        state: started  # 'started' waits for the port to be open
```

Considering you added enough data in (1), you'll see the problem shortly after you run (3) above, and throughput and clients won't recover until the full tablet migration is complete. As soon as you see the warning/error in the logs, restart the client: you will notice the driver will only route traffic to the contact point you specified on the command line.
They don't forget the whole mapping. When the driver sends a statement to the wrong node, it gets back a payload with the correct tablet for that statement. Then the driver removes from its local mapping the tablets that overlap with the newly received one, and inserts the newly received one.
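A sketch of that update rule (hypothetical class and field names, not the actual java driver code): drop every cached tablet whose token range overlaps the newly received one, then insert it:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

public class TabletMapSketch {
    // One tablet, owning the token range (firstToken, lastToken].
    record Tablet(long firstToken, long lastToken) {}

    // Cached tablets keyed by lastToken, so a tailMap lookup finds all
    // candidates that might overlap a given range.
    private final TreeMap<Long, Tablet> tabletsByLastToken = new TreeMap<>();

    // Called when a response carries a tablet payload for a statement that
    // was sent to the wrong node.
    void addTablet(Tablet incoming) {
        // Every cached tablet with lastToken > incoming.firstToken might
        // overlap; walk them in token order.
        Iterator<Map.Entry<Long, Tablet>> it = tabletsByLastToken
                .tailMap(incoming.firstToken(), false).entrySet().iterator();
        while (it.hasNext()) {
            Tablet t = it.next().getValue();
            if (t.firstToken() >= incoming.lastToken()) break; // past the new range
            it.remove(); // overlaps the incoming tablet: drop the stale entry
        }
        tabletsByLastToken.put(incoming.lastToken(), incoming);
    }
}
```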
So, I re-read the thread again from the beginning, this time carefully... @michoecho already confirmed before that this problem is tablets specific: #19107 (comment), #19107 (comment).
IIUC there are two imbalance-related bugs:
- one in the java driver's implementation of load balancing for tablets (cc @Lorak-mmk)
- and one in Scylla: bad balancing of tablet replicas across nodes (cc @tgrabiec), see #19107 (comment)
^ This sounds like #16824
@dimakr - can you please ensure there's nothing to do here in any of the drivers?
I assume the question is for @dkropachev
I'll prioritize investigating the java driver (3.x) side now
There is definitely a problem on the java-driver 3.x side with imbalanced load after nodes are added.
So, we can close it; it was a misconfiguration on the c-s part that led to this. Does anyone have a reason to keep it open?
Do we have a version of c-s that "fixes" it?
Not now, this change was done on
I've created scylladb/cassandra-stress#32 to switch back to
IIUC the comments in scylladb/scylla-cluster-tests#9296, the fix for c-s solves the imbalance but doesn't improve or fix the latency issue (https://github.com/scylladb/scylla-enterprise/issues/4504#issuecomment-2495859043)
@roydahan This thread doesn't say anything about latency, though. Since this thread is about the coordinator imbalance, which has been addressed, we can probably close it.
Yes, I agree.
@dkropachev is there an ETA for fixing this issue?
It's already fixed with the latest c-s release.
Packages
Scylla version: 6.1.0~dev-20240528.519317dc5833 with build-id 75e8987548653166f5131039236650c1ead746f4
Kernel Version: 5.15.0-1062-aws
Issue description
Test scenario covering the case of scaling out a cluster from 3 nodes to 6, with the new nodes added in parallel.
While the nodes are being added, we can see one node take over most of the cluster load, while the requests served by the rest drop significantly.
Also, after growing, requests served are still not balanced (before the grow, we can see all nodes serving equally).
The test uses c-s with java driver 3.11.5.2, which is tablet aware.
Impact
Degraded performance
How frequently does it reproduce?
Reproduces in all tablet elasticity tests (write, read, mixed)
Installation details
Cluster size: 3 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0a070c0d6ef92b552 (aws: undefined_region)
Test: scylla-master-perf-regression-latency-650gb-grow-shrink
Test id: f417745e-0067-4479-95ee-24c9182267ce
Test name: scylla-staging/lukasz/scylla-master-perf-regression-latency-650gb-grow-shrink
Test config file(s):
Logs and commands
```
$ hydra investigate show-monitor f417745e-0067-4479-95ee-24c9182267ce
$ hydra investigate show-logs f417745e-0067-4479-95ee-24c9182267ce
```
Logs:
Jenkins job URL
Argus