Introduce support for tablets #249
a436581
to
960cb95
Compare
960cb95
to
e1e572a
Compare
71a5c4f
to
ef837ab
Compare
ef837ab
to
94b8dc8
Compare
It looks much better now
94b8dc8
to
70e86d5
Compare
Oops, it was not supposed to be "Approve".
70e86d5
to
882a6f1
Compare
882a6f1
to
2d7890f
Compare
Looks good now. Please add the comment that I requested and I think we can merge this.
2d7890f
to
55ea9bf
Compare
In gocql, @avelanarius said we should wait with merging until the events are available. I don't know if this also applies to python-driver.
cassandra/pool.py
Outdated
tablet = self._session.cluster._load_balancing_policy._cluster_metadata._tablets.get_tablet_for_key(keyspace, table, t)

if tablet is not None:
    shard_id = tablet.replicas[0][1]
Why do you always take the shard of the first replica? This pool may be for a different replica.
But how to know which shard to take? Isn't it redundant? I thought the choice didn't matter so I took the first one, maybe I misunderstood something
Due to the current two-pass implementation (first we determine the node to send the request to, then we determine the shard to send the request to), I think this code has to go through all replicas, find the replica that matches the HostConnection's host, and read its shard number.
@sylwiaszunejko IIUC this function chooses shard-connection to a particular host for a given key. The owning shard of a given key is different on different hosts. It's specified in the <host, shard> tuple in the replica list. So the shard should depend on which host-replica you're going to talk to.
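A minimal sketch of what the two comments above suggest, assuming the replica list holds <host, shard> tuples as described. The `Tablet` tuple, the `shard_for_host` helper, and the string host identifiers are hypothetical names used only for illustration; they are not taken from the PR.

```python
from collections import namedtuple

# Hypothetical stand-in for the driver's tablet object; only `replicas` matters here.
Tablet = namedtuple("Tablet", ["first_token", "last_token", "replicas"])


def shard_for_host(tablet, host_id):
    """Return the shard owning the tablet on `host_id`, or None if that host is not a replica."""
    for replica_host_id, shard in tablet.replicas:
        if replica_host_id == host_id:
            return shard
    return None


# Assumed example data: the same tablet lives on a different shard on each host.
tablet = Tablet(first_token=-100, last_token=100,
                replicas=[("host-a", 3), ("host-b", 0), ("host-c", 5)])

print(shard_for_host(tablet, "host-b"))
print(shard_for_host(tablet, "host-x"))
```

Returning None for a host that is not a replica lets the caller fall back to whatever shard selection it used before tablets.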
Ok, now I get it. I hope it looks good now.
55ea9bf
to
4ba3fc6
Compare
@Lorak-mmk @avelanarius I removed the experimental setting.
ff00514
to
b0a571e
Compare
Left some final minor nits, otherwise looks good to me
In order for Scylla to send the tablet info, the driver must tell the database during the connection handshake that it is able to interpret it. This negotiation is added as a part of the ProtocolFeatures class.
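The negotiation described here could look roughly like the sketch below. `FeatureNegotiation` is a hypothetical, simplified stand-in and not the actual ProtocolFeatures code; the dictionary shapes for the SUPPORTED and STARTUP options and the empty-string value echoed back are assumptions made for illustration.

```python
TABLETS_ROUTING_V1 = "TABLETS_ROUTING_V1"


class FeatureNegotiation:
    """Hypothetical, simplified stand-in for the part of ProtocolFeatures discussed here."""

    def __init__(self, tablets_routing_v1=False):
        self.tablets_routing_v1 = tablets_routing_v1

    @classmethod
    def parse_from_supported(cls, supported):
        # `supported` models the server's SUPPORTED options: option name -> list of values.
        return cls(tablets_routing_v1=TABLETS_ROUTING_V1 in supported)

    def add_startup_options(self, options):
        # Echo the extension back in STARTUP only if the server advertised it.
        # The empty-string value is an assumption of this sketch.
        if self.tablets_routing_v1:
            options[TABLETS_ROUTING_V1] = ""


features = FeatureNegotiation.parse_from_supported(
    {"CQL_VERSION": ["3.0.0"], TABLETS_ROUTING_V1: [""]})
startup = {"CQL_VERSION": "3.0.0"}
features.add_startup_options(startup)
print(TABLETS_ROUTING_V1 in startup)
```

A server that does not advertise the extension leaves the flag unset, so the STARTUP options stay unchanged and the driver never expects tablet info from it.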
b0a571e
to
c2ca988
Compare
c2ca988
to
d160241
Compare
cassandra/cluster.py
Outdated
@@ -1762,6 +1764,8 @@ def connect(self, keyspace=None, wait_for_all_pools=False):
try:
    self.control_connection.connect()

    self.load_balancing_policy.populate(weakref.proxy(self), self.metadata.all_hosts())
Not sure if this code was here before most recent pushes, I don't remember seeing it.
Why do you need to pass weakref? Is the cluster saved somewhere?
Why do we need to call this here? I see that populate is called in other places already, so why another call here?
When we were using the experimental setting, it was known from the beginning that we were using tablets. So, in every place where self.load_balancing_policy.populate() was called, the policy was properly informed to use tablets. But now we have to get the information about using tablets from the connection handshake. Therefore, when we perform populate on the lbp here: https://github.com/scylladb/python-driver/blob/master/cassandra/cluster.py#L1744, we do not know yet whether tablets are being used, so there is a need to call populate again after calling connect() on control_connection.
Ok, but please put a proper explanation in a comment; this code is really not obvious.
@sylwiaszunejko It looks like this line specifically causes the CI failures. Since pretty much all integration tests don't use tablets (and your code shouldn't affect them), I tested progressively removing more and more of your changes, and removing this line fixed the failing tests. I don't yet have an explanation for why that is.
Branch on which I tested: https://github.com/avelanarius/python-driver/commits/tablets_disabled3/, "no populate" fixed the failing tests ("Build and upload to PyPi" are failing for some other reason)
I'll also look into the test failures today and try to debug them.
Some initial investigation (still waiting for some runs to finish): master (without this PR) reliably passes, while the PR rebased on current master fails. Now looking into which part of this PR causes the test failure.
d075881
to
3ffc22c
Compare
@avelanarius I added documentation.
@@ -1775,6 +1777,9 @@ def connect(self, keyspace=None, wait_for_all_pools=False):
self.shutdown()
raise

# Update the information about tablet support after connection handshake.
self.load_balancing_policy._tablets_routing_v1 = self.control_connection._tablets_routing_v1
Isn't it necessary to set _tablets_routing_v1 also on child policies? This is what populate() also did. Maybe the code doesn't work properly if the set policy is some policy that has token aware policy as child - the child policy (token aware policy) won't have _tablets_routing_v1 set?
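One way to picture the concern is the sketch below, which walks a chain of wrapped policies and sets the flag on each of them. The `Policy` and `WrappingPolicy` classes, the `set_tablets_routing` helper, and the assumption that every wrapper exposes its child as `_child_policy` are hypothetical; this is not the driver's actual policy hierarchy.

```python
class Policy:
    """Hypothetical minimal policy; real policies carry much more state."""
    _tablets_routing_v1 = False


class WrappingPolicy(Policy):
    """Hypothetical policy that delegates to a child policy."""

    def __init__(self, child_policy):
        self._child_policy = child_policy


def set_tablets_routing(policy, enabled):
    """Set the flag on `policy` and on every policy nested under it."""
    while policy is not None:
        policy._tablets_routing_v1 = enabled
        # Assumption: wrappers store their child under `_child_policy`.
        policy = getattr(policy, "_child_policy", None)


inner = Policy()
outer = WrappingPolicy(WrappingPolicy(inner))
set_tablets_routing(outer, True)
print(inner._tablets_routing_v1)
```

Setting the attribute only on the outermost policy would leave `inner` with the class default of False, which is the situation the comment above describes.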
That's true. Do you think that creating another function only to populate _tablets_routing_v1 will be a good solution? Or should I investigate more why calling populate was causing issues and return to calling it?
I think it's a good idea to investigate why populate was causing issues.
3ffc22c
to
4a8f8cb
Compare
Add a mechanism to parse system.tablets periodically. In TokenAwarePolicy, check whether the keyspace uses tablets and, if so, try to use them to find replicas. Make shard awareness work when using tablets. Everything is wrapped in an experimental setting, because tablets are still experimental in ScyllaDB and changes in the tablets format are possible.
4a8f8cb
to
eaa9eb1
Compare
This PR introduces changes to the driver that are necessary for shard-awareness and token-awareness to work effectively with tablets.
Now, if the driver sends a request to the wrong node/shard, it will get the correct tablet information from Scylla in custom_payload. It uses this information to obtain target replicas and shard numbers for tables managed by tablet replication.
I also added parsing of the TABLETS_ROUTING_V1 extension to ProtocolFeatures. In order for Scylla to send the tablet info, the driver must tell the database during the connection handshake that it is able to interpret it. This negotiation is added as a part of the ProtocolFeatures class.
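To illustrate how cached tablet information could be used to find target replicas, here is a minimal sketch of a token-to-tablet lookup. The `Tablet` tuple, the `TabletMap` class, and the assumption that each tablet covers the token range (first_token, last_token] are illustrative only; just the `get_tablet_for_key(keyspace, table, token)` call shape is taken from the diff in this conversation.

```python
from bisect import bisect_left
from collections import namedtuple

# Hypothetical tablet record; replicas are (host, shard) tuples.
Tablet = namedtuple("Tablet", ["first_token", "last_token", "replicas"])


class TabletMap:
    """Hypothetical cache of tablets learned from the server, keyed by (keyspace, table)."""

    def __init__(self):
        self._tablets = {}  # (keyspace, table) -> list of Tablet sorted by last_token

    def add_tablet(self, keyspace, table, tablet):
        tablets = self._tablets.setdefault((keyspace, table), [])
        # Drop cached tablets whose range overlaps the newly received one.
        tablets[:] = [t for t in tablets
                      if t.last_token <= tablet.first_token
                      or t.first_token >= tablet.last_token]
        tablets.append(tablet)
        tablets.sort(key=lambda t: t.last_token)

    def get_tablet_for_key(self, keyspace, table, token):
        tablets = self._tablets.get((keyspace, table), [])
        last_tokens = [t.last_token for t in tablets]
        # First tablet whose last_token is >= token is the only candidate.
        i = bisect_left(last_tokens, token)
        if i < len(tablets) and tablets[i].first_token < token:
            return tablets[i]
        return None  # nothing cached for this token yet


tablet_map = TabletMap()
tablet_map.add_tablet("ks", "tbl", Tablet(0, 100, [("host-a", 1)]))
tablet_map.add_tablet("ks", "tbl", Tablet(100, 200, [("host-b", 2)]))
print(tablet_map.get_tablet_for_key("ks", "tbl", 150))
```

A None result means the driver has no tablet cached for that token, so it would have to route the request some other way and learn the tablet from the response.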
To test this change I created integration and unit tests.
Fixes: #281