Skip to content

CNDB-13129: Add index hints#1670

Merged
adelapena merged 2 commits intomainfrom
CNDB-13129-main
Jul 25, 2025
Merged

CNDB-13129: Add index hints#1670
adelapena merged 2 commits intomainfrom
CNDB-13129-main

Conversation

@adelapena
Copy link
Copy Markdown

@adelapena adelapena commented Apr 2, 2025

Add optional index hints to SELECT queries, allowing users to specify preferences about what indexes should be used or skipped. The syntax is:

SELECT ... FROM ... WHERE ... WITH preferred_indexes = {idx1,idx2} AND excluded_indexes={idx3,idx4};

The indexes would do their best to use the preferred indexes, and it's guaranteed that the excluded indexes will indeed be excluded.

The query will fail if any of the mentioned indexes don't exist. However, it will be accepted if some indexes do not apply to the query expressions. The latter is according to this Slack discussion, but it's an easy-to-change thing that we can always revisit.

Besides performance, the hints can also be used to disambiguate queries where a restricted column has multiple indexes that return different results. For example, we can have analyzed and not-analyzed indexes in the same column. This would normally throw an exception in this case, but if we add hints to prefer or exclude one of the indexes the query will work:

CREATE TABLE t(k int PRIMARY KEY, v text);
CREATE CUSTOM INDEX idx1 ON t(v) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX idx2 ON t(v) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': 'standard' };
SELECT * FROM t WHERE v = '...'; # rejected query due to ambiguity
SELECT * FROM t WHERE v = '...' WITH preferred_indexes = {idx1}; # uses the not-analyzed index
SELECT * FROM t WHERE v = '...' WITH preferred_indexes = {idx2}; # uses the analyzed index

Similarly, the hints can be used to access functionality normally shaded by the index. For example, an analyzed index will shade ALLOW FILTERING's full-value equality, but we can use hints to exclude that index and get access to the not-indexed behaviour:

CREATE TABLE t(k int PRIMARY KEY, v text);
CREATE CUSTOM INDEX idx ON t(v) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': 'standard' };
SELECT * FROM t WHERE v = '...' ALLOW FILTERING; uses the analyzed index
SELECT * FROM t WHERE v = '...' ALLOW FILTERING WITH excluded_indexes = {idx}; # uses not-analyzed filtering

The same type of shading happens with [NOT] CONTAINS [KEY].

This also includes an internal messaging version bump, since we need to transport the hints from the coordinator to the replicas.

In terms of SAI's internal query planner, the excluded indexes are always excluded, whereas the preferred indexes won't be pruned by calls to Plan#optimize(). Thus, in a query with two indexes, if one has more selectivity than the other, but the index with less selectivity is preferred, then both indexes will be selected. If one wants to use that preferred index exclusively, then the other indexes can be included in excluded_indexes. I've been tempted to tamper with the selectivity of the preferred indexes so they are selected over other indexes, but I think it's better to keep it this way.

@adelapena adelapena self-assigned this Apr 2, 2025
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 2, 2025

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@adelapena adelapena marked this pull request as draft April 2, 2025 11:38
@adelapena adelapena marked this pull request as ready for review April 11, 2025 16:42
@adelapena adelapena marked this pull request as draft April 11, 2025 16:43
@adelapena adelapena marked this pull request as ready for review April 17, 2025 18:05
@adelapena adelapena force-pushed the CNDB-13129-main branch 3 times, most recently from 23113cd to 5b0c7ab Compare April 30, 2025 14:23
@adelapena adelapena changed the title CNDB-13129: Add index hints (WIP) CNDB-13129: Add index hints May 1, 2025
Comment thread src/java/org/apache/cassandra/db/filter/IndexHints.java Outdated
Comment on lines 139 to 292
Copy link
Copy Markdown

@pkolaczk pkolaczk May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So excluded are strongly excluded, but preferred are just softly preferred.
In this case the naming is ok, but the top-level javadoc is misleading, as it suggests hints are only hints and the system is allowed to pick an excluded index.

Shouldn't we return an excluded index if there are no non-excluded indexes though?
This lack of symmetry feels weird.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that the indexes will be included or excluded if possible, given the queried expressions and the index layout. And included index cannot be used for example if there are no expressions for it, or if we have two preferred indexes in separate index groups. This was done to prevent some validation as discussed on this thread. Index exclusion also follows this criteria, applying if possible, but the difference is that it's always possible to exclude an index by simply filtering.

However, I think it might be better to use full validation and make both things strong, going back to included_indexes and excluded_indexes, and reject queries where it's not possible to satisfy the index hints/directives. That way, we would reject queries where:

  • There are included indexes without expressions for those indexes
  • There are multiple included indexes in different index groups

What do you think?

Regarding the name, index hints, I think it's not that unusual to use this name even if it's more about mandatory selections than clues for the optimizer. For example, MySQL does that for FORCE INDEX and IGNORE INDEX. But perhaps we could rename IndexHints to IndexSelection, or something like that.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like going with strong hints included / excluded. I like predictable :)
I just started reviewing from reading the javadoc at the top of the IndexHints class and it said hints are non-mandatory, so the asymmetry caught my eye.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have gone back to included/excluded, rejecting queries where it's not possible to used the included indexes.

In that Slack thread we were worried about having too complex validation, akin to ALLOW FILTERING. However, it can be done in a quite straightforward way if we simply build the Index.QueryPlan trying to include the requested indexes and fail the query if the plan doesn't end up containing the included indexes.

That can happen if the query doesn't have expressions for the included indexes, or if there is an OR mixing indexed and not-indexed columns, or in general if the index implementation cannot use the indexes for any reason. In that case, we just throw an exception saying It wasn't possible to select the indexes a, b, c, without entering into much detail, the same way we don't give details about why a query needs ALLOW FILTERING.

Copy link
Copy Markdown
Member

@k-rus k-rus May 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is an OR mixing indexed and not-indexed columns

This is good point for providing softly included index hints.

I have gone back to included/excluded, rejecting queries where it's not possible to used the included indexes.

To my memory the agreement was to have softly included indexes. I hope I haven't missed later updates. If it's to change, then it might be good to have another conversion on Slack as this is user affecting functionality, which can be difficult to change later.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion thread here.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this TODO still relevant?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is! I had forgotten about it. Considering hints here can be used to skip vector index column validation if the index is excluded. Creating an index on a vector column shades equality queries with ALLOW FILTERING that would normally work without the index. Index hints should be able to disambiguate this:

CREATE TABLE t (k int PRIMARY KEY, v vector<float, 2>);
SELECT k FROM t WHERE v = [0.1, 0.2] ALLOW FILTERING; -- works
CREATE CUSTOM INDEX idx ON t(v) USING 'StorageAttachedIndex';
SELECT k FROM t WHERE v = [0.1, 0.2] ALLOW FILTERING; -- doesn't work anymore!
SELECT k FROM t WHERE v = [0.1, 0.2] ALLOW FILTERING  WITH excluded_indexes={idx}; -- works again

I've just pushed a fix.

Comment on lines 372 to 360
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one more place where we need to do a similar thing in the Plan class: stripSubplans method.

        {
            if (subplansSupplier.orig.size() <= clauseLimit)
                return this;
            List<Plan.KeysIteration> newSubplans = new ArrayList<>(subplansSupplier.orig.subList(0, clauseLimit));
            return factory.intersection(newSubplans, id).withAccess(access);
        }

I think we should now enhance this logic to keep the subplans with the preferredIndexes before the others.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I wasn't aware of that. I have changed it to include all the included indexes, since now it's guaranteed that they will be used, and then the others if they don't exceed the limit. This means that manually included indexes won't follow the intersection limit, but it could be the other way around, or perhaps we could just fail the query. wdyt?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking it better, it seems that ignoring either the included indexes or the intersection clause limit is problematic:

  • If we ignore the included indexes, we break the promise that included indexes are included no matter what.
  • If we ignore the intersection clause limit, we open a door to silently go around what is currently a system limit, maybe causing system stability issues. If we want to allow users to skip that, probably it's better to have that separate, explicit, WITH intersection_clause_limit = N option. That option can then be guardrailed, so we have a default limit and a maximum manual value.

So I'm adding some validation to reject included_indexes options that would exceed the intersection clause limit.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's about to move the required index up, i.e., reorder applying indexes, so the amount of indexes is within the limit?
If a user hinted an index for the performance reason, imho, it's appropriate to overwrite the optimizer decision. At least this is how relational databases work.
Still I want to pay attention that current agreement was in the direction of the soft hints.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch already does that in Plan.Intersection#stripSubplans. It takes included/preferred indexes up to the limit first, and then it takes the rest of the indexes, if the limit allows it.

Then there is also validation in StoragaAttachedIndex.checkHintsDoesntExceedIntersectionClauseLimit to reject queries where the number of included indexes within an intersection is greater than the intersection clause limit.

Copy link
Copy Markdown

@pkolaczk pkolaczk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, albeit I think there is one minor problem where we might strip a preferred index over a non-preferred index due to the intersection clause limit, because we just keep the first N subplans (they are sorted by selectivity and we don't check the hints there).

There follows a question - should we make an exception in the intersection limit rule for the preferred indexes? If the user includes 3 indexes in the preferred set, but the intersection clause limit is 2, should we pick the best 2 preferred indexes, or ignore the limit and take all 3? The more I think about it, the more I like the idea that preferred_indexes should ignore the limit, because not including an index from that list would be susrprising for the user.

// edit: if we change to included_indexes and the inclusion hint forces the index, then by no doubt, we should not apply the limit to exclude those indexes

And BTW: an idea for a followup feature: could we have WITH intersection_clause_limit = N option? ;)

Copy link
Copy Markdown
Member

@k-rus k-rus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just started to review and I feel it be better that I provide my feedback partially.
In my current portion:

  • I have a question if hints are only applied to equality or for ranges too.
  • Couple of nits on formulations in comments
  • A nit on implementation

Comment thread src/java/org/apache/cassandra/config/CassandraRelevantProperties.java Outdated
Comment thread src/java/org/apache/cassandra/db/ReadCommand.java Outdated
Comment thread src/java/org/apache/cassandra/cql3/Relation.java Outdated
Comment thread src/java/org/apache/cassandra/cql3/Relation.java Outdated
@k-rus k-rus self-requested a review May 14, 2025 18:15
Copy link
Copy Markdown
Member

@k-rus k-rus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to switch to review the tests first, so I have better understanding what the code is implementing. It's nice to see large amount of tests and that it's easy to follow them. I left some nit suggestions for improvements and more tests to include, like plans with index usage for range queries.
As it happened already, I might miss something or misunderstand.
As before it's only partial review. I reviewed 12 files from the 79.

Comment thread test/unit/org/apache/cassandra/index/sai/cql/PlanWithIndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/index/sai/cql/PlanWithIndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/index/sai/cql/PlanWithIndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/index/sai/SAITester.java Outdated
Comment on lines 987 to 1012
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This interestingly demonstrates tautology :) command is used to get the query plan and then command is given to the query plan's method to get searcher. It's definitely outside this PR scope, but I wonder if any improvements can be done there and, thus, follow up issue be created for this.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is definitely room for improvement, but that's a matter for a separate ticket.

Copy link
Copy Markdown
Member

@k-rus k-rus May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be just duplicate of excluding idx1 tests. If needed, it can be put into a loop to remove the duplicate code.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Included in the above changes

Comment thread test/unit/org/apache/cassandra/index/sai/cql/PlanWithIndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/index/sai/cql/PlanWithIndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/index/CustomIndexTest.java Outdated
Comment thread test/unit/org/apache/cassandra/index/CustomIndexTest.java Outdated
@k-rus k-rus self-requested a review May 16, 2025 10:10
Copy link
Copy Markdown
Member

@k-rus k-rus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This time I only reviewed IndexHintsTest.java. It's great that it's easy to follow the tests and undertand the results (I deliberately ignore comments explaining the tests so I am not influenced by comments and don't miss the actual test behaviors). However, there are many redundant tests with symmetric duplicates of queries or different values. I understand that it doesn't hurt to have more tests, but it does hurt to review them and then maintain them. It's good to remove semantically duplicated tests.

On another side tests of range queries are missing for SAI. To my understanding SAI supports range queries. The tests on range queries don't need to be exhaustive.
I bring the range query specifically, since I don't understand why queries with included indexes but restriction not on them are failing instead of using the included indexes on the entire range and filtering afterwards (see my comments).

Does it make sense to have few tests for ANN and BM25 query with hints?

What are the plans with CNDB's PR and will there be integration tests on hints? Can you go through the checklist as CNDB's PR is one of the item on the list?

Comment thread test/unit/org/apache/cassandra/db/filter/IndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/db/filter/IndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/db/filter/IndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/db/filter/IndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/db/filter/IndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/db/filter/IndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/db/filter/IndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/db/filter/IndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/db/filter/IndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/db/filter/IndexHintsTest.java Outdated
@k-rus k-rus self-requested a review May 19, 2025 09:27
Copy link
Copy Markdown
Member

@k-rus k-rus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another portion of my review, mainly nits and minor suggestions. I reviewed 30 files out of 79. As usual I might not understood everything correctly.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it's better to use different name from the above and not start with test. E.g., assertToCQLString.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a duplicate of the same function in SinglePartitionReadCommandCQLTest. Can you share them instead of duplicating the code?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the same function, they parse a different type of command (parseReadCommand vs. parseReadCommandGroup). I can add a superclass with a utility method and an abstract method to be implemented by each subclass if we need to get rid of the similar 3-line methods.

Comment thread src/java/org/apache/cassandra/index/sai/plan/TopKProcessor.java Outdated
Comment thread src/java/org/apache/cassandra/index/sai/plan/Plan.java Outdated
Comment thread src/java/org/apache/cassandra/index/sai/analyzer/AnalyzerEqOperatorSupport.java Outdated
Comment on lines 774 to 776
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While looking to this code, which calculates intersection of selected and included indexes, I thought if it can be calculated once and then used in the expression loop. Calculating the intersection can be done on the initial call to checkHintsDoesntExceedIntersectionClauseLimit or during planning.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

Comment thread src/java/org/apache/cassandra/index/SecondaryIndexManager.java Outdated
Comment thread src/java/org/apache/cassandra/index/SecondaryIndexManager.java Outdated
@k-rus k-rus self-requested a review May 19, 2025 19:08
@adelapena adelapena force-pushed the CNDB-13129-main branch 2 times, most recently from 37eea4d to ebe2092 Compare May 20, 2025 23:49
Copy link
Copy Markdown
Member

@k-rus k-rus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I submit a couple of current comments as I was focused on incident and hadn't continued with review.

Comment thread src/java/org/apache/cassandra/index/IndexRegistry.java Outdated
Comment thread src/java/org/apache/cassandra/index/IndexRegistry.java Outdated
Comment thread test/unit/org/apache/cassandra/db/filter/IndexHintsTest.java Outdated
@k-rus k-rus self-requested a review July 7, 2025 14:32
Copy link
Copy Markdown

@ekaterinadimitrova2 ekaterinadimitrova2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't finished the review, just posting a few small comments.
I had some incident and I wanted just to push what I had before I leave for the doctor

Comment thread test/unit/org/apache/cassandra/cql3/CQLTester.java Outdated
Comment thread test/unit/org/apache/cassandra/cql3/CQLTester.java Outdated
Comment thread test/unit/org/apache/cassandra/cql3/CQLTester.java Outdated
Comment thread test/unit/org/apache/cassandra/db/PartitionRangeReadCommandCQLTest.java Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name suggestion is not a bad idea for readability.

Comment thread src/java/org/apache/cassandra/index/SecondaryIndexManager.java Outdated
Comment thread src/java/org/apache/cassandra/index/SecondaryIndexManager.java Outdated
Comment thread src/java/org/apache/cassandra/index/SecondaryIndexManager.java Outdated
Comment thread src/java/org/apache/cassandra/index/SecondaryIndexManager.java Outdated
k-rus
k-rus previously requested changes Jul 8, 2025
Copy link
Copy Markdown
Member

@k-rus k-rus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that I found a bug in the logic of stripping subplans that optional subplans are not added to the list of stripped subplans. See my comment in Plan.java.

I submit the partial review due to the bug.
Other comments are nits to improve understandability of the code.

Comment thread src/java/org/apache/cassandra/db/filter/IndexHints.java Outdated
Comment thread src/java/org/apache/cassandra/db/filter/IndexHints.java Outdated
Comment thread src/java/org/apache/cassandra/index/Index.java Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow this sentence. Should it be:

Suggested change
* supplied expression. This forms part of the query validation done before a CQL select
* statement is executed.
* supplied expression. This forms a part of the query validation, which is done before executing a CQL select
* statement.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see now it's a copy-paste from the overload. Still it can be nice to improve readability.
Also to avoid repeating information it can be linked to the overload.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want me to rephrase the pre-existing sentence and link it from here? Or the other way around? Also, I don't see what's wrong with the original phrase.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it's good to be more specific instead of just otherwise. Am I correct that the index keeps the original column values?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow. What would you write instead? A negation of the first part of the phrase? The index can either index the value as is, which is the normal behaviour, or transform it before indexing if it's an analyzed index.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I am newbie I don't full understand what is behind transforms, thus it is helpful to be explicit what otherwise means. I guess it will be something like

Suggested change
* @return {@code true} if this index transforms the indexed column values, {@code false} otherwise
* @return {@code true} if this index transforms the indexed column values, {@code false} if it keeps the original value

At the same time I understand that my formulation can confuse that something can fall in between.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I don't mistake priority is noun and not a verb, thus use prioritize.

Comment thread src/java/org/apache/cassandra/index/sai/StorageAttachedIndex.java Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document the method?
Also I prefer scanIndexes as scannedIndex confuses me with indexes, which were scanned.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is not used anymore; I'm removing it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right: adding the keys iteration again to the optional subplans. Shouldn't be?:

Suggested change
optionalSubplans.add(keysIteration);
newSubplans.add(keysIteration);

Can you add a test to catch this bug?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's indeed a bug! I have fixed it and modified PlanWithIndexHintsTest#testIntersectionClauseLimit to exercise this part of the code.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be more clear:

Suggested change
List<Plan.KeysIteration> optionalSubplans = new ArrayList<>(clauseLimit);
List<Plan.KeysIteration> optionalNewSubplans = new ArrayList<>(clauseLimit);

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

@k-rus k-rus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This chunk of nits completes full iteration of my review.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to my IDE the import is not used.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not used

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Comment thread src/java/org/apache/cassandra/index/sai/plan/QueryController.java Outdated
Copy link
Copy Markdown
Member

@k-rus k-rus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit about new formulation

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* - AMBIGUOUS if an index that is not excluded by the hints supports EQ and a different one, also not exluded by
* - AMBIGUOUS if an index that is not excluded by the hints supports EQ and a different one, also not excluded by

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any of the indexes is included by the hints

sounds that if both indexes are included then it's not ambiguous. If I don't mistake, only one should be hinted.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing to if one of the indexes.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public IndexHints parseIndexHints(TableMetadata table, IndexRegistry indexRegistry) throws RequestValidationException
/**
* Parse the Index Hints. This does not validate the hints' values or whether peers will be able to process them.
*
* @return the excluded and/or included indexes as part of these index hints, or {@link IndexHints#NONE} if no hints are present
* @throws InvalidRequestException if the index hints are invalid
*/
public IndexHints parseIndexHints(TableMetadata table, IndexRegistry indexRegistry) throws RequestValidationException

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding it with some changes.

Comment thread src/java/org/apache/cassandra/cql3/statements/SelectStatement.java Outdated
Comment thread src/java/org/apache/cassandra/db/view/View.java Outdated
Comment thread src/java/org/apache/cassandra/index/SecondaryIndexManager.java Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am missing something... this one was not looking for the best selectivity index, but that is what the new method that replaces it in IndexHints does now, too?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old getBestIndexFor assumed that there couldn't be multiple indexes for the same column and operator, besides analyzed and not-analyzed. However, we can have different index implementations for the same column. Like SAI and legacy, for example. So we need to choose between them too, using the selectivity stuff, same as in getBestIndexQueryPlanFor.

As we have discussed before, the duplication between getBestIndexQueryPlanFor and getBestIndexFor is a ticking bomb that we should address sometime as part of the CQL refactor. I think it all stems from needing access to the indexes before building the RowFilter (getBestIndexFor), and needing the RowFilter to build the Index.QueryPlan so we can use getBestIndexQueryPlanFor. This big refactor is something that we should definitely do, but separately.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhh so that was a bug with two weeks duration in this method. Got it. Thanks :-)
I see you also added testing about that in IndexHintsTest. Great!

Copy link
Copy Markdown

@pkolaczk pkolaczk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked reviewing this PR very much, despite it being very big. Thank you!
No rereview needed.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion:
I know this pattern was already here, and you just followed it, but the semantics of this and similar methods are a bit surprising. I mean, it will always do what it is asked to, even when we ask it to do a thing it signals as an error. An attempt to overwrite will overwrite and throw.
As a user of this method I'd expect:

  1. If we don't allow overwriting, it should just throw the error on duplicate entry but not modify the map.
  2. If we allow overwriting, we should not throw.

I guess the original intent was 1. Hence, maybe we should change it to putIfAbsent (here and in other similar methods)?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I've changed it to putIfAbsent. I think in practice it doesn't matter that much what happens to the underlying map because the entire query will fail, but I agree that the method is confusing and even potentially dangerous if things change in the future.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just being paranoid here: Let's make sure n doesn't overflow the short.

Copy link
Copy Markdown
Author

@adelapena adelapena Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added checks in the IndexHints CQL entry point, so we cannot get here with that many hints, and also an assertion here, just in case.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opinion (not actionable now): Honestly I don't like this behavior at all - I can imagine changing the meaning of a query based on existence / non-existence of an index can lead to some nasty surprises. For now index hints provide some workaround, but I hope we eventually change this so that if you want to use the analyzed index, you'd have to either use a special operator (from which it would be clear which index serves it) or include the index name explicitly.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean changing to—if there is ALLOW FILTERING—it means ALLOW FILTERING always, and to use the index—we remove it or anything else?
I think this behavior is like that since ALLOW FILTERING exists (even if I do not like it :( ).
What we will do at some point, when it becomes more of a priority, is to ensure ALLOW FILTERING still works during index build, because that one is quite annoying. On the other hand, from user's perspective, they do not want to have to rewrite queries when they add an index.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this behaviour is not desirable at all. In the long term, we should get rid of indexes that change the behaviour of standard operators and instead use dedicated operators.

Copy link
Copy Markdown

@pkolaczk pkolaczk Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ekaterinadimitrova2

Do you mean changing to—if there is ALLOW FILTERING—it means ALLOW FILTERING always, and to use the index—we remove it or anything else?

No, I didn't mean that.

I mean that IMHO the result should be a function of only the query string and the data in the database, but not of the set of indexes. If I add an index, my queries should still return the same results.
If using the index changes the result of a query, because the index is analyzed ins some special way, then I'd expect this is somehow reflected in the query text, so that the meaning of the query is unambiguous.

This is typically solved by dedicated operators. We already do it with ANN / BM25.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I thought you talked about the ALLOW FILTERING ambiguity. Yeah, I agree about that.

@k-rus k-rus dismissed their stale review July 12, 2025 07:56

The bug was acknowledged and fixed. Thus dismissing the request for changes

Copy link
Copy Markdown

@ekaterinadimitrova2 ekaterinadimitrova2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with the prod code review, looks super neat. About the optimizer - I don't know the code well, but it seems ok to me and I am sure @pkolaczk would have noticed any regressions.

I will be done with the tests review tomorrow. (WIP) It shouldn't take me too long.

Comment thread src/java/org/apache/cassandra/index/sai/StorageAttachedIndex.java Outdated
Comment on lines 178 to 181
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug fix?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I don't understand, what do you mean?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we were doing that only for IndexScan node, now we have also ScoredIndexScan if I did not misread something.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I forgot that this one was already ported from this PR to CNDB-13925, so we are modifying it rather than adding it from scratch. Yes, it was missing to include the ORDER BY indexes, it only included the filtering ones (ScoredIndexScan is not an instance of IndexScan).

Comment thread doc/cql3/CQL.textile Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: <index-names> ?

Copy link
Copy Markdown

@ekaterinadimitrova2 ekaterinadimitrova2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to finish only looking into PlanWithIndexHintsTest tomorrow morning. So far so good!

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assertNonIncludableIndexesError(query + "WITH included_indexes={legacy,literal}"); // literal is selected over idx1
assertNonIncludableIndexesError(query + "WITH included_indexes={legacy,literal}"); // literal is selected over legacy

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assertNonIncludableIndexesError(query + "WITH included_indexes={legacy,literal,analyzed}"); // analyzed is not applicable to =, literal is selected over idx1
assertNonIncludableIndexesError(query + "WITH included_indexes={legacy,literal,analyzed}"); // analyzed is not applicable to =, literal is selected over legacy

Comment thread test/unit/org/apache/cassandra/index/sai/cql/PlanWithIndexHintsTest.java Outdated
Copy link
Copy Markdown

@ekaterinadimitrova2 ekaterinadimitrova2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to finish only looking into PlanWithIndexHintsTest tomorrow morning. So far so good! Some tiny comments and questions

Comment thread test/unit/org/apache/cassandra/index/sai/cql/PlanWithIndexHintsTest.java Outdated
Comment thread test/unit/org/apache/cassandra/index/sai/cql/PlanWithIndexHintsTest.java Outdated
Comment on lines 669 to 671
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before

Comment thread test/unit/org/apache/cassandra/index/sai/cql/PlanWithIndexHintsTest.java Outdated
Comment on lines 715 to 717
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Comment thread test/unit/org/apache/cassandra/index/sai/cql/PlanWithIndexHintsTest.java Outdated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pkolaczk was suggesting we should allow exceeding the intersection clause if indexes are explicitly included. What happened to that idea?
#1670 (review)
I guess this would be for advanced users.
Maybe in the future we can add it behind a guardrail?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I forgot about that discussion. It probably makes sense to exclude hinted indexes from that limit. I have just changed the code accordingly.

Copy link
Copy Markdown

@ekaterinadimitrova2 ekaterinadimitrova2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is solid work! And I second Piotr that it was a pleasure to review. Apologize for the delayed review, I had to de-prioritize it too many times.
Left just one last question - https://github.com/datastax/cassandra/pull/1670/files#r2208045014

Moving to review the other PR, hold on committing this patch until that one is also approved. Thanks!

Copy link
Copy Markdown

@ekaterinadimitrova2 ekaterinadimitrova2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last change seem fine (CI is still running though). Only one question whether we want to emit a warning to users and whether using more hints than the intersection limit may be considered for more advanced users maybe

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if it will be overengineering to have a warning for users in case the query is using more indexes than the intersection allows in general because of the hints?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the intersection limit is just a performance thing to avoid using more indexes than what is practical. Hitting that limit simply changes the way the query is planned. But a user providing hints to specify what indexes to use is already renouncing to query planning, so I think it's fine not to warn about the optimization in planning there have already opted out from by using hints.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but I was thinking that an unexperienced user may add the hints and the intersection limit is there for a reason :D

@ekaterinadimitrova2
Copy link
Copy Markdown

+1 on not enabling the new messaging version for now. CI also seems good.

@adelapena adelapena force-pushed the CNDB-13129-main branch 2 times, most recently from 3ad5562 to 2cf6ba1 Compare July 23, 2025 18:40
Piotr Kołaczkowski and others added 2 commits July 25, 2025 00:12
Adds ability to provide directives into SELECT queries about what indexes should be used.
They consist of a set of indexes that should be used (included) and a set of indexes that
should not be used (excluded). The CQL syntax is:

SELECT ... FROM ...
  WHERE ... WITH included_indexes = { ... }
    AND excluded_indexes = { ... };

It’s guaranteed that the queries will utilize all the included indexes, or fail if it’s not
possible to do so. It will never happen that a query succeeds without using all the included
indexes. Queries might fail because the query doesn't have a restriction for those indexes,
because there is a restriction that could use the index but is not compatible with other
restrictions, or because the underlying index implementation isn't able to use the index
for some reason.

Excluded indexes will never make the query fail, unless they reference a non-existent index.
That's because it’s always possible to exclude an index regardless of the query expressions
and index implementation capabilities. However, excluding indexes might make it necessary
to add `ALLOW FILTERING` to the query.

Indexes that are applicable to the query and that are not mentioned in these two sets of
included and excluded indexes might or might not be used, depending on the index query
planner.
@sonarqubecloud
Copy link
Copy Markdown

@cassci-bot
Copy link
Copy Markdown

✔️ Build ds-cassandra-pr-gate/PR-1670 approved by Butler


Approved by Butler
See build details here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants