Reuse KNNVectorFieldData for reduce disk usage #1571
base: main
@navneet1v Easy test:
My question was how we ensure that KNNSubphase does not run during search and runs only during re-indexing.
@navneet1v Gotcha, I will continue testing with reindex and other scenarios.
Also, there is something called _recovery_source, which is added as a fallback to support re-indexing. If you are testing locally, I would recommend removing these lines of code to ensure that the recovery source is never created. The recovery source gets deleted after some time if indexing is happening continuously, but I have never tested this to confirm whether it really happens.
src/main/java/org/opensearch/knn/index/KNNVectorDVLeafFieldData.java
@navneet1v I tested search source and reindex scenarios with
src/main/java/org/opensearch/knn/index/fetch/KNNFetchSubPhase.java
@jmazanec15 As I see it, we are also trying to reconstruct the vector format, but I am not sure which approach is better. What do you think?
@navneet1v I added integration tests for the search and reindex scenarios. I think it works with
Can we test the update use case too? Ref: #1694
At 469fbf3 I added the nested field processing logic, along with tests for the update, search, reindex, and nested scenarios.
@navneet1v I also added update tests. They show that it works only when updating the vector field, as in the following tests:
But when the field is excluded from _source and an update operation is performed, it goes to this logic: it uses the Lucene source to do the update. As far as I know, the original reference has a warning about what we cannot use when the source is excluded. There are 2 scenarios:
The #1 scenario will work if you exclude the vector field from _source and then update only the vector field. I think this would work even without this capability, because we can gather the other fields from _source. What I was really interested in testing was #2, and it seems that it is not possible.
@navneet1v I added tests at
As for the 2nd scenario, I think it fails because of where we synthesize the source, whereas the update logic in InternalEngine#L1311-L1318 performs its own fetch.
Yes, I thought so.
0383372 to 72081bb
@navneet1v @jmazanec15 I resolved all review comments, and all tests pass.
AFAIK, the plugin's fetch subphases will run after the OpenSearch core engine's fetch subphases. In the current OpenSearch implementation, fetch subphases have no concept of ordering. Please let me know if I have misunderstood.
@bugmakerrrrrr
The fetch subphases run in a fixed order. First, all the fetch subphases defined in OpenSearch core run, and then all the plugins' subphases run. The only exception is the inner hits fetch phase, which always runs at the end. Ref: https://github.com/opensearch-project/OpenSearch/blob/52b27f47bca5b3ab52cab237542f32c307d203b4/server/src/main/java/org/opensearch/search/fetch/FetchPhase.java#L104-L107 The order of these phases cannot be changed.
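The ordering described in this comment can be sketched with plain strings. This is only an illustration under stated assumptions: the class FetchSubPhaseOrderSketch and its order method are hypothetical names, not OpenSearch APIs, and the sketch models nothing beyond "core first, plugins next, inner hits last".

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: models the described ordering with plain strings (hypothetical helper). */
final class FetchSubPhaseOrderSketch {

    /**
     * Core subphases keep their order, plugin subphases follow,
     * and the inner hits phase is always appended last.
     */
    static List<String> order(List<String> corePhases, List<String> pluginPhases, String innerHitsPhase) {
        List<String> ordered = new ArrayList<>(corePhases);
        ordered.addAll(pluginPhases);
        ordered.add(innerHitsPhase);
        return ordered;
    }
}
```

Under this model a plugin subphase such as KNNFetchSubPhase always sees the hit after the core source-fetching subphase has already processed it, which is why source filtering done in core cannot account for fields a plugin adds later.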
@navneet1v As I see it, the logic for
But
@navneet1v Indeed, this is the key point I want to emphasize, and it is precisely why I suggest incorporating the filter logic you mentioned in your comment into KNNFetchSubPhase. Otherwise, it will cause conflicts at the API level (I asked to exclude certain fields from the response, but they still appeared in it). Or, if the filter logic is too complex to implement, we can treat it as a limitation and clearly call it out in the documentation.
LGTM, I like it.
@luyuncheng and @bugmakerrrrrr agreed.
@navneet1v @luyuncheng I've checked the filter logic in FetchSourcePhase, and I think that it's too complicated to implement in this subphase.
@luyuncheng can we fix up the comments so that we can merge this change?
Signed-off-by: luyuncheng <[email protected]>
@navneet1v Fixed at 2a61fcd.
Thanks @luyuncheng. I have been reviewing, and I think it looks good overall. I'm still not confident about the nested portion. Also, can we capture a list of known limitations in the issue, somewhere we can refer to when writing the documentation on what can and cannot be done with this feature? When testing, here is what I have:
Also, do we know if it works with partially constructed non-nested documents? Are there any other limitations for the non-nested case? The functionality that will work:
Description
In some scenarios, we want to reduce the disk usage and I/O throughput for the source field. To do so, we would exclude the knn fields in the mapping so that they are not stored in the source (this means the knn field cannot be retrieved or rebuilt). So I propose using the doc_values field for the vector fields, like:
Proposal
KNNVectorDVLeafFieldData gets data from doc values
I rewrote KNNVectorDVLeafFieldData so that KNN80BinaryDocValues can return the specific knn docvalue_fields, like: (vector_field1 is a knn field type)
Optimization result:
1M SIFT dataset, 1 shard:
with source stored: 1389 MB
without source stored: 1055 MB (-24%)
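The -24% figure follows directly from the two reported sizes. A minimal sketch of the arithmetic, where DiskSavingsSketch and percentChange are hypothetical names used only for illustration:

```java
/** Illustrative only: reproduces the reported size reduction from the two measurements. */
final class DiskSavingsSketch {

    /** Relative change in percent; a negative result means the index got smaller. */
    static double percentChange(double beforeMb, double afterMb) {
        return (afterMb - beforeMb) / beforeMb * 100.0;
    }
}
```

With the numbers above, percentChange(1389, 1055) is (1055 - 1389) / 1389 * 100, which rounds to -24.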
Diving further into the knn doc values fields, I think that when using the faiss engine we can use the reconstruct_n interface to retrieve the specific doc values and save the disk usage of BinaryDocValuesFormat, or, as in this issue's comments, redesign a KnnVectorsFormat.
I added KNNFetchSubPhase with a processor like FetchSourcePhase#FetchSubPhaseProcessor to combine the docvalue_fields into _source, something like synthetic logic.
Issues Resolved
#1087
#1572
KNNVectorDVLeafFieldData can return the vectorDocValue fields, as scripts do.
_source is combined with the docValues fields response from the 1st step. This works something like synthetic source, but the values must be added explicitly from the search body, like docvalue_fields.
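The idea of combining a doc-values vector into _source can be sketched as follows. This is a minimal illustration, not the PR's KNNFetchSubPhase implementation: SyntheticVectorSourceSketch and mergeVector are hypothetical names, the source is modeled as a flat map, and nested fields and source filtering are deliberately left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a flat-map model of adding a doc-values vector to the source. */
final class SyntheticVectorSourceSketch {

    /**
     * Returns a copy of the source map with the vector read from doc values
     * added under fieldName, unless the source already carries that field.
     */
    static Map<String, Object> mergeVector(Map<String, Object> source, String fieldName, float[] docValueVector) {
        Map<String, Object> merged = new LinkedHashMap<>(source);
        if (!merged.containsKey(fieldName)) {
            // Box the values so the entry looks like a JSON array of numbers.
            List<Float> values = new ArrayList<>(docValueVector.length);
            for (float v : docValueVector) {
                values.add(v);
            }
            merged.put(fieldName, values);
        }
        return merged;
    }
}
```

The input map is copied rather than mutated, so the stored source stays untouched and only the response carries the vector.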
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.