Adding multiple languages to PgvectorDocumentStore for multilingual keyword search #924
Comments
@CharlesCousyn I see this feature as a valuable addition and would like to ask a few clarifying questions:
Are we expecting users to know the language of the text at the time it is inserted into the document store? If so, we could set up an index where the language configuration is specified by another column (e.g., language), as demonstrated in the documentation:
    CREATE INDEX keyword_index ON documents USING GIN (to_tsvector(language, content));
However, I assume this might not be the desired outcome, as it can be difficult or even impossible to determine the language of the text before insertion. In this case, we could create an index for each desired language, allowing the user to specify the language at query time. This could result in a query like:
    WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'the search query string');
This approach works but poses the risk of false matches, since there's no guarantee that the language of the content matches the language specified by the user. All text in the content field will be processed as if it's in the user-specified language (e.g., English in the example above), potentially leading to false positive matches.
If needed, users could mitigate this risk by using the DocumentLanguageClassifier to store the language of the content, and then applying a metadata filter in the query pipeline to ensure alignment between the text's actual language and the query language.
I'm happy to open a PR if you agree.
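To make the second option (one index per language, language chosen at query time) a bit more concrete, here is a rough Python sketch that only builds the SQL strings. The helper names, the `documents` table, the `content` column, and the list of supported languages are all assumptions for illustration; this is not how PgvectorDocumentStore does it internally.

```python
# Assumption: these are the PostgreSQL text search configurations we want to support.
SUPPORTED_LANGUAGES = ("english", "french")


def _check_language(language: str) -> str:
    """Reject anything outside the allow-list before it is placed in SQL text."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    return language


def create_index_sql(language: str, table: str = "documents") -> str:
    """Build the statement for one GIN keyword index dedicated to `language`."""
    lang = _check_language(language)
    return (
        f"CREATE INDEX IF NOT EXISTS keyword_index_{lang} ON {table} "
        f"USING GIN (to_tsvector('{lang}', content));"
    )


def keyword_query_sql(language: str, table: str = "documents") -> str:
    """Build a keyword query whose expression matches the index for `language`.

    The search string itself stays a driver placeholder (%s).
    """
    lang = _check_language(language)
    return (
        f"SELECT * FROM {table} "
        f"WHERE to_tsvector('{lang}', content) @@ plainto_tsquery('{lang}', %s);"
    )
```

The language is written into the statement as a literal (after the allow-list check) rather than passed as a bind parameter, because PostgreSQL only uses an expression index when the query expression matches the indexed one. The user's search string is still passed separately by the database driver.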
I think it should be expected that the users know the language of the text at the time it is inserted into the document store!
Thanks for the follow-up! @CharlesCousyn I had to put a little bit of thought into this. After further consideration, I'm leaning towards allowing users to create multiple document stores, one per language. This approach offers flexibility for various scenarios.
If we were to use a single document store, the solutions that come to mind are the ones I mentioned earlier. The first in particular would make the most sense if the language is known ahead of insertion:
    CREATE INDEX keyword_index ON documents USING GIN (to_tsvector(language, content));
where the field language holds the language of each document.
The main drawback of using multiple document stores would be the need for separate keyword extractors for each language. However, this seems like a reasonable trade-off for the flexibility and accuracy it provides.
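As a rough illustration of the multiple-stores idea, the sketch below keeps one set of settings per language and picks the right one when the document's language is known. `StoreSettings` and both helper functions are hypothetical names. The three fields are meant to correspond to the `table_name`, `language`, and `keyword_index_name` arguments that I believe PgvectorDocumentStore accepts, so please check them against your installed version before relying on this.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class StoreSettings:
    """Hypothetical container for the per-language settings of one document store."""
    table_name: str
    language: str
    keyword_index_name: str


def settings_per_language(languages: Iterable[str]) -> Dict[str, StoreSettings]:
    """Give every language its own table and its own keyword index name."""
    return {
        lang: StoreSettings(
            table_name=f"documents_{lang}",
            language=lang,
            keyword_index_name=f"keyword_index_{lang}",
        )
        for lang in languages
    }


def pick_settings(settings: Dict[str, StoreSettings], language: str) -> StoreSettings:
    """Route a document or query to the settings for its language."""
    try:
        return settings[language]
    except KeyError:
        raise ValueError(f"no document store configured for {language!r}") from None
```

Each `StoreSettings` entry would then be used to build one document store and one keyword retriever, so documents and queries for a language always go through the same pair.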
👋 In general, I tend to agree with the solution proposed by @kanenorman, which is simple and requires no changes (I think).
Is your feature request related to a problem? Please describe.
In our application, we use PgvectorDocumentStore to store a lot of documents. Our documents can be in English, French, etc. The problem is that we recently started using the PgvectorKeywordRetriever to do some hybrid search, and we just noticed that the document store can only have one language (English by default).
Describe the solution you'd like
I suggest that the parameter language: str become languages: List[str], which would define all the languages to use. It would also be necessary to modify PgvectorKeywordRetriever.run to accept a new parameter, language. This would allow using a specific keyword_index per query.
Describe alternatives you've considered
An alternative solution would be to have multiple document stores, one for each supported language. It could work, but it would create multiple connections at the same time, I think.