
Adding multiple languages to PgvectorDocumentStore for multilingual keyword search #924

Open
CharlesCousyn opened this issue Jul 24, 2024 · 4 comments

Comments

@CharlesCousyn

CharlesCousyn commented Jul 24, 2024

Is your feature request related to a problem? Please describe.
In our application, we use PgvectorDocumentStore to store a large number of documents. Our documents can be in English, French, etc.
The problem is that we recently started using the PgvectorKeywordRetriever for hybrid search, and we noticed that the document store can only have one language (English by default).

Describe the solution you'd like
I suggest that the parameter language: str become languages: List[str], which would define all the languages to use.
This would also require modifying PgvectorKeywordRetriever.run to accept a new language parameter, allowing a specific keyword_index to be used per query.
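To make the proposal concrete, here is a minimal, hypothetical sketch of how a languages: List[str] parameter could translate into one GIN keyword index per language. The haystack_keyword_index_&lt;lang&gt; naming and the documents table are my assumptions for illustration, not the actual pgvector-haystack schema.

```python
from typing import List


def keyword_index_statements(table: str, languages: List[str]) -> List[str]:
    """Generate one CREATE INDEX statement per configured language.

    Illustrative only: index and table names are assumptions, and the
    language strings must be valid PostgreSQL text-search configurations.
    """
    return [
        f"CREATE INDEX haystack_keyword_index_{lang} ON {table} "
        f"USING GIN (to_tsvector('{lang}', content))"
        for lang in languages
    ]


# Example: two languages yield two language-specific indexes.
statements = keyword_index_statements("documents", ["english", "french"])
for stmt in statements:
    print(stmt)
```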

Describe alternatives you've considered
An alternative solution would be to have multiple document stores, one for each supported language. It could work, but I think it would create multiple connections at the same time.

@anakin87 anakin87 transferred this issue from deepset-ai/haystack Jul 25, 2024
@anakin87 anakin87 added feature request Ideas to improve an integration integration:pgvector labels Jul 25, 2024
@kanenorman
Contributor

kanenorman commented Sep 25, 2024

@CharlesCousyn I see this feature as a valuable addition and would like to ask a few clarifying questions:

Are we expecting users to know the language of the text at the time it is inserted into the document store? If so, we could set up an index where the language configuration is specified by another column (e.g., language), as demonstrated in the documentation:

CREATE INDEX keyword_index ON documents USING GIN (to_tsvector(language, content));

However, I assume this might not be the desired outcome, as it can be difficult or even impossible to determine the language of the text before insertion.

In this case, we could create an index for each desired language, allowing the user to specify the language at query time. This could result in a query like:

WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'the search query string');

This approach works but poses the risk of false matches since there's no guarantee that the language of the content matches the language specified by the user. All text in the content field will be processed as if it's in the user-specified language (e.g., English in the example above), potentially leading to false positive matches.
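As a hedged sketch of this second approach, a query builder might take the language at query time and interpolate it into the tsvector/tsquery calls. The documents table, the allow-list, and the helper itself are illustrative assumptions; the %s placeholder follows psycopg parameter-binding conventions, and the language name is checked against an allow-list because text-search configuration names cannot be bound as query parameters.

```python
# Illustrative allow-list of PostgreSQL text-search configurations we
# assume the user has created indexes for (see CREATE INDEX above).
ALLOWED_LANGUAGES = {"english", "french", "german"}


def keyword_query(language: str) -> str:
    """Build a language-parameterized keyword-search query string.

    The search string itself is left as a %s placeholder to be bound
    by the driver; only the validated language name is interpolated.
    """
    if language not in ALLOWED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    return (
        "SELECT * FROM documents "
        f"WHERE to_tsvector('{language}', content) "
        f"@@ plainto_tsquery('{language}', %s)"
    )


# Example: the generated SQL mirrors the WHERE clause quoted above.
print(keyword_query("english"))
```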

If needed, users could mitigate this risk by using the DocumentLanguageClassifier to store the language of the content, and then applying a metadata filter in the query pipeline to ensure alignment between the text's actual language and the query language.
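As an illustration of that mitigation, if the classifier's detected language were stored in document metadata, the alignment filter might look like the following. The meta.language field name is an assumption, and the dict follows Haystack 2.x's comparison-filter format.

```python
def language_filter(language: str) -> dict:
    """Build a metadata filter pinning results to one content language.

    Assumes each document carries a "language" key in its meta, e.g.
    written there by DocumentLanguageClassifier upstream in the pipeline.
    """
    return {"field": "meta.language", "operator": "==", "value": language}


# Example: pass this alongside the query so the keyword search only
# matches documents whose stored language equals the query language.
print(language_filter("french"))
```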

I'm happy to open a PR if you agree.

@CharlesCousyn
Author

I think it should be expected that users know the language of the text at the time it is inserted into the document store!
The decision of which language to use should be in the hands of those who know the text best, I think.
A PR is a good idea!

@kanenorman
Contributor

Thanks for the follow-up! @CharlesCousyn

I had to put a little bit of thought into this. After further consideration, I'm leaning towards allowing users to create multiple document stores based on each language. This approach offers flexibility for various scenarios:

  1. When the language is known at insertion time:
    Users can simply insert the document into the correct language-specific store.

  2. When the language is unknown at insertion time:
    Users can employ the DocumentLanguageClassifier as part of their pipeline, as demonstrated in this example: DocumentLanguageClassifier in a Pipeline

If we were to use a single document store, the solutions that come to mind are the ones I mentioned earlier. In particular, the first would make the most sense if the language is known ahead of insertion.

CREATE INDEX keyword_index ON documents USING GIN (to_tsvector(language, content));

Where the field language contains valid configurations. I'm not particularly fond of this approach as it would require a schema change and be more restrictive when the document's language is unknown at insertion time.

The main drawback of using multiple document stores would be the need for separate keyword extractors for each language. However, this seems like a reasonable trade-off for the flexibility and accuracy it provides.
Do you agree with this direction? I'm open to further discussion or clarification if needed. Perhaps it would be good to get feedback from the core-team before going further.
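To show what the multi-store direction implies in code, here is a minimal sketch of routing documents to one store per language. LanguageRouter is a hypothetical helper of my own, and in practice the stores dict would hold one PgvectorDocumentStore per language, each configured with that language's keyword index.

```python
from typing import Dict


class LanguageRouter:
    """Route documents to a language-specific document store.

    stores maps a language name to a store object, e.g.
    {"english": store_en, "french": store_fr}. Store objects are
    opaque here; any PgvectorDocumentStore-like object would do.
    """

    def __init__(self, stores: Dict[str, object]):
        self.stores = stores

    def store_for(self, language: str) -> object:
        try:
            return self.stores[language]
        except KeyError:
            raise ValueError(f"no document store configured for {language}")


# Example: route a classified document to its language's store.
router = LanguageRouter({"english": "en_store", "french": "fr_store"})
print(router.store_for("french"))
```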

@anakin87
Member

👋

In general, I tend to agree with the solution proposed by @kanenorman, which is simple and requires no code changes (I think).
It also seems consistent with the Haystack idea of a Document Store, where each one corresponds to a collection.
