
Adding multiple languages to PgvectorDocumentStore for multilingual keyword search #924

Open
CharlesCousyn opened this issue Jul 24, 2024 · 4 comments

Comments

@CharlesCousyn

CharlesCousyn commented Jul 24, 2024

Is your feature request related to a problem? Please describe.
In our application, we use PgvectorDocumentStore to store a large number of documents. Our documents can be in English, French, etc.
The problem is that we recently started using the PgvectorKeywordRetriever for hybrid search, and we noticed that the document store can only have one language (English by default).

Describe the solution you'd like
I suggest that the parameter language: str become languages: List[str], which would define all the languages to use.
This would also require modifying PgvectorKeywordRetriever.run to accept a new language parameter, allowing a specific keyword_index to be used per query.
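To make the proposal concrete, here is a minimal, hypothetical sketch of how a languages: List[str] parameter could translate into one GIN keyword index per language. The haystack_keyword_index_&lt;lang&gt; naming and the documents table are my assumptions for illustration, not the actual pgvector-haystack schema.

```python
from typing import List


def keyword_index_statements(table: str, languages: List[str]) -> List[str]:
    """Generate one CREATE INDEX statement per configured language.

    Illustrative only: index and table names are assumptions, and the
    language strings must be valid PostgreSQL text-search configurations.
    """
    return [
        f"CREATE INDEX haystack_keyword_index_{lang} ON {table} "
        f"USING GIN (to_tsvector('{lang}', content))"
        for lang in languages
    ]


# Example: two languages yield two language-specific indexes.
statements = keyword_index_statements("documents", ["english", "french"])
for stmt in statements:
    print(stmt)
```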

Describe alternatives you've considered
An alternative solution would be to have multiple document stores, one for each supported language. It could work, but I think it would create multiple connections at the same time.

@anakin87 anakin87 transferred this issue from deepset-ai/haystack Jul 25, 2024
@anakin87 anakin87 added feature request Ideas to improve an integration integration:pgvector labels Jul 25, 2024
@kanenorman
Contributor

kanenorman commented Sep 25, 2024

@CharlesCousyn I see this feature as a valuable addition and would like to ask a few clarifying questions:

Are we expecting users to know the language of the text at the time it is inserted into the document store? If so, we could set up an index where the language configuration is specified by another column (e.g., language), as demonstrated in the documentation:

CREATE INDEX keyword_index ON documents USING GIN (to_tsvector(language, content));

However, I assume this might not be the desired outcome, as it can be difficult or even impossible to determine the language of the text before insertion.

In this case, we could create an index for each desired language, allowing the user to specify the language at query time. This could result in a query like:

WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'the search query string');

This approach works but poses the risk of false matches since there's no guarantee that the language of the content matches the language specified by the user. All text in the content field will be processed as if it's in the user-specified language (e.g., English in the example above), potentially leading to false positive matches.
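As a hedged sketch of this second approach, a query builder might take the language at query time and interpolate it into the tsvector/tsquery calls. The documents table, the allow-list, and the helper itself are illustrative assumptions; the %s placeholder follows psycopg parameter-binding conventions, and the language name is checked against an allow-list because text-search configuration names cannot be bound as query parameters.

```python
# Illustrative allow-list of PostgreSQL text-search configurations we
# assume the user has created indexes for (see CREATE INDEX above).
ALLOWED_LANGUAGES = {"english", "french", "german"}


def keyword_query(language: str) -> str:
    """Build a language-parameterized keyword-search query string.

    The search string itself is left as a %s placeholder to be bound
    by the driver; only the validated language name is interpolated.
    """
    if language not in ALLOWED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    return (
        "SELECT * FROM documents "
        f"WHERE to_tsvector('{language}', content) "
        f"@@ plainto_tsquery('{language}', %s)"
    )


# Example: the generated SQL mirrors the WHERE clause quoted above.
print(keyword_query("english"))
```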

If needed, users could mitigate this risk by using the DocumentLanguageClassifier to store the language of the content, and then applying a metadata filter in the query pipeline to ensure alignment between the text's actual language and the query language.
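As an illustration of that mitigation, if the classifier's detected language were stored in document metadata, the alignment filter might look like the following. The meta.language field name is an assumption, and the dict follows Haystack 2.x's comparison-filter format.

```python
def language_filter(language: str) -> dict:
    """Build a metadata filter pinning results to one content language.

    Assumes each document carries a "language" key in its meta, e.g.
    written there by DocumentLanguageClassifier upstream in the pipeline.
    """
    return {"field": "meta.language", "operator": "==", "value": language}


# Example: pass this alongside the query so the keyword search only
# matches documents whose stored language equals the query language.
print(language_filter("french"))
```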

I'm happy to open a PR if you agree.

@CharlesCousyn
Author

I think it should be expected that users know the language of the text at the time it is inserted into the document store!
The decision of which language to use should be in the hands of those who know the text best, I think.
A PR is a good idea!

@kanenorman
Contributor

Thanks for the follow-up! @CharlesCousyn

I had to put a little bit of thought into this. After further consideration, I'm leaning towards allowing users to create multiple document stores based on each language. This approach offers flexibility for various scenarios:

  1. When the language is known at insertion time:
    Users can simply insert the document into the correct language-specific store.

  2. When the language is unknown at insertion time:
    Users can employ the DocumentLanguageClassifier as part of their pipeline, as demonstrated in this example: DocumentLanguageClassifier in a Pipeline

If we were to use a single document store, the solutions that come to mind are the ones I mentioned earlier. In particular, the first would make the most sense if the language is known ahead of insertion.

CREATE INDEX keyword_index ON documents USING GIN (to_tsvector(language, content));

Where the field language contains valid configurations. I'm not particularly fond of this approach as it would require a schema change and be more restrictive when the document's language is unknown at insertion time.

The main drawback of using multiple document stores would be the need for separate keyword extractors for each language. However, this seems like a reasonable trade-off for the flexibility and accuracy it provides.
Do you agree with this direction? I'm open to further discussion or clarification if needed. Perhaps it would be good to get feedback from the core-team before going further.
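To show what the multi-store direction implies in code, here is a minimal sketch of routing documents to one store per language. LanguageRouter is a hypothetical helper of my own, and in practice the stores dict would hold one PgvectorDocumentStore per language, each configured with that language's keyword index.

```python
from typing import Dict


class LanguageRouter:
    """Route documents to a language-specific document store.

    stores maps a language name to a store object, e.g.
    {"english": store_en, "french": store_fr}. Store objects are
    opaque here; any PgvectorDocumentStore-like object would do.
    """

    def __init__(self, stores: Dict[str, object]):
        self.stores = stores

    def store_for(self, language: str) -> object:
        try:
            return self.stores[language]
        except KeyError:
            raise ValueError(f"no document store configured for {language}")


# Example: route a classified document to its language's store.
router = LanguageRouter({"english": "en_store", "french": "fr_store"})
print(router.store_for("french"))
```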

@anakin87
Member

👋

In general, I tend to agree with the solution proposed by @kanenorman, which is simple and requires no code changes (I think).
It also seems consistent with the Haystack idea of a Document Store, where each one corresponds to a collection.
