@leslieluyu
Contributor

Description

To improve performance on large datasets, enable HNSW support in dataprep and the retriever.

  • In dataprep

    1. Add the vector_schema parameter to the from_texts_return_keys call
    2. Add a VECTOR_SCHEMA environment variable to make it easy to switch the index algorithm
  • In retriever

    1. Add ENABLE_SCHEMA to config.py
    2. Add logic to use index_schema=INDEX_SCHEMA
    3. Add redis_schema_hnsw.yml to enable HNSW
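
The switch described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the helper names `build_vector_schema` and `retriever_index_kwargs` are hypothetical, the HNSW tuning values are placeholders, the FLAT default is assumed from the v1.3 comparison baseline, and ENABLE_SCHEMA is assumed to be a boolean-like flag.

```python
import os

# Illustrative HNSW tuning values (assumption): the values this PR uses
# are not shown in the conversation.
HNSW_SCHEMA = {"algorithm": "HNSW", "m": 16, "ef_construction": 200, "ef_runtime": 10}
FLAT_SCHEMA = {"algorithm": "FLAT"}


def build_vector_schema(env=None):
    """Return a vector_schema dict chosen by the VECTOR_SCHEMA variable.

    Hypothetical helper: defaults to FLAT when the variable is unset.
    """
    env = os.environ if env is None else env
    choice = env.get("VECTOR_SCHEMA", "FLAT").strip().upper()
    schemas = {"HNSW": HNSW_SCHEMA, "FLAT": FLAT_SCHEMA}
    if choice not in schemas:
        raise ValueError(f"Unsupported VECTOR_SCHEMA: {choice!r}")
    return dict(schemas[choice])  # copy so callers cannot mutate the defaults


def retriever_index_kwargs(enable_schema, index_schema):
    """Return extra keyword arguments for the retriever's vector store.

    Hypothetical helper: index_schema is only passed when the flag is set.
    """
    return {"index_schema": index_schema} if enable_schema else {}
```

On the dataprep side, the resulting dict would presumably be passed as `vector_schema=build_vector_schema()` to `from_texts_return_keys`. On the retriever side, the kwargs would be unpacked into the vector store constructor, with `index_schema` pointing at a file such as redis_schema_hnsw.yml.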

Issues

List the issue or RFC link this PR is working on. If there is no such link, please mark it as n/a.

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)
  • Others (enhancement, documentation, validation, etc.)

Dependencies

List the newly introduced 3rd party dependency if exists.

Tests

Describe the tests that you ran to verify your changes.

See the performance comparison between this PR (HNSW) and the v1.3 out-of-the-box configuration (FLAT) after ingesting pubmed_100files.txt (≈3.7M chunks).
[Two images attached: performance comparison, HNSW vs. FLAT]

@joshuayao joshuayao added this to OPEA Jun 10, 2025
@joshuayao joshuayao added this to the v1.4 milestone Jun 10, 2025
@joshuayao joshuayao added the feature New feature or request label Jun 10, 2025
@leslieluyu leslieluyu closed this Jun 12, 2025
@github-project-automation github-project-automation bot moved this to Done in OPEA Jun 12, 2025
Signed-off-by: leslieluyu <[email protected]>
@leslieluyu leslieluyu reopened this Jun 12, 2025
@xiguiw xiguiw merged commit 1866ad7 into opea-project:main Jun 13, 2025
26 checks passed

Labels

feature New feature or request

Projects

Status: Done

4 participants