update retrieval quality article #1241

Open
thierrypdamiba wants to merge 13 commits into master from retreival-quality

Conversation

@thierrypdamiba
Contributor

Make changes to the retrieval quality article

@netlify

netlify Bot commented Oct 17, 2024

Deploy Preview for condescending-goldwasser-91acf0 ready!

Name Link
🔨 Latest commit 37e8f14
🔍 Latest deploy log https://app.netlify.com/sites/condescending-goldwasser-91acf0/deploys/671fc5c27a73130008d0e984
😎 Deploy Preview https://deploy-preview-1241--condescending-goldwasser-91acf0.netlify.app

@davidmyriel davidmyriel requested a review from joein October 18, 2024 03:27
[Loading a dataset from Hugging Face hub](/documentation/tutorials/huggingface-datasets/) tutorial, `Qdrant/arxiv-titles-instructorxl-embeddings`
from the [Hugging Face hub](https://huggingface.co/datasets/Qdrant/arxiv-titles-instructorxl-embeddings). Let's download it in a streaming
mode, as we are only going to use part of it.
We’ll use a pre-embedded dataset from Hugging Face to train and test Qdrant’s search capabilities. First, load and split the dataset for training (1,000 items) and testing (100 items).
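A minimal sketch of the load-and-split step described in the changed line, assuming the dataset arrives as a stream of records. The helper name `split_stream`, the stand-in generator, and the record shape are hypothetical, and the sizes (1,000 and 100) are simply the ones quoted in the sentence above, so they may not match the values in the article's code.

```python
from itertools import islice


def split_stream(records, train_size, test_size):
    """Take the first `train_size` records for indexing and the next `test_size` as queries."""
    iterator = iter(records)
    train = list(islice(iterator, train_size))
    test = list(islice(iterator, test_size))
    return train, test


# Stand-in for a streamed dataset (hypothetical record shape: id plus vector).
stream = ({"id": i, "vector": [float(i), float(i) + 0.5]} for i in range(10_000))

# Sizes taken from the sentence above; adjust them to whatever the article's code uses.
train, test = split_stream(stream, train_size=1_000, test_size=100)
```

Because the split consumes the stream lazily, only the first 1,100 records are ever materialized, which is the point of streaming when just part of the dataset is needed.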
Member

differs from the code values

@joein
Member

joein commented Oct 18, 2024

@thierrypdamiba @davidmyriel I actually liked the fact that in the previous version we said that embedding quality is crucial (maybe we paid it a bit more attention than required) and we explained why we're comparing exact search to ANN. Now the tutorial has become a bit faceless.

@thierrypdamiba
Contributor Author

@joein @davidmyriel I added information about embedding quality and ANN vs. exact search. I also updated the dataset numbers to reflect the code.
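For context, the ANN vs. exact comparison discussed above boils down to treating exact (brute-force) search as ground truth and measuring how much of it the approximate search recovers. The sketch below is an illustration only: the function names, the toy points, and the `ann_ids` list are hypothetical and do not come from the article's code.

```python
import math


def exact_top_k(query, points, k):
    """Brute-force kNN: score every point by cosine similarity and keep the top k ids."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    ranked = sorted(points, key=lambda p: cosine(query, p[1]), reverse=True)
    return [point_id for point_id, _ in ranked[:k]]


def precision_at_k(ann_ids, exact_ids, k):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(ann_ids[:k]) & set(exact_ids[:k])) / k


# Toy data: (id, vector) pairs.
points = [
    (1, [1.0, 0.0]),
    (2, [0.9, 0.1]),
    (3, [0.0, 1.0]),
    (4, [0.7, 0.7]),
    (5, [-1.0, 0.0]),
]
query = [1.0, 0.0]

exact_ids = exact_top_k(query, points, k=3)
ann_ids = [1, 2, 3]  # hypothetical ANN result that missed one true neighbor
score = precision_at_k(ann_ids, exact_ids, k=3)
```

A score below 1.0 means the index dropped some true neighbors, which is the signal that would motivate tuning the HNSW parameters.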

Comment thread qdrant-landing/package.json Outdated
"clipboard": "^2.0.11",
"qdrant-page-search": "^1.0.8"
"qdrant-page-search": "^1.0.8",
"react-router-dom": "^6.27.0"
Member

why do we need this?

Contributor Author

We don't need it. Removing now.

thierrypdamiba and others added 7 commits October 22, 2024 10:40
Update text and format to better reflect the benefit of ANN vs KNN/exact search and why a user would want to measure retrieval quality

TODO: Add screenshots of how you can do this in the webui
- **m**: This parameter determines the maximum number of connections per node in the HNSW graph. A higher value for `m` increases the connectivity of the graph, potentially improving search accuracy at the cost of increased memory usage and indexing time. The default value for `m` is 16.
- **ef_construct**: This parameter controls the size of the dynamic candidate list during index construction. A higher value of `ef_construct` leads to a more exhaustive search during the indexing phase, resulting in a higher quality graph and improved search accuracy. However, this comes at the cost of longer indexing times. The default value for `ef_construct` is 100.

We will use the untuned HNSW as the baseline to compare how changes affect the precision of the search. Initially, we will use the default values of `m` (16) and `ef_construct` (100) for the HNSW algorithm. Later, we will double these values to observe their impact on retrieval quality.
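The baseline-then-double plan in the paragraph above could be expressed roughly as follows. This is a sketch under assumptions: the helper `hnsw_config` is hypothetical, and the `{"hnsw_config": {...}}` request shape is assumed to match what a collection update in Qdrant expects, so check it against the API reference before relying on it.

```python
def hnsw_config(m=16, ef_construct=100):
    """Build an index-time HNSW config fragment; defaults are the values quoted above."""
    return {"m": m, "ef_construct": ef_construct}


# Untuned baseline: default m and ef_construct.
baseline = hnsw_config()

# Second run: double both values to see how precision changes.
doubled = hnsw_config(
    m=baseline["m"] * 2,
    ef_construct=baseline["ef_construct"] * 2,
)

# Assumed shape of the body sent when updating the collection's index settings.
update_body = {"hnsw_config": doubled}
```

Both values are index-time parameters, so changing them means the graph has to be rebuilt before precision is measured again.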
Member

We have already written what the default values are, so we can shorten this sentence, like
"We'll use the default m and ef as a baseline and then tweak the params to see how it affects the precision of the search."

Comment on lines +182 to +183
- If you require higher precision, increase `m` and `ef_construct` while considering the increased memory usage and indexing time.
- If memory and indexing time are critical constraints, tune the parameters incrementally to find the right balance.
Member

By the way, there is also a third parameter: ef (also known as efSearch). It controls the number of neighbors evaluated during the search; a higher value may increase precision, but it also increases latency.
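To make the search-time knob concrete, here is a sketch of how it might appear in a search request. It assumes Qdrant exposes `ef` as the `hnsw_ef` search parameter alongside an `exact` flag; the helpers `search_params` and `search_body`, the query vector, and the value 256 are all hypothetical.

```python
def search_params(hnsw_ef=None, exact=False):
    """Build search-time parameters; leaving hnsw_ef unset defers to the server default."""
    params = {"exact": exact}
    if hnsw_ef is not None:
        params["hnsw_ef"] = hnsw_ef
    return params


def search_body(vector, limit, **kwargs):
    """Assemble a search request body (assumed shape) for a single query vector."""
    return {"vector": vector, "limit": limit, "params": search_params(**kwargs)}


query_vector = [0.1, 0.2, 0.3]  # placeholder embedding

ann_default = search_body(query_vector, limit=10)             # default ef
ann_wide = search_body(query_vector, limit=10, hnsw_ef=256)   # larger candidate list, slower
ground_truth = search_body(query_vector, limit=10, exact=True)  # brute-force reference
```

Unlike `m` and `ef_construct`, this parameter is set per request, so it can be varied without rebuilding the index.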

Comment thread qdrant-landing/package.json
```

Response:
This step measures the initial retrieval quality before any tuning of the HNSW parameters. The HNSW (Hierarchical Navigable Small World) algorithm has two key parameters that influence search performance and quality:
Member

We could provide a bit more detail here:
There are two types of parameters that users can tune: index-time parameters and search-time parameters.
Index time: m and ef_construct; search time: ef.

I think that we might want to mention it here, rather than just add a brief sentence at the end of the article
However, I don't find the code adjustments to be a necessity

@thierrypdamiba thierrypdamiba requested a review from joein October 28, 2024 14:07
3 participants