update retrieval quality article #1241
✅ Deploy Preview for condescending-goldwasser-91acf0 ready!
| [Loading a dataset from Hugging Face hub](/documentation/tutorials/huggingface-datasets/) tutorial, `Qdrant/arxiv-titles-instructorxl-embeddings` | ||
| from the [Hugging Face hub](https://huggingface.co/datasets/Qdrant/arxiv-titles-instructorxl-embeddings). Let's download it in a streaming | ||
| mode, as we are only going to use part of it. | ||
| We’ll use a pre-embedded dataset from Hugging Face to train and test Qdrant’s search capabilities. First, load and split the dataset for training (1,000 items) and testing (100 items). |
@thierrypdamiba @davidmyriel I actually liked that the previous version said embedding quality is crucial (maybe we paid it a bit more attention than required) and explained why we're comparing exact search to ANN. Now the tutorial has become a bit faceless.
@joein @davidmyriel I added information about embedding quality and ANN vs. exact search. I also updated the dataset numbers to reflect the code.
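For reference, the ANN vs. exact search comparison mentioned in this thread can be sketched in plain Python. This is only an illustration, not the article's code: `exact_top_k` and `precision_at_k` are hypothetical helper names, cosine similarity is an assumed metric, and the brute-force scan stands in for whatever exact search the tutorial actually runs.

```python
import math


def exact_top_k(query, vectors, k):
    """Brute-force baseline: indices of the k vectors most similar to the query."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    # Score every vector (no index involved), then keep the k best.
    ranked = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]


def precision_at_k(ann_ids, exact_ids, k):
    """Fraction of the top-k ANN results that also appear in the exact top-k."""
    return len(set(ann_ids[:k]) & set(exact_ids[:k])) / k
```

Averaging `precision_at_k` over the test queries gives a single number to track before and after tuning the index.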
| "clipboard": "^2.0.11", | ||
| "qdrant-page-search": "^1.0.8" | ||
| "qdrant-page-search": "^1.0.8", | ||
| "react-router-dom": "^6.27.0" |
We don't need it. Removing now.
Update text and format to better reflect the benefit of ANN vs. KNN/exact search and why a user would want to measure retrieval quality. TODO: Add screenshots of how you can do this in the web UI
| - **m**: This parameter determines the maximum number of connections per node in the HNSW graph. A higher value for `m` increases the connectivity of the graph, potentially improving search accuracy at the cost of increased memory usage and indexing time. The default value for `m` is 16. | ||
| - **ef_construct**: This parameter controls the size of the dynamic candidate list during index construction. A higher value of `ef_construct` leads to a more exhaustive search during the indexing phase, resulting in a higher quality graph and improved search accuracy. However, this comes at the cost of longer indexing times. The default value for `ef_construct` is 100. | ||
| We will use the untuned HNSW as the baseline to compare how changes affect the precision of the search. Initially, we will use the default values of `m` (16) and `ef_construct` (100) for the HNSW algorithm. Later, we will double these values to observe their impact on retrieval quality. |
We have already stated the default values, so we can shorten this sentence to something like:
"We'll use the default m and ef as a baseline and then tweak the params to see how it affects the precision of the search."
| - If you require higher precision, increase `m` and `ef_construct` while considering the increased memory usage and indexing time. | ||
| - If memory and indexing time are critical constraints, tune the parameters incrementally to find the right balance. |
By the way, there is also a third parameter: `ef` (also known as `efSearch`). It controls the number of neighbors evaluated during the search. A higher value may increase precision, but it also increases latency.
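To make the split between index-time and search-time parameters concrete, here is a rough sketch of the request bodies involved. It assumes the field names of Qdrant's REST API as I understand them (`hnsw_config` with `m` and `ef_construct` on a collection update, `params.hnsw_ef` and `params.exact` on a search request). The helper names are hypothetical, and the value 128 in the usage note is an arbitrary example rather than a recommendation.

```python
# Index-time defaults quoted in the article: m = 16, ef_construct = 100.
DEFAULT_HNSW = {"m": 16, "ef_construct": 100}


def doubled_hnsw_config(base=None):
    """Collection-update body that doubles each index-time parameter."""
    base = DEFAULT_HNSW if base is None else base
    return {"hnsw_config": {name: value * 2 for name, value in base.items()}}


def search_body(vector, limit, hnsw_ef=None, exact=False):
    """Search body: hnsw_ef is the search-time knob, exact=True requests exact search."""
    params = {"exact": exact}
    if hnsw_ef is not None:
        params["hnsw_ef"] = hnsw_ef
    return {"vector": vector, "limit": limit, "params": params}
```

For example, `search_body(query_vector, 10, hnsw_ef=128)` would describe an ANN query with a wider search beam, while `search_body(query_vector, 10, exact=True)` would describe the exact baseline to compare against.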
| ``` | ||
| Response: | ||
| This step measures the initial retrieval quality before any tuning of the HNSW parameters. The HNSW (Hierarchical Navigable Small World) algorithm has two key parameters that influence search performance and quality: |
We could provide a bit more detail here:
There are two types of parameters that users can tune: index-time parameters and search-time parameters.
Index time: `m` and `ef_construct`; search time: `ef`.
I think we might want to mention it here, rather than just adding a brief sentence at the end of the article.
However, I don't find the code adjustments to be a necessity.
Make changes to the retrieval quality article