Hello @vishwarajanand, mentioning you here as you are the contributor of the Hybrid Search solution. First, thank you for working on it.
I have recently refactored my code to use the HybridSearchConfig, and in doing so I believe I have encountered several critical issues that need addressing to ensure correct hybrid search behavior.
1. Major Issue: Incorrect Score Fusion Logic
The current default implementation linearly combines the semantic search distance (dense vector):
langchain-postgres/langchain_postgres/v2/async_vectorstore.py, lines 618 to 622 in 18b1bcd:

```python
where_filters = f"WHERE {safe_filter}" if safe_filter else ""
dense_query_stmt = f"""SELECT {column_names}, {search_function}("{self.embedding_column}", {embedding_data_string}) as distance
FROM "{self.schema_name}"."{self.table_name}" {where_filters} ORDER BY "{self.embedding_column}" {operator} {embedding_data_string} LIMIT :k;
"""
param_dict = {"query_embedding": query_embedding, "k": k}
```
with the full-text search score from ts_rank_cd (sparse):

```python
sparse_query_stmt = f'SELECT {column_names}, ts_rank_cd({content_tsv}, {query_tsv}) as distance FROM "{self.schema_name}"."{self.table_name}" WHERE {content_tsv} @@ {query_tsv} {and_filters} ORDER BY distance desc LIMIT {hybrid_search_config.secondary_top_k};'
```
This is fundamentally flawed because:
- Dense distance: Lower is better (e.g., 0.1 is a strong match).
- Sparse score: Higher is better (e.g., 0.9 is a strong match).
Current logic (weighted_sum_ranking):

langchain-postgres/langchain_postgres/v2/hybrid_search_config.py, lines 36 to 64 in 18b1bcd:

```python
weighted_scores: dict[str, dict[str, Any]] = {}
# Process results from primary source
for row in primary_search_results:
    values = list(row.values())
    doc_id = str(values[0])  # first value is doc_id
    distance = float(values[-1])  # type: ignore # last value is distance
    row_values = dict(row)
    row_values["distance"] = primary_results_weight * distance
    weighted_scores[doc_id] = row_values

# Process results from secondary source,
# adding to existing scores or creating new ones
for row in secondary_search_results:
    values = list(row.values())
    doc_id = str(values[0])  # first value is doc_id
    distance = float(values[-1])  # type: ignore # last value is distance
    primary_score = (
        weighted_scores[doc_id]["distance"] if doc_id in weighted_scores else 0.0
    )
    row_values = dict(row)
    row_values["distance"] = distance * secondary_results_weight + primary_score
    weighted_scores[doc_id] = row_values

# Sort the results by weighted score in descending order
ranked_results = sorted(
    weighted_scores.values(), key=lambda item: item["distance"], reverse=True
)
return ranked_results[:fetch_top_k]
```
```python
# Example of the logic being used
final_score = dense_distance * dense_weight + sparse_score * sparse_weight
```
This leads to an inconsistent score, where worse semantic matches can be ranked higher than better ones.
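To make the inconsistency concrete, here is a minimal standalone sketch (with made-up scores, not the library's code) of what the current weighted sum does when the sparse side contributes nothing:

```python
# Hypothetical results: dense distance is LOWER-is-better,
# sparse ts_rank_cd score is HIGHER-is-better.
dense = {"doc_a": 0.1, "doc_b": 0.9}   # doc_a is the better semantic match
sparse = {"doc_a": 0.0, "doc_b": 0.0}  # sparse contributes nothing here

dense_weight, sparse_weight = 0.5, 0.5

fused = {
    doc: dense[doc] * dense_weight + sparse.get(doc, 0.0) * sparse_weight
    for doc in dense
}
# The fused "distance" is then sorted in DESCENDING order,
# so the WORSE semantic match comes out on top.
ranking = sorted(fused, key=lambda d: fused[d], reverse=True)
print(ranking)  # ['doc_b', 'doc_a'] -- doc_b (worse match) ranked first
```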
Note that reciprocal rank fusion avoids these problems, but it relies on each input result list being ordered best-first: the first element is the highest ranked and the last element the lowest. That is indeed the assumption made for the inputs to the reciprocal rank fusion function. However, both inputs are re-sorted in descending order of distance, which contradicts the fact that lower distances are better: the first element in each list ends up being the one with the highest distance.
langchain-postgres/langchain_postgres/v2/hybrid_search_config.py, lines 93 to 95 in 18b1bcd:

```python
for rank, row in enumerate(
    sorted(primary_search_results, key=lambda item: item["distance"], reverse=True)
):
```
Hence, both alternatives for retrieving the top k documents from hybrid search are flawed.
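A minimal sketch of the RRF problem, assuming the primary results arrive best-first with lower-is-better distances (toy data, not the library's code):

```python
# Primary results already come back ordered best-first from the SQL query
# (ORDER BY distance for dense search), but the fusion code re-sorts them
# with reverse=True, putting the WORST match at rank 0.
primary = [
    {"doc_id": "doc_a", "distance": 0.1},  # best semantic match
    {"doc_id": "doc_b", "distance": 0.9},  # worst semantic match
]

rrf_k = 60  # the usual RRF smoothing constant
scores: dict[str, float] = {}
for rank, row in enumerate(
    sorted(primary, key=lambda item: item["distance"], reverse=True)
):
    scores[row["doc_id"]] = 1.0 / (rrf_k + rank + 1)

# doc_b now receives the higher RRF score despite being the worse match.
print(scores["doc_b"] > scores["doc_a"])  # True
```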
Recommendations:
- Invert one of the metrics so that both follow the same direction (uniformly higher = better, or lower = better).
- Normalize the metrics so that both scores fall in the same range.
You could also use a modified reciprocal rank fusion function that doesn't sort the inputs again.
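As a sketch of the first two recommendations combined (the function names here are illustrative, not the library's API): min-max normalize both score sets to [0, 1], invert the dense distance, then take the weighted sum:

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize a doc_id -> score mapping into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(dense_dist, sparse_score, dense_w=0.5, sparse_w=0.5):
    # Invert the normalized dense distance: lower distance -> higher score,
    # so both components are now higher-is-better and comparable in range.
    dense_sim = {doc: 1.0 - d for doc, d in min_max(dense_dist).items()}
    sparse_sim = min_max(sparse_score)
    docs = set(dense_sim) | set(sparse_sim)
    fused = {
        doc: dense_w * dense_sim.get(doc, 0.0) + sparse_w * sparse_sim.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused, key=lambda d: fused[d], reverse=True)

# The better semantic match (doc_a) now ranks first.
print(fuse({"doc_a": 0.1, "doc_b": 0.9}, {"doc_a": 0.0, "doc_b": 0.0}))
```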
2. Configuration and Querying Issues
- Inconsistent use of `secondary_top_k`:
  - For the dense search, `k` is used (`param_dict = {"query_embedding": query_embedding, "k": k}`), while the sparse search uses `hybrid_search_config.secondary_top_k` instead.
  - Fix: I would use `k` for both queries.
- `HybridSearchConfig` placement in the `__query_collection` method:
  - Currently, it is initialized too late in the call stack.

    langchain-postgres/langchain_postgres/v2/async_vectorstore.py, lines 640 to 642 in 18b1bcd:

    ```python
    hybrid_search_config = kwargs.get(
        "hybrid_search_config", self.hybrid_search_config
    )
    ```
  - Fix: Move the initialization before `k` is defined.

    langchain-postgres/langchain_postgres/v2/async_vectorstore.py, lines 583 to 592 in 18b1bcd:

    ```python
    if not k:
        k = (
            max(
                self.k,
                self.hybrid_search_config.primary_top_k,
                self.hybrid_search_config.secondary_top_k,
            )
            if self.hybrid_search_config
            else self.k
        )
    ```
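The suggested reordering can be sketched as a standalone helper (a hypothetical simplification of the method, not the library's actual code): resolve the config from kwargs first, so a per-call override is also reflected in the default `k` computation.

```python
from dataclasses import dataclass

@dataclass
class Config:
    """Stand-in for HybridSearchConfig, reduced to the two top-k fields."""
    primary_top_k: int
    secondary_top_k: int

def resolve_k(k, default_k, instance_config, kwargs):
    # Resolve hybrid_search_config BEFORE computing the default k,
    # so a config passed via kwargs influences the result too.
    config = kwargs.get("hybrid_search_config", instance_config)
    if not k:
        k = (
            max(default_k, config.primary_top_k, config.secondary_top_k)
            if config
            else default_k
        )
    return k, config

# A per-call config with larger top-k values now raises the default k.
k, _ = resolve_k(None, 4, None, {"hybrid_search_config": Config(10, 20)})
print(k)  # 20
```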
Best,
Alberto.