
[BUG] Improvements in HybridSearchConfig — Modifications in Score Fusion Logic and Configuration #234

@alberto-agudo


Hello @vishwarajanand, I'm mentioning you here since you contributed the Hybrid Search solution. First, thank you for working on it.

I have recently refactored my code to include the HybridSearchConfig, and in doing so, I think I've encountered several critical issues that need addressing to ensure correctness in hybrid search behavior.


1. Major Issue: Incorrect Score Fusion Logic

The current default implementation linearly combines the semantic search distance (dense vector):

where_filters = f"WHERE {safe_filter}" if safe_filter else ""
dense_query_stmt = f"""SELECT {column_names}, {search_function}("{self.embedding_column}", {embedding_data_string}) as distance
FROM "{self.schema_name}"."{self.table_name}" {where_filters} ORDER BY "{self.embedding_column}" {operator} {embedding_data_string} LIMIT :k;
"""
param_dict = {"query_embedding": query_embedding, "k": k}

with the full-text search score from ts_rank_cd (sparse):

sparse_query_stmt = f'SELECT {column_names}, ts_rank_cd({content_tsv}, {query_tsv}) as distance FROM "{self.schema_name}"."{self.table_name}" WHERE {content_tsv} @@ {query_tsv} {and_filters} ORDER BY distance desc LIMIT {hybrid_search_config.secondary_top_k};'

This is fundamentally flawed because:

  • Dense distance: Lower is better (e.g., 0.1 is a strong match).
  • Sparse score: Higher is better (e.g., 0.9 is a strong match).

Current logic (weighted_sum_ranking):

weighted_scores: dict[str, dict[str, Any]] = {}
# Process results from primary source
for row in primary_search_results:
    values = list(row.values())
    doc_id = str(values[0])  # first value is doc_id
    distance = float(values[-1])  # type: ignore # last value is distance
    row_values = dict(row)
    row_values["distance"] = primary_results_weight * distance
    weighted_scores[doc_id] = row_values
# Process results from secondary source,
# adding to existing scores or creating new ones
for row in secondary_search_results:
    values = list(row.values())
    doc_id = str(values[0])  # first value is doc_id
    distance = float(values[-1])  # type: ignore # last value is distance
    primary_score = (
        weighted_scores[doc_id]["distance"] if doc_id in weighted_scores else 0.0
    )
    row_values = dict(row)
    row_values["distance"] = distance * secondary_results_weight + primary_score
    weighted_scores[doc_id] = row_values
# Sort the results by weighted score in descending order
ranked_results = sorted(
    weighted_scores.values(), key=lambda item: item["distance"], reverse=True
)
return ranked_results[:fetch_top_k]

# Example of the logic being used
final_score = dense_distance * dense_weight + sparse_score * sparse_weight

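This leads to an inconsistent score: the dense term grows as the semantic match gets worse, while the sparse term grows as the keyword match gets better. Because the combined value is then sorted in descending order, worse semantic matches can be ranked higher than better ones.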
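To make the inversion concrete, here is a minimal sketch with made-up numbers. The document IDs, distances, and weights are hypothetical and only mimic the formula above; they are not taken from the library.

```python
# Hypothetical values: two documents returned only by the dense search.
dense_weight = 0.5
sparse_weight = 0.5

# Dense distance: lower means a closer semantic match.
good_match = {"id": "doc_a", "dense_distance": 0.1, "sparse_score": 0.0}
poor_match = {"id": "doc_b", "dense_distance": 0.8, "sparse_score": 0.0}


def current_score(doc):
    # Same shape as the final_score expression above.
    return doc["dense_distance"] * dense_weight + doc["sparse_score"] * sparse_weight


# Descending sort, as in weighted_sum_ranking.
ranked = sorted([good_match, poor_match], key=current_score, reverse=True)
top_id = ranked[0]["id"]
print(top_id)  # doc_b: the worse semantic match comes first
```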

Note that reciprocal rank fusion avoids the scale problem because it only uses ranks, but it relies on each list of retrieved documents being ordered so that the first element is the best match and the last element is the worst. The inputs given to the reciprocal rank fusion function already satisfy this. However, the function then re-sorts both lists by distance in descending order, which breaks the dense results, where lower distances are better: after the re-sort, the first element in the list is the one with the highest distance.

for rank, row in enumerate(
    sorted(primary_search_results, key=lambda item: item["distance"], reverse=True)
):

Hence, both alternatives for retrieving the top k documents from hybrid search are flawed.

Recommendations:

  1. Invert one of the metrics so that both point in the same direction (higher = better or lower = better).
  2. Normalize the metrics so that both scores fall within the same range.
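One possible way to apply both recommendations is sketched below. It assumes rows are dicts whose first value is the document ID and which carry a "distance" key, as in the snippets above. The function names `min_max` and `normalized_weighted_sum`, the "score" key, and the choice of min-max scaling are my own illustration, not existing library code.

```python
from typing import Any


def min_max(values: list[float], invert: bool = False) -> list[float]:
    """Scale values to [0, 1]; with invert=True the smallest input maps to 1."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if invert else scaled


def normalized_weighted_sum(
    dense_rows: list[dict[str, Any]],
    sparse_rows: list[dict[str, Any]],
    dense_weight: float = 0.5,
    sparse_weight: float = 0.5,
    fetch_top_k: int = 4,
) -> list[dict[str, Any]]:
    # Dense distances are inverted so that higher always means better.
    dense_sims = min_max([float(r["distance"]) for r in dense_rows], invert=True)
    sparse_sims = min_max([float(r["distance"]) for r in sparse_rows])
    fused: dict[str, dict[str, Any]] = {}
    for row, sim in zip(dense_rows, dense_sims):
        doc_id = str(next(iter(row.values())))  # first value is doc_id
        fused[doc_id] = {**row, "score": dense_weight * sim}
    for row, sim in zip(sparse_rows, sparse_sims):
        doc_id = str(next(iter(row.values())))
        prior = fused[doc_id]["score"] if doc_id in fused else 0.0
        fused[doc_id] = {**row, "score": prior + sparse_weight * sim}
    ranked = sorted(fused.values(), key=lambda r: r["score"], reverse=True)
    return ranked[:fetch_top_k]
```

Min-max scaling is only one option: it depends on the result set, and the weakest row in each list always gets 0. If the distance metric has known bounds, a fixed transformation may be more stable.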

You could also use a modified reciprocal rank fusion function that doesn't sort the inputs again.
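A rank fusion that trusts the incoming order could look roughly like this. It assumes both result lists arrive ordered best match first, which is what the ORDER BY clauses in the queries above are meant to produce. The function name `rrf_without_resort`, the "rrf_score" key, and the default constant of 60 are assumptions for illustration, not the library's current signature.

```python
from typing import Any


def rrf_without_resort(
    dense_rows: list[dict[str, Any]],
    sparse_rows: list[dict[str, Any]],
    rrf_k: int = 60,
    fetch_top_k: int = 4,
) -> list[dict[str, Any]]:
    scores: dict[str, float] = {}
    rows_by_id: dict[str, dict[str, Any]] = {}
    for rows in (dense_rows, sparse_rows):
        # No sorting here: position in the list is taken as the rank.
        for rank, row in enumerate(rows):
            doc_id = str(next(iter(row.values())))  # first value is doc_id
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank + 1)
            rows_by_id.setdefault(doc_id, dict(row))
    ranked_ids = sorted(scores, key=lambda d: scores[d], reverse=True)
    return [{**rows_by_id[d], "rrf_score": scores[d]} for d in ranked_ids[:fetch_top_k]]
```

Because only positions are used, the dense distances and sparse scores never need to share a scale or a direction.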


2. Configuration and Querying Issues

  1. Inconsistent use of secondary_search_top_k

  2. HybridSearchConfig placement in the __query_collection method


Best,
Alberto.
