Hello @vishwarajanand, mentioning you here as you are the contributor of the Hybrid Search solution. First, thank you for working on it.
I have recently refactored my code to use the HybridSearchConfig, and in doing so I believe I have encountered several critical issues that need addressing to ensure correct hybrid search behavior.
1. Major Issue: Incorrect Score Fusion Logic
The current default implementation linearly combines the semantic search distance (dense vector):
langchain-postgres/langchain_postgres/v2/async_vectorstore.py, lines 618 to 622 in 18b1bcd:

```python
where_filters = f"WHERE {safe_filter}" if safe_filter else ""
dense_query_stmt = f"""SELECT {column_names}, {search_function}("{self.embedding_column}", {embedding_data_string}) as distance
FROM "{self.schema_name}"."{self.table_name}" {where_filters} ORDER BY "{self.embedding_column}" {operator} {embedding_data_string} LIMIT :k;
"""
param_dict = {"query_embedding": query_embedding, "k": k}
```
with the full-text search score from ts_rank_cd (sparse):

```python
sparse_query_stmt = f'SELECT {column_names}, ts_rank_cd({content_tsv}, {query_tsv}) as distance FROM "{self.schema_name}"."{self.table_name}" WHERE {content_tsv} @@ {query_tsv} {and_filters} ORDER BY distance desc LIMIT {hybrid_search_config.secondary_top_k};'
```
This is fundamentally flawed because:
- Dense distance: Lower is better (e.g., 0.1 is a strong match).
- Sparse score: Higher is better (e.g., 0.9 is a strong match).
Current logic (weighted_sum_ranking):

langchain-postgres/langchain_postgres/v2/hybrid_search_config.py, lines 36 to 64 in 18b1bcd:

```python
weighted_scores: dict[str, dict[str, Any]] = {}
# Process results from primary source
for row in primary_search_results:
    values = list(row.values())
    doc_id = str(values[0])  # first value is doc_id
    distance = float(values[-1])  # type: ignore # last value is distance
    row_values = dict(row)
    row_values["distance"] = primary_results_weight * distance
    weighted_scores[doc_id] = row_values

# Process results from secondary source,
# adding to existing scores or creating new ones
for row in secondary_search_results:
    values = list(row.values())
    doc_id = str(values[0])  # first value is doc_id
    distance = float(values[-1])  # type: ignore # last value is distance
    primary_score = (
        weighted_scores[doc_id]["distance"] if doc_id in weighted_scores else 0.0
    )
    row_values = dict(row)
    row_values["distance"] = distance * secondary_results_weight + primary_score
    weighted_scores[doc_id] = row_values

# Sort the results by weighted score in descending order
ranked_results = sorted(
    weighted_scores.values(), key=lambda item: item["distance"], reverse=True
)
return ranked_results[:fetch_top_k]
```
```python
# Example of the logic being used
final_score = dense_distance * dense_weight + sparse_score * sparse_weight
```
This leads to an inconsistent score, where worse semantic matches can be ranked higher than better ones.
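To make the inconsistency concrete, here is a minimal standalone sketch (with made-up scores, not the library's code) of what the current weighted sum does when the sparse side contributes nothing:

```python
# Hypothetical results: dense distance is LOWER-is-better,
# sparse ts_rank_cd score is HIGHER-is-better.
dense = {"doc_a": 0.1, "doc_b": 0.9}   # doc_a is the better semantic match
sparse = {"doc_a": 0.0, "doc_b": 0.0}  # sparse contributes nothing here

dense_weight, sparse_weight = 0.5, 0.5

fused = {
    doc: dense[doc] * dense_weight + sparse.get(doc, 0.0) * sparse_weight
    for doc in dense
}
# The fused "distance" is then sorted in DESCENDING order,
# so the WORSE semantic match comes out on top.
ranking = sorted(fused, key=lambda d: fused[d], reverse=True)
print(ranking)  # ['doc_b', 'doc_a'] -- doc_b (worse match) ranked first
```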
Note that reciprocal rank fusion avoids these problems, but it relies on each input result list being ordered best-first: the first element is the highest ranked and the last element the lowest. That is indeed the assumption made for the inputs to the reciprocal rank fusion function. However, both inputs are re-sorted in descending order of distance, which contradicts the fact that lower distances are better: the first element in each list ends up being the one with the highest distance.
langchain-postgres/langchain_postgres/v2/hybrid_search_config.py, lines 93 to 95 in 18b1bcd:

```python
for rank, row in enumerate(
    sorted(primary_search_results, key=lambda item: item["distance"], reverse=True)
):
```
Hence, both alternatives for retrieving the top k documents from hybrid search are flawed.
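A minimal sketch of the RRF problem, assuming the primary results arrive best-first with lower-is-better distances (toy data, not the library's code):

```python
# Primary results already come back ordered best-first from the SQL query
# (ORDER BY distance for dense search), but the fusion code re-sorts them
# with reverse=True, putting the WORST match at rank 0.
primary = [
    {"doc_id": "doc_a", "distance": 0.1},  # best semantic match
    {"doc_id": "doc_b", "distance": 0.9},  # worst semantic match
]

rrf_k = 60  # the usual RRF smoothing constant
scores: dict[str, float] = {}
for rank, row in enumerate(
    sorted(primary, key=lambda item: item["distance"], reverse=True)
):
    scores[row["doc_id"]] = 1.0 / (rrf_k + rank + 1)

# doc_b now receives the higher RRF score despite being the worse match.
print(scores["doc_b"] > scores["doc_a"])  # True
```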
Recommendations:
- Invert one of the metrics so that both follow the same direction (uniformly higher = better, or lower = better).
- Normalize the metrics so that both scores fall in the same range.
You could also use a modified reciprocal rank fusion function that doesn't sort the inputs again.
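As a sketch of the first two recommendations combined (the function names here are illustrative, not the library's API): min-max normalize both score sets to [0, 1], invert the dense distance, then take the weighted sum:

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize a doc_id -> score mapping into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(dense_dist, sparse_score, dense_w=0.5, sparse_w=0.5):
    # Invert the normalized dense distance: lower distance -> higher score,
    # so both components are now higher-is-better and comparable in range.
    dense_sim = {doc: 1.0 - d for doc, d in min_max(dense_dist).items()}
    sparse_sim = min_max(sparse_score)
    docs = set(dense_sim) | set(sparse_sim)
    fused = {
        doc: dense_w * dense_sim.get(doc, 0.0) + sparse_w * sparse_sim.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused, key=lambda d: fused[d], reverse=True)

# The better semantic match (doc_a) now ranks first.
print(fuse({"doc_a": 0.1, "doc_b": 0.9}, {"doc_a": 0.0, "doc_b": 0.0}))
```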
2. Configuration and Querying Issues
- Inconsistent use of `secondary_top_k`:
  - For the dense search, `k` is used (`param_dict = {"query_embedding": query_embedding, "k": k}`), while the sparse search uses `hybrid_search_config.secondary_top_k` instead.
  - Fix: I would use `k` for both queries.
- `HybridSearchConfig` placement in the `__query_collection` method:
  - Currently, it is initialized too late in the call stack.

    langchain-postgres/langchain_postgres/v2/async_vectorstore.py, lines 640 to 642 in 18b1bcd:

    ```python
    hybrid_search_config = kwargs.get(
        "hybrid_search_config", self.hybrid_search_config
    )
    ```
  - Fix: Move the initialization before `k` is defined.

    langchain-postgres/langchain_postgres/v2/async_vectorstore.py, lines 583 to 592 in 18b1bcd:

    ```python
    if not k:
        k = (
            max(
                self.k,
                self.hybrid_search_config.primary_top_k,
                self.hybrid_search_config.secondary_top_k,
            )
            if self.hybrid_search_config
            else self.k
        )
    ```
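The suggested reordering can be sketched as a standalone helper (a hypothetical simplification of the method, not the library's actual code): resolve the config from kwargs first, so a per-call override is also reflected in the default `k` computation.

```python
from dataclasses import dataclass

@dataclass
class Config:
    """Stand-in for HybridSearchConfig, reduced to the two top-k fields."""
    primary_top_k: int
    secondary_top_k: int

def resolve_k(k, default_k, instance_config, kwargs):
    # Resolve hybrid_search_config BEFORE computing the default k,
    # so a config passed via kwargs influences the result too.
    config = kwargs.get("hybrid_search_config", instance_config)
    if not k:
        k = (
            max(default_k, config.primary_top_k, config.secondary_top_k)
            if config
            else default_k
        )
    return k, config

# A per-call config with larger top-k values now raises the default k.
k, _ = resolve_k(None, 4, None, {"hybrid_search_config": Config(10, 20)})
print(k)  # 20
```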
Best,
Alberto.