docs: add hybrid search documentation (#41)

xxsc0529 · web-flow · commit f5db55027f62 · 2025-11-10T14:53:50.000+08:00
diff --git a/README.md b/README.md
@@ -33,10 +33,11 @@ For detailed release notes and changelog, see [RELEASE_NOTES.md](RELEASE_NOTES.m
 
 ## Usage
 
-`pyobvector` supports two modes:
+`pyobvector` supports three modes:
 
 - `Milvus compatible mode`: You can use the `MilvusLikeClient` class to use vector storage in a way similar to the Milvus API
 - `SQLAlchemy hybrid mode`: You can use the vector storage function provided by the `ObVecClient` class and execute the relational database statement with the SQLAlchemy library. In this mode, you can regard `pyobvector` as an extension of SQLAlchemy.
+- `Hybrid Search mode`: You can use the `HybridSearch` class to perform hybrid search that combines full-text search and vector similarity search, with Elasticsearch-compatible query syntax.
 
 ### Milvus compatible mode
 
@@ -210,22 +211,21 @@ res = self.client.ann_search(
 The `ann_search` method supports flexible output column selection through the `output_columns` parameter:
 
 - **`output_columns`** (recommended): Accepts SQLAlchemy Column objects, expressions, or a mix of both
+
   - Column objects: `table.c.id`, `table.c.name`
   - Expressions: `(table.c.age + 10).label('age_plus_10')`
   - JSON queries: `text("JSON_EXTRACT(meta, '$.key') as extracted_key")`
   - String functions: `func.concat(table.c.name, ' (', table.c.age, ')').label('name_age')`
-
 - **`output_column_names`** (legacy): Accepts list of column name strings
-  - Example: `['id', 'name', 'meta']`
 
+  - Example: `['id', 'name', 'meta']`
 - **Parameter Priority**: `output_columns` takes precedence over `output_column_names` when both are provided
-
 - **`distance_threshold`** (optional): Filter results by distance threshold
+
   - Type: `Optional[float]`
   - Only returns results where `distance <= threshold`
   - Example: `distance_threshold=0.5` returns only results with distance <= 0.5
   - Use case: Quality control for similarity search, only return highly similar results
-
 - If you want to use pure `SQLAlchemy` API with `OceanBase` dialect, you can just get an `SQLAlchemy.engine` via `client.engine`. The engine can also be created as following:
 
 ```python
@@ -263,3 +263,170 @@ engine = create_async_engine(connection_str)
 ```
 
 - For further usage in pure `SQLAlchemy` mode, please refer to [SQLAlchemy](https://www.sqlalchemy.org/)
+
+### Hybrid Search Mode
+
+`pyobvector` supports hybrid search that combines full-text search and vector similarity search, with query syntax compatible with Elasticsearch. This lets a single query combine keyword matching with vector similarity.
+
+- set up a client:
+
+```python
+from pyobvector import *
+from pyobvector.client.hybrid_search import HybridSearch
+from sqlalchemy import Column, Integer, VARCHAR
+
+client = HybridSearch(uri="127.0.0.1:2881", user="test@test")
+```
+
+**Note**: Hybrid search requires OceanBase version >= 4.4.1.0, or SeekDB.
+
+- create a table with both a vector index and full-text indexes:
+
+```python
+test_table_name = "hybrid_search_test"
+
+# create table with vector and text columns
+client.create_table(
+    table_name=test_table_name,
+    columns=[
+        Column("id", Integer, primary_key=True, autoincrement=False),
+        Column("source_id", VARCHAR(32)),
+        Column("enabled", Integer),
+        Column("vector", VECTOR(3)),  # vector column
+        Column("title", VARCHAR(255)),  # text column for full-text search
+        Column("content", VARCHAR(255)),  # text column for full-text search
+    ],
+    indexes=[
+        VectorIndex("vec_idx", "vector", params="distance=l2, type=hnsw, lib=vsag"),
+    ],
+    mysql_charset='utf8mb4',
+    mysql_collate='utf8mb4_unicode_ci',
+)
+
+# create full-text indexes for text columns
+from pyobvector import FtsIndexParam, FtsParser
+
+for col in ["title", "content"]:
+    client.create_fts_idx_with_fts_index_param(
+        table_name=test_table_name,
+        fts_idx_param=FtsIndexParam(
+            index_name=f"fts_idx_{col}",
+            field_names=[col],
+            parser_type=FtsParser.IK,  # IK parser, suited to Chinese text; other FtsParser values can be used
+        ),
+    )
+```
+
+- insert data:
+
+```python
+client.insert(
+    table_name=test_table_name,
+    data=[
+        {
+            "id": 1,
+            "source_id": "3b767712b57211f09c170242ac130008",
+            "enabled": 1,
+            "vector": [1, 1, 1],
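+            "title": "企业版和社区版的功能差异",  # "Feature differences between the Enterprise and Community editions"
+            "content": "OceanBase 数据库提供企业版和社区版两种形态。",  # "OceanBase Database comes in Enterprise and Community editions."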
+        },
+        {
+            "id": 2,
+            "vector": [1, 2, 3],
+            "enabled": 1,
+            "source_id": "3b791472b57211f09c170242ac130008",
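+            "title": "快速体验 OceanBase 社区版",  # "Quick start with OceanBase Community Edition"
+            "content": "本文根据使用场景详细介绍如何快速部署 OceanBase 数据库。",  # "This article explains how to quickly deploy OceanBase Database for different usage scenarios."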
+        },
+        # ... more data
+    ]
+)
+```
+
+- perform hybrid search with Elasticsearch-compatible query syntax:
+
+```python
+# build query body (compatible with Elasticsearch syntax)
+query = {
+    "bool": {
+        "must": [
+            {
+                "query_string": {
+                    "fields": ["title^10", "content"],  # field weights
+                    "type": "best_fields",
+                    "query": "oceanbase 数据 迁移",  # "oceanbase data migration"
+                    "minimum_should_match": "30%",
+                    "boost": 1
+                }
+            }
+        ],
+        "filter": [
+            {
+                "terms": {
+                    "source_id": [
+                        "3b791472b57211f09c170242ac130008",
+                        "3b7af31eb57211f09c170242ac130008"
+                    ]
+                }
+            },
+            {
+                "bool": {
+                    "must_not": [
+                        {
+                            "range": {
+                                "enabled": {"lt": 1}
+                            }
+                        }
+                    ]
+                }
+            }
+        ],
+        "boost": 0.7
+    }
+}
+
+body = {
+    "query": query,
+    "knn": {  # vector similarity search
+        "field": "vector",
+        "k": 1024,
+        "num_candidates": 1024,
+        "query_vector": [1, 2, 3],
+        "filter": query,  # optional: apply same filter to KNN
+        "similarity": 0.2  # similarity threshold
+    },
+    "from": 0,  # pagination offset
+    "size": 60  # pagination size
+}
+
+# execute hybrid search
+results = client.search(index=test_table_name, body=body)
+# results is a list of matching documents
+```
+
+#### Supported Query Types
+
+Hybrid search supports the following Elasticsearch-compatible query types and parameters:
+
+- **`bool` query**: Combine multiple queries with `must`, `must_not`, `should`, `filter`
+- **`query_string`**: Full-text search with field weights, boost, and matching options
+- **`terms`**: Exact match filtering for multiple values
+- **`range`**: Range queries (`lt`, `lte`, `gt`, `gte`)
+- **`knn`**: Vector similarity search (KNN) with:
+  - `field`: Vector field name
+  - `query_vector`: Query vector
+  - `k`: Number of results to return
+  - `num_candidates`: Number of candidates to consider
+  - `filter`: Optional filter to apply to KNN search
+  - `similarity`: Similarity threshold
+- **Pagination**: `from` and `size` parameters
+
+#### Get SQL Query
+
+You can also retrieve the SQL statement generated for a request body:
+
+```python
+sql = client.get_sql(index=test_table_name, body=body)
+print(sql)  # prints the SQL query
+```
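
As a minimal sketch, the request body from the hybrid search example in this diff could be assembled by a small helper so that the text query, the filter, and the KNN clause stay in sync. `build_hybrid_body` and its parameters are hypothetical names introduced here for illustration and are not part of `pyobvector`; the dictionary keys mirror those used in the README example, and reusing the whole query as the KNN filter follows that example rather than any documented requirement.

```python
from typing import Any, Dict, List, Optional


def build_hybrid_body(
    text: str,
    fields: List[str],
    query_vector: List[float],
    source_ids: Optional[List[str]] = None,
    k: int = 10,
    offset: int = 0,
    size: int = 10,
) -> Dict[str, Any]:
    """Assemble an Elasticsearch-style body with a text query and a KNN clause."""
    # Full-text part: one query_string clause over the weighted fields.
    bool_query: Dict[str, Any] = {
        "must": [{"query_string": {"fields": fields, "query": text}}],
    }
    # Optional exact-match filter on source_id (hypothetical convenience).
    if source_ids:
        bool_query["filter"] = [{"terms": {"source_id": source_ids}}]
    query = {"bool": bool_query}
    return {
        "query": query,
        "knn": {
            "field": "vector",
            "k": k,
            "num_candidates": k,
            "query_vector": query_vector,
            "filter": query,  # same query reused as the KNN filter, as in the README example
        },
        "from": offset,  # pagination offset
        "size": size,  # page size
    }


body = build_hybrid_body(
    text="oceanbase",
    fields=["title^10", "content"],
    query_vector=[1, 2, 3],
    source_ids=["3b791472b57211f09c170242ac130008"],
    size=60,
)
```

The resulting `body` has the same shape as the hand-written one in the README example, so it could be passed to `client.search(index=test_table_name, body=body)` or `client.get_sql(...)` in the same way.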