Commit f5db550

docs: add hybrid search documentation (#41)
1 parent 30cc6a7 commit f5db550

1 file changed: +172 -5 lines changed

README.md

Lines changed: 172 additions & 5 deletions
@@ -33,10 +33,11 @@ For detailed release notes and changelog, see [RELEASE_NOTES.md](RELEASE_NOTES.m
3333

3434
## Usage
3535

36-
`pyobvector` supports two modes:
36+
`pyobvector` supports three modes:
3737

3838
- `Milvus compatible mode`: You can use the `MilvusLikeClient` class to use vector storage in a way similar to the Milvus API
3939
- `SQLAlchemy hybrid mode`: You can use the vector storage functions provided by the `ObVecClient` class and execute relational database statements with the SQLAlchemy library. In this mode, you can regard `pyobvector` as an extension of SQLAlchemy.
40+
- `Hybrid Search mode`: You can use the `HybridSearch` class to perform hybrid search that combines full-text search and vector similarity search, with Elasticsearch-compatible query syntax.
4041

4142
### Milvus compatible mode
4243

@@ -210,22 +211,21 @@ res = self.client.ann_search(
210211
The `ann_search` method supports flexible output column selection through the `output_columns` parameter:
211212

212213
- **`output_columns`** (recommended): Accepts SQLAlchemy Column objects, expressions, or a mix of both
214+
213215
- Column objects: `table.c.id`, `table.c.name`
214216
- Expressions: `(table.c.age + 10).label('age_plus_10')`
215217
- JSON queries: `text("JSON_EXTRACT(meta, '$.key') as extracted_key")`
216218
- String functions: `func.concat(table.c.name, ' (', table.c.age, ')').label('name_age')`
217-
218219
- **`output_column_names`** (legacy): Accepts list of column name strings
219-
- Example: `['id', 'name', 'meta']`
220220

221+
- Example: `['id', 'name', 'meta']`
221222
- **Parameter Priority**: `output_columns` takes precedence over `output_column_names` when both are provided
222-
223223
- **`distance_threshold`** (optional): Filter results by distance threshold
224+
224225
- Type: `Optional[float]`
225226
- Only returns results where `distance <= threshold`
226227
- Example: `distance_threshold=0.5` returns only results with distance <= 0.5
227228
- Use case: Quality control for similarity search, only return highly similar results
228-
229229
- If you want to use the pure `SQLAlchemy` API with the `OceanBase` dialect, you can get an `SQLAlchemy.engine` via `client.engine`. The engine can also be created as follows:
230230

231231
```python
@@ -263,3 +263,170 @@ engine = create_async_engine(connection_str)
263263
```
264264

265265
- For further usage in pure `SQLAlchemy` mode, please refer to [SQLAlchemy](https://www.sqlalchemy.org/)

### Hybrid Search Mode

`pyobvector` supports hybrid search that combines full-text search and vector similarity search, with query syntax compatible with Elasticsearch. This allows you to perform semantic search with both keyword matching and vector similarity in a single query.

- set up a client:

```python
from pyobvector import *
from pyobvector.client.hybrid_search import HybridSearch
from sqlalchemy import Column, Integer, VARCHAR

client = HybridSearch(uri="127.0.0.1:2881", user="test@test")
```

**Note**: Hybrid search requires OceanBase version >= 4.4.1.0, or SeekDB.
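
As a rough illustration of the version requirement above, the sketch below compares a dotted version string against 4.4.1.0. The helper names `parse_version` and `supports_hybrid_search` are hypothetical and not part of `pyobvector`. The sketch assumes you have already obtained the server version as a purely numeric dotted string, and it does not cover the SeekDB case.

```python
# Hypothetical helpers, not part of pyobvector.
MIN_HYBRID_SEARCH_VERSION = (4, 4, 1, 0)


def parse_version(version: str, width: int = 4) -> tuple[int, ...]:
    """Turn a dotted version string such as '4.4.1.0' into a tuple of ints.

    Shorter versions are padded with zeros so that '4.4.1' equals '4.4.1.0'.
    """
    parts = [int(part) for part in version.strip().split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def supports_hybrid_search(version: str) -> bool:
    """Return True when the version is at least 4.4.1.0."""
    return parse_version(version) >= MIN_HYBRID_SEARCH_VERSION


print(supports_hybrid_search("4.4.1.0"))  # True
print(supports_hybrid_search("4.3.5.2"))  # False
```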

- create a table with both vector index and full-text index:

```python
test_table_name = "hybrid_search_test"

# create table with vector and text columns
client.create_table(
    table_name=test_table_name,
    columns=[
        Column("id", Integer, primary_key=True, autoincrement=False),
        Column("source_id", VARCHAR(32)),
        Column("enabled", Integer),
        Column("vector", VECTOR(3)),  # vector column
        Column("title", VARCHAR(255)),  # text column for full-text search
        Column("content", VARCHAR(255)),  # text column for full-text search
    ],
    indexes=[
        VectorIndex("vec_idx", "vector", params="distance=l2, type=hnsw, lib=vsag"),
    ],
    mysql_charset='utf8mb4',
    mysql_collate='utf8mb4_unicode_ci',
)

# create full-text indexes for text columns
from pyobvector import FtsIndexParam, FtsParser

for col in ["title", "content"]:
    client.create_fts_idx_with_fts_index_param(
        table_name=test_table_name,
        fts_idx_param=FtsIndexParam(
            index_name=f"fts_idx_{col}",
            field_names=[col],
            parser_type=FtsParser.IK,  # or other parser types
        ),
    )
```

- insert data:

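```python
# The sample titles and contents are Chinese text; English glosses are given in comments.
client.insert(
    table_name=test_table_name,
    data=[
        {
            "id": 1,
            "source_id": "3b767712b57211f09c170242ac130008",
            "enabled": 1,
            "vector": [1, 1, 1],
            # "Feature differences between the Enterprise and Community editions"
            "title": "企业版和社区版的功能差异",
            # "OceanBase Database is offered in two forms: Enterprise Edition and Community Edition."
            "content": "OceanBase 数据库提供企业版和社区版两种形态。",
        },
        {
            "id": 2,
            "vector": [1, 2, 3],
            "enabled": 1,
            "source_id": "3b791472b57211f09c170242ac130008",
            # "Quick start with OceanBase Community Edition"
            "title": "快速体验 OceanBase 社区版",
            # "This article details how to quickly deploy OceanBase Database for different usage scenarios."
            "content": "本文根据使用场景详细介绍如何快速部署 OceanBase 数据库。",
        },
        # ... more data
    ]
)
```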

- perform hybrid search with Elasticsearch-compatible query syntax:

```python
# build query body (compatible with Elasticsearch syntax)
query = {
    "bool": {
        "must": [
            {
                "query_string": {
                    "fields": ["title^10", "content"],  # field weights
                    "type": "best_fields",
                    "query": "oceanbase 数据 迁移",  # "oceanbase data migration"
                    "minimum_should_match": "30%",
                    "boost": 1
                }
            }
        ],
        "filter": [
            {
                "terms": {
                    "source_id": [
                        "3b791472b57211f09c170242ac130008",
                        "3b7af31eb57211f09c170242ac130008"
                    ]
                }
            },
            {
                "bool": {
                    "must_not": [
                        {
                            "range": {
                                "enabled": {"lt": 1}
                            }
                        }
                    ]
                }
            }
        ],
        "boost": 0.7
    }
}

body = {
    "query": query,
    "knn": {  # vector similarity search
        "field": "vector",
        "k": 1024,
        "num_candidates": 1024,
        "query_vector": [1, 2, 3],
        "filter": query,  # optional: apply same filter to KNN
        "similarity": 0.2  # similarity threshold
    },
    "from": 0,  # pagination offset
    "size": 60  # pagination size
}

# execute hybrid search
results = client.search(index=test_table_name, body=body)
# results is a list of matching documents
```
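
Because the nested body is easy to mistype, it may help to assemble it in a small function. The sketch below is only an illustration: `build_hybrid_body` is a hypothetical helper, not part of `pyobvector`, and it uses only keys that appear in the example above. It assumes the vector column is named `vector` and the filter column `source_id`, as in the sample table. Whether the library accepts a body that omits the remaining optional keys is not confirmed here.

```python
from typing import Any


def build_hybrid_body(
    text: str,
    query_vector: list[float],
    fields: list[str],
    source_ids: list[str],
    k: int = 10,
    offset: int = 0,
    size: int = 10,
) -> dict[str, Any]:
    """Hypothetical helper: build a search body from the keys shown in the README example."""
    query = {
        "bool": {
            # full-text part: match `text` against the weighted fields
            "must": [{"query_string": {"fields": fields, "query": text}}],
            # exact-match filter on the assumed `source_id` column
            "filter": [{"terms": {"source_id": source_ids}}],
        }
    }
    return {
        "query": query,
        "knn": {  # vector part, on the assumed `vector` column
            "field": "vector",
            "k": k,
            "num_candidates": k,
            "query_vector": query_vector,
        },
        "from": offset,
        "size": size,
    }


body = build_hybrid_body(
    text="oceanbase 数据 迁移",
    query_vector=[1, 2, 3],
    fields=["title^10", "content"],
    source_ids=["3b791472b57211f09c170242ac130008"],
)
print(sorted(body))  # ['from', 'knn', 'query', 'size']
```

The resulting dictionary has the same shape as the `body` passed to `client.search(index=test_table_name, body=body)` above.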

#### Supported Query Types

Hybrid search supports the following Elasticsearch-compatible query types:

- **`bool` query**: Combine multiple queries with `must`, `must_not`, `should`, `filter`
- **`query_string`**: Full-text search with field weights, boost, and matching options
- **`terms`**: Exact match filtering for multiple values
- **`range`**: Range queries (`lt`, `lte`, `gt`, `gte`)
- **`knn`**: Vector similarity search (KNN) with:
  - `field`: Vector field name
  - `query_vector`: Query vector
  - `k`: Number of results to return
  - `num_candidates`: Number of candidates to consider
  - `filter`: Optional filter to apply to KNN search
  - `similarity`: Similarity threshold
- **Pagination**: `from` and `size` parameters
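
To illustrate the pagination parameters, the sketch below generates one body per page by varying `from` and `size`. The `page_bodies` helper is hypothetical and not part of `pyobvector`. It assumes `from` is a zero-based row offset, which matches the "pagination offset" comment in the example above and Elasticsearch's own convention.

```python
import copy
from typing import Any, Iterator


def page_bodies(body: dict[str, Any], page_size: int, pages: int) -> Iterator[dict[str, Any]]:
    """Hypothetical helper: yield copies of `body` with `from`/`size` set for each page."""
    for page in range(pages):
        paged = copy.deepcopy(body)  # leave the caller's body untouched
        paged["from"] = page * page_size
        paged["size"] = page_size
        yield paged


base = {"query": {"query_string": {"fields": ["title"], "query": "oceanbase"}}}
offsets = [(b["from"], b["size"]) for b in page_bodies(base, page_size=20, pages=3)]
print(offsets)  # [(0, 20), (20, 20), (40, 20)]
```

Each yielded body could then be passed to `client.search` in turn.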

#### Get SQL Query

You can also retrieve the SQL that would be executed for a given search body:

```python
sql = client.get_sql(index=test_table_name, body=body)
print(sql)  # prints the SQL query
```
