34 changes: 15 additions & 19 deletions src/pyseekdb/client/client_base.py
@@ -2546,14 +2546,13 @@ def _collection_hybrid_search(
         # Execute SET statement
         self._execute_query_with_cursor(conn, set_sql, [], use_context_manager)

-        # Get SQL query from DBMS_HYBRID_SEARCH.GET_SQL
-        get_sql_query = f"SELECT DBMS_HYBRID_SEARCH.GET_SQL('{table_name}', @search_parm) as query_sql FROM dual"
-        logger.debug(f"Getting SQL query: {get_sql_query}")
+        # Call DBMS_HYBRID_SEARCH.SEARCH directly instead of GET_SQL + SQL execution
+        search_query = f"SELECT DBMS_HYBRID_SEARCH.SEARCH('{table_name}', @search_parm) as search_result FROM dual"
+        logger.debug(f"Executing search query: {search_query}")

-        rows = self._execute_query_with_cursor(conn, get_sql_query, [], use_context_manager)
-
-        if not rows or not rows[0].get("query_sql"):
-            logger.warning("No SQL query returned from GET_SQL")
+        rows = self._execute_query_with_cursor(conn, search_query, [], use_context_manager)
Comment on lines +2549 to +2553

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find the method containing lines 2549-2553
rg -n "_collection_hybrid_search|def.*hybrid" src/pyseekdb/client/client_base.py -B 5 -A 30 | head -100

Repository: oceanbase/pyseekdb

Length of output: 1876


🏁 Script executed:

# Check the signature of _execute_query_with_cursor
rg -n "_execute_query_with_cursor" src/pyseekdb/client/client_base.py -B 2 -A 10 | head -60

Repository: oceanbase/pyseekdb

Length of output: 2494


🏁 Script executed:

# Understand how table_name is derived in the hybrid search context
rg -n "table_name.*=" src/pyseekdb/client/client_base.py | grep -A 2 -B 2 "2[45]"

Repository: oceanbase/pyseekdb

Length of output: 929


🏁 Script executed:

# Check what _transform_sql_result expects
rg -n "_transform_sql_result" src/pyseekdb/client/client_base.py -B 2 -A 5

Repository: oceanbase/pyseekdb

Length of output: 770


🏁 Script executed:

# Find CollectionNames class and table_name methods
rg -n "class CollectionNames|def table_name" src/pyseekdb -B 2 -A 8

Repository: oceanbase/pyseekdb

Length of output: 1576


🏁 Script executed:

# Look for other uses of DBMS_HYBRID_SEARCH in the codebase
rg -n "DBMS_HYBRID_SEARCH" src/pyseekdb --context 3

Repository: oceanbase/pyseekdb

Length of output: 1876


🏁 Script executed:

# Check how parameters are used in other execute_query_with_cursor calls
rg -n "_execute_query_with_cursor.*\[" src/pyseekdb/client/client_base.py | head -20

Repository: oceanbase/pyseekdb

Length of output: 331


🏁 Script executed:

# Look at lines 2549-2570 more carefully to understand the full context
sed -n '2540,2580p' src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 1873


🏁 Script executed:

# Find escape_string function
rg -n "def escape_string|escape_string" src/pyseekdb --context 2 | head -40

Repository: oceanbase/pyseekdb

Length of output: 2960


🏁 Script executed:

# Check how other SQL queries handle table_name interpolation
rg -n "SELECT.*\`.*table" src/pyseekdb/client/client_base.py | head -10

Repository: oceanbase/pyseekdb

Length of output: 460


🏁 Script executed:

# Look for backtick usage around table_name in other queries
rg -n "\`.*table_name" src/pyseekdb/client/client_base.py | head -15

Repository: oceanbase/pyseekdb

Length of output: 1795


🏁 Script executed:

# Check UPDATE statement with backticks (from earlier search showing line 1525)
sed -n '1520,1530p' src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 450


Escape or parameterize the table_name in the DBMS_HYBRID_SEARCH.SEARCH call to prevent SQL injection.

The table_name is derived from user-supplied collection_name via CollectionNames.table_name() and is interpolated directly into the SQL string without escaping or parameterization. Although the preceding SET statement uses escape_string() for the search parameters, this call lacks protection. Use parameter binding with %s to align with the parameterized approach already supported by _execute_query_with_cursor.

Proposed fix
-        search_query = f"SELECT DBMS_HYBRID_SEARCH.SEARCH('{table_name}', @search_parm) as search_result FROM dual"
-        logger.debug(f"Executing search query: {search_query}")
-
-        rows = self._execute_query_with_cursor(conn, search_query, [], use_context_manager)
+        search_query = "SELECT DBMS_HYBRID_SEARCH.SEARCH(%s, @search_parm) as search_result FROM dual"
+        logger.debug(f"Executing search query: {search_query}")
+
+        rows = self._execute_query_with_cursor(conn, search_query, [table_name], use_context_manager)
🧰 Tools
🪛 Ruff (0.14.14)

2550-2550: Possible SQL injection vector through string-based query construction

(S608)
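
To illustrate why binding matters here, the following is a minimal sketch using the standard-library `sqlite3` module as a stand-in driver. It is not pyseekdb code: the `items` table and the `hostile` value are hypothetical, and `sqlite3` uses `?` placeholders where the fix above uses `%s`. Whether `DBMS_HYBRID_SEARCH.SEARCH` accepts a bound value for its first argument is not confirmed by this thread and should be checked against the server.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")
conn.executemany("INSERT INTO items VALUES (?)", [("alpha",), ("beta",)])

# Hypothetical hostile input standing in for a user-supplied name.
hostile = "alpha' OR '1'='1"

# Interpolated: the quote inside the input closes the string literal,
# so the rest of the input is parsed as SQL and the filter matches every row.
interpolated = conn.execute(
    f"SELECT name FROM items WHERE name = '{hostile}'"
).fetchall()

# Bound: the value is passed separately from the SQL text,
# so it is compared as a single string and matches nothing.
bound = conn.execute(
    "SELECT name FROM items WHERE name = ?", (hostile,)
).fetchall()

print(len(interpolated), len(bound))
```

The interpolated query returns both rows, while the bound query returns none, which is the behaviour the proposed fix is aiming for with `table_name`.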

🤖 Prompt for AI Agents
In `@src/pyseekdb/client/client_base.py` around lines 2549 - 2553, The SQL
currently injects table_name directly into search_query; instead build the query
with a parameter placeholder and pass table_name as a bound parameter to
_execute_query_with_cursor to prevent SQL injection: change the constructed
search_query to use a %s placeholder for the DBMS_HYBRID_SEARCH.SEARCH first
argument (e.g. DBMS_HYBRID_SEARCH.SEARCH(%s, @search_parm) ...) and call
self._execute_query_with_cursor(conn, search_query, [table_name],
use_context_manager) so the table name from CollectionNames.table_name() is
bound instead of interpolated.

+        if not rows or not rows[0].get("search_result"):
+            logger.warning("No result returned from DBMS_HYBRID_SEARCH.SEARCH")
             return {
                 "ids": [[]],
                 "distances": [[]],
@@ -2562,19 +2561,16 @@ def _collection_hybrid_search(
                 "embeddings": [[]],
             }

-        # Get the SQL query string
-        query_sql = rows[0]["query_sql"]
-        if isinstance(query_sql, str):
-            # Remove any surrounding quotes if present
-            query_sql = query_sql.strip().strip("'\"")
-
-        logger.debug(f"Executing query SQL: {query_sql}")
-
-        # Execute the returned SQL query
-        result_rows = self._execute_query_with_cursor(conn, query_sql, [], use_context_manager)
+        # Parse the search result JSON
+        search_result_json = rows[0]["search_result"]
+        if isinstance(search_result_json, str):
+            search_result = json.loads(search_result_json)
+        else:
+            search_result = search_result_json
Comment on lines +2564 to +2569

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file structure and the specific lines mentioned
wc -l src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 101


🏁 Script executed:

# Read the context around lines 2564-2573
sed -n '2540,2590p' src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 2233


🏁 Script executed:

# Find the _transform_sql_result method definition
rg -n "_transform_sql_result" src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 182


🏁 Script executed:

# Find the _transform_search_result method definition
rg -n "_transform_search_result" src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 182


🏁 Script executed:

# Search for DBMS_HYBRID_SEARCH.SEARCH usage
rg -n "DBMS_HYBRID_SEARCH\.SEARCH" src/pyseekdb/client/client_base.py -A 5 -B 5

Repository: oceanbase/pyseekdb

Length of output: 1494


🏁 Script executed:

# Read _transform_sql_result implementation
sed -n '3007,3060p' src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 2030


🏁 Script executed:

# Read _transform_search_result implementation
sed -n '3107,3170p' src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 2271


🏁 Script executed:

# Check type hints for both methods
sed -n '3007,3010p' src/pyseekdb/client/client_base.py
sed -n '3107,3110p' src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 539


Validate SEARCH response schema before transforming.

_transform_sql_result expects list[dict[str, Any]], but will silently corrupt output if passed a dict with nested structure (it would iterate over dict keys instead of rows). DBMS_HYBRID_SEARCH.SEARCH may return either a flat list or a nested dict with hits key, but the code provides no schema validation. Use the existing _transform_search_result method for dict responses or extract rows accordingly.

Suggested validation
-        # Transform search result to standard format
-        return self._transform_sql_result(search_result, include)
+        # Transform search result to standard format
+        if isinstance(search_result, dict) and "hits" in search_result:
+            return self._transform_search_result(search_result, include)
+        if isinstance(search_result, dict) and "rows" in search_result:
+            search_result = search_result["rows"]
+        if not isinstance(search_result, list):
+            raise ValueError(f"Unexpected SEARCH result schema: {type(search_result)}")
+        return self._transform_sql_result(search_result, include)
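
As a standalone illustration of the validation idea, here is a minimal sketch of a helper that normalizes the payload before any transform runs. The name `normalize_search_rows` is hypothetical, and the `hits` and `rows` wrapper keys are assumptions carried over from the suggestion above. The actual shape returned by `DBMS_HYBRID_SEARCH.SEARCH` is not confirmed in this thread. Unlike the suggestion, this sketch unwraps a list under `hits` directly instead of delegating to `_transform_search_result`.

```python
import json
from typing import Any


def normalize_search_rows(raw: Any) -> list[dict[str, Any]]:
    """Return a list of row dicts from a SEARCH payload, or raise ValueError."""
    # The driver may hand back JSON text or an already-decoded object.
    payload = json.loads(raw) if isinstance(raw, (str, bytes)) else raw

    # Assumed wrapper keys; neither is confirmed by the source.
    if isinstance(payload, dict):
        for key in ("hits", "rows"):
            if key in payload:
                payload = payload[key]
                break

    # Reject anything that is not a list of dicts instead of passing it on.
    if not isinstance(payload, list) or not all(isinstance(r, dict) for r in payload):
        raise ValueError(f"Unexpected SEARCH result schema: {type(payload).__name__}")
    return payload
```

Failing loudly on an unknown shape keeps a bare dict from reaching `_transform_sql_result`, where iterating over it would yield keys instead of rows.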
🤖 Prompt for AI Agents
In `@src/pyseekdb/client/client_base.py` around lines 2564 - 2569, The code reads
search_result_json and passes it to _transform_sql_result but doesn't validate
its schema; if search_result_json is a dict (e.g., a nested response with a
'hits' key from DBMS_HYBRID_SEARCH.SEARCH) _transform_sql_result will iterate
keys and corrupt output. Update the parsing in the block that sets search_result
(inspect variable search_result_json) to detect dict-shaped responses: if it's a
dict, either extract the list of rows (e.g., search_result_json["hits"] or
similar) or pass the dict to _transform_search_result to obtain a list[dict[str,
Any]] before calling _transform_sql_result; ensure the final search_result is
validated as a list of dicts before further processing and raise or log a clear
error if the schema is unexpected.

+        logger.debug(f"Search result received from DBMS_HYBRID_SEARCH.SEARCH")

⚠️ Potential issue | 🟡 Minor

Remove redundant f-string.

f"..." without placeholders triggers F541.

🧹 Proposed fix
-        logger.debug(f"Search result received from DBMS_HYBRID_SEARCH.SEARCH")
+        logger.debug("Search result received from DBMS_HYBRID_SEARCH.SEARCH")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        logger.debug(f"Search result received from DBMS_HYBRID_SEARCH.SEARCH")
+        logger.debug("Search result received from DBMS_HYBRID_SEARCH.SEARCH")
🧰 Tools
🪛 Ruff (0.14.14)

2570-2570: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In `@src/pyseekdb/client/client_base.py` at line 2570, Replace the redundant
f-string in the logging call so it isn't using an f-string with no placeholders;
change the logger.debug call that currently reads logger.debug(f"Search result
received from DBMS_HYBRID_SEARCH.SEARCH") to use a plain string
logger.debug("Search result received from DBMS_HYBRID_SEARCH.SEARCH") in the
same function where logger.debug is called.

There are no placeholders in this f-string; please check whether this is intentional.


-        # Transform SQL query results to standard format
-        return self._transform_sql_result(result_rows, include)
+        # Transform search result to standard format
+        return self._transform_sql_result(search_result, include)

     def _build_search_parm(  # noqa: C901
         self,