# Building RAG with seekdb

This tutorial will guide you through importing Markdown documents into seekdb, building a hybrid search knowledge base, and launching a RAG interface via Streamlit.

## RAG Demo

We provide a complete RAG (Retrieval-Augmented Generation) demo application that shows how to build a hybrid search knowledge base with pyseekdb. The demo includes:

- **Document Import**: Import Markdown files or directories into seekdb
- **Vector Search**: Semantic search over the imported documents
- **RAG Interface**: An interactive Streamlit web interface for querying

The demo supports three embedding modes:

- `default`: Uses pyseekdb's built-in `DefaultEmbeddingFunction` (ONNX-based, 384 dimensions). No API key is required; the model is downloaded automatically on first use.
- `local`: Uses sentence-transformers models (e.g., `all-mpnet-base-v2`, 768 dimensions). Requires the `sentence-transformers` library.
- `api`: Uses OpenAI-compatible embedding API services (e.g., DashScope, OpenAI). Requires an API key.
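As a rough sketch, the mode selection above amounts to a small dispatch on `EMBEDDING_FUNCTION_TYPE`. The function below is illustrative only (the dict it returns is not pyseekdb's API), but the mode names, dimensions, and environment variables match the ones documented in this tutorial:

```python
import os

def resolve_embedding_mode(env=None):
    """Illustrative dispatch over the three embedding modes; the returned
    dict is a hypothetical description, not a pyseekdb object."""
    env = os.environ if env is None else env
    mode = env.get("EMBEDDING_FUNCTION_TYPE", "default")
    if mode == "default":
        # Built-in ONNX model: 384 dimensions, no API key needed
        return {"backend": "DefaultEmbeddingFunction", "dim": 384}
    if mode == "local":
        # sentence-transformers model, e.g. all-mpnet-base-v2 (768 dims)
        return {"backend": "sentence-transformers",
                "model": env.get("SENTENCE_TRANSFORMERS_MODEL_NAME",
                                 "all-mpnet-base-v2"),
                "dim": 768}
    if mode == "api":
        # OpenAI-compatible endpoint: an API key is mandatory here
        if not env.get("EMBEDDING_API_KEY"):
            raise ValueError("EMBEDDING_API_KEY is required for api mode")
        return {"backend": "api",
                "model": env.get("EMBEDDING_MODEL_NAME", "text-embedding-v4")}
    raise ValueError(f"unknown EMBEDDING_FUNCTION_TYPE: {mode!r}")
```

Passing an empty mapping falls back to `default`, mirroring the "no configuration needed" behavior described above.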

## Prerequisites

- Python 3.11 or higher
- The `uv` package manager
- An LLM API key

## Workspace workflow (optional)

By default, the commands in this tutorial assume you are in `demo/rag`. The demo is configured to use the local pyseekdb source from this repository via a uv workspace. If you prefer to stay at the repository root, use the workspace commands below.

Sync dependencies (workspace):

```bash
uv sync --project demo/rag
```

Install the local embedding extra for the demo:

```bash
uv sync --project demo/rag --extra local
```

Run the demo from the repo root:

```bash
uv run --project demo/rag streamlit run demo/rag/seekdb_app.py
```

Verify the demo uses the local pyseekdb source:

```bash
uv run --project demo/rag python -c "import os, pyseekdb; print(os.path.abspath(pyseekdb.__file__))"
```

The printed path should point to `src/pyseekdb` in this repository.

You can also run `make demo` from the repository root.

## Setup

### 1. Environment Setup

#### Install Dependencies

Note: The commands in this section assume you are in `demo/rag`. If you are running from the repository root, use the workspace workflow above.

Basic installation (for the `default` or `api` embedding types):

```bash
uv sync
```

With local embedding support (for the `local` embedding type):

```bash
uv sync --extra local
```

Note:

- The `local` extra includes `sentence-transformers` and related dependencies (~2-3 GB).
- If downloads are slow, you can use a mirror to accelerate them:
  - Basic installation (Tsinghua mirror): `uv sync --index-url https://pypi.tuna.tsinghua.edu.cn/simple`
  - Basic installation (Aliyun mirror): `uv sync --index-url https://mirrors.aliyun.com/pypi/simple`
  - Local model (Tsinghua mirror): `uv sync --extra local --index-url https://pypi.tuna.tsinghua.edu.cn/simple`
  - Local model (Aliyun mirror): `uv sync --extra local --index-url https://mirrors.aliyun.com/pypi/simple`

#### Configure Environment Variables

Step 1: Copy the environment variable template:

```bash
cp .env.example .env
```

Step 2: Edit the `.env` file and configure the environment variables.

The system supports three types of embedding functions; choose one based on your needs:

1. `default` (the default; recommended for beginners)

   - Uses pyseekdb's built-in `DefaultEmbeddingFunction` (ONNX-based)
   - Downloads the model automatically on first use; no API key required
   - Suitable for local development and testing

2. `local` (local model)

   - Uses custom sentence-transformers models
   - Requires the `sentence-transformers` library
   - Configurable model name and device (CPU/GPU)

3. `api` (API service)

   - Uses an OpenAI-compatible embedding API (e.g., DashScope, OpenAI)
   - Requires an API key and model name
   - Suitable for production environments

The following example uses Qwen (with the `api` type):

```bash
# Embedding Function type: api, local, default
EMBEDDING_FUNCTION_TYPE=api

# LLM configuration (for generating answers)
OPENAI_API_KEY=sk-your-dashscope-key
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENAI_MODEL_NAME=qwen-plus

# Embedding API configuration (required only when EMBEDDING_FUNCTION_TYPE=api)
EMBEDDING_API_KEY=sk-your-dashscope-key
EMBEDDING_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
EMBEDDING_MODEL_NAME=text-embedding-v4

# Local model configuration (required only when EMBEDDING_FUNCTION_TYPE=local)
SENTENCE_TRANSFORMERS_MODEL_NAME=all-mpnet-base-v2
SENTENCE_TRANSFORMERS_DEVICE=cpu

# seekdb configuration
SEEKDB_DIR=./data/seekdb_rag
SEEKDB_NAME=test
COLLECTION_NAME=embeddings
```

Environment Variable Reference:

| Variable Name | Description | Default/Example Value | Required Condition |
| --- | --- | --- | --- |
| `EMBEDDING_FUNCTION_TYPE` | Embedding function type | `default` (options: `api`, `local`, `default`) | Required |
| `OPENAI_API_KEY` | LLM API key (supports OpenAI, Qwen, etc.) | Must be set | Required (for generating answers) |
| `OPENAI_BASE_URL` | LLM API base URL | `https://dashscope.aliyuncs.com/compatible-mode/v1` | Optional |
| `OPENAI_MODEL_NAME` | Language model name | `qwen-plus` | Optional |
| `EMBEDDING_API_KEY` | Embedding API key | - | Required when `EMBEDDING_FUNCTION_TYPE=api` |
| `EMBEDDING_BASE_URL` | Embedding API base URL | `https://dashscope.aliyuncs.com/compatible-mode/v1` | Optional when `EMBEDDING_FUNCTION_TYPE=api` |
| `EMBEDDING_MODEL_NAME` | Embedding model name | `text-embedding-v4` | Required when `EMBEDDING_FUNCTION_TYPE=api` |
| `SENTENCE_TRANSFORMERS_MODEL_NAME` | Local model name | `all-mpnet-base-v2` | Optional when `EMBEDDING_FUNCTION_TYPE=local` |
| `SENTENCE_TRANSFORMERS_DEVICE` | Device to run on | `cpu` | Optional when `EMBEDDING_FUNCTION_TYPE=local` |
| `SEEKDB_DIR` | seekdb database directory | `./data/seekdb_rag` | Optional |
| `SEEKDB_NAME` | Database name | `test` | Optional |
| `COLLECTION_NAME` | Collection name | `embeddings` | Optional |

Tip:

- If using the `default` type, configure only `EMBEDDING_FUNCTION_TYPE=default` and the LLM-related settings
- If using the `api` type, the Embedding API variables must also be configured
- If using the `local` type, install the `sentence-transformers` library and optionally configure the model name
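To show how the optional seekdb variables and the documented defaults fit together, here is a minimal stdlib-only sketch; the helper name is hypothetical, and the demo's actual loading code may differ:

```python
import os

def load_seekdb_settings(env=None):
    """Read the seekdb-related variables, falling back to the
    defaults listed in the reference table above."""
    env = os.environ if env is None else env
    return {
        "dir": env.get("SEEKDB_DIR", "./data/seekdb_rag"),
        "db": env.get("SEEKDB_NAME", "test"),
        "collection": env.get("COLLECTION_NAME", "embeddings"),
    }
```

Because every seekdb variable is optional, calling this with nothing configured still yields a working local setup.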

### 2. Prepare Data

We use pyseekdb's SDK documentation as an example. You can also use your own Markdown documents or directories.

Import Data:

Run the data import script:

```bash
# Import a single file
uv run python seekdb_insert.py ../../README.md

# Or import all Markdown files from a directory
uv run python seekdb_insert.py path/to/your_dir
```

Import Instructions:

During this step, the system performs the following operations:

- Reads the specified Markdown file, or all Markdown files in the directory
- Splits documents into text chunks by headers (using `#` as the separator)
- Selects the embedding function based on the `EMBEDDING_FUNCTION_TYPE` configured in `.env`:
  - `default`: pyseekdb's built-in `DefaultEmbeddingFunction` (downloads the model automatically on first use)
  - `local`: a custom sentence-transformers model
  - `api`: the configured embedding API service
- Generates text embedding vectors automatically
- Stores the embedding vectors in the seekdb database
- Skips failed document chunks automatically, so one bad chunk does not abort the batch
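The header-based splitting step can be sketched as follows. This is a simplified stand-in for what `seekdb_insert.py` does, assuming a new chunk starts at every line beginning with `#`:

```python
def split_by_headers(markdown_text):
    """Split Markdown into chunks, starting a new chunk at each '#' header."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        # A header line closes the previous chunk and opens a new one
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop empty chunks (e.g., leading whitespace before the first header)
    return [c for c in chunks if c]
```

Each resulting chunk keeps its header together with the text under it, which is what gets embedded and stored.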

## Build RAG

Launch the application via Streamlit:

```bash
uv run streamlit run seekdb_app.py
```

After launching, you can access the RAG interface in your browser to query your data.
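At query time, a RAG answer boils down to retrieving the most relevant chunks and stuffing them into the LLM prompt. The minimal sketch below (a hypothetical helper, not the app's actual code) shows the prompt-assembly half of that loop:

```python
def build_rag_prompt(question, retrieved_chunks):
    """Assemble a grounded prompt from the question and retrieved context."""
    # Separate chunks clearly so the model can tell them apart
    context = "\n\n---\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string is what gets sent to the LLM configured via `OPENAI_*` in `.env`; the retrieval half is a vector search against the collection populated in the import step.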

Tip: When using the uv package manager, prefix commands with `uv run` to ensure the correct Python environment and dependencies are used.