This tutorial will guide you through importing Markdown documents into seekdb, building a hybrid search knowledge base, and launching a RAG interface via Streamlit.
We provide a complete RAG (Retrieval-Augmented Generation) demo application that shows how to build a hybrid search knowledge base using pyseekdb. The demo includes:
- Document Import: Import Markdown files or an entire directory into seekdb
- Vector Search: Semantic search over imported documents
- RAG Interface: Interactive Streamlit web interface for querying
The demo supports three embedding modes:
- `default`: Uses pyseekdb's built-in `DefaultEmbeddingFunction` (ONNX-based, 384 dimensions). No API key required; models are downloaded automatically on first use.
- `local`: Uses sentence-transformers models (e.g., all-mpnet-base-v2, 768 dimensions). Requires installing the sentence-transformers library.
- `api`: Uses OpenAI-compatible Embedding API services (e.g., DashScope, OpenAI). Requires API key configuration.
- Python 3.11 or higher installed
- uv package manager installed
- An LLM API key ready
By default, the commands below assume you are in demo/rag. The demo is configured to use the local pyseekdb source from this repository via uv workspace. If you prefer to stay at the repository root, use these workspace commands.
Sync dependencies (workspace):

```bash
uv sync --project demo/rag
```

Install the local embedding extra for the demo:

```bash
uv sync --project demo/rag --extra local
```

Run the demo from the repo root:

```bash
uv run --project demo/rag streamlit run demo/rag/seekdb_app.py
```

Verify the demo uses the local pyseekdb source:

```bash
uv run --project demo/rag python -c "import os, pyseekdb; print(os.path.abspath(pyseekdb.__file__))"
```

The printed path should point to `src/pyseekdb` in this repository.

You can also run `make demo` from the repository root.
Note: The commands in this section assume you are in `demo/rag`. If you are running from the repository root, use the workspace workflow above.
Basic installation (for `default` or `api` embedding types):

```bash
uv sync
```

With local embedding support (for the `local` embedding type):

```bash
uv sync --extra local
```

Note:
- The `local` extra includes `sentence-transformers` and related dependencies (~2-3 GB).
- If you experience slow download speeds, you can use mirror sources to accelerate:
  - Basic installation (Tsinghua mirror): `uv sync --index-url https://pypi.tuna.tsinghua.edu.cn/simple`
  - Basic installation (Aliyun mirror): `uv sync --index-url https://mirrors.aliyun.com/pypi/simple`
  - Local model (Tsinghua mirror): `uv sync --extra local --index-url https://pypi.tuna.tsinghua.edu.cn/simple`
  - Local model (Aliyun mirror): `uv sync --extra local --index-url https://mirrors.aliyun.com/pypi/simple`
Step 1: Copy the environment variable template

```bash
cp .env.example .env
```

Step 2: Edit the `.env` file and configure environment variables
The system supports three types of Embedding functions. You can choose based on your needs:
1. `default` (Default, recommended for beginners)
   - Uses pyseekdb's built-in `DefaultEmbeddingFunction` (based on ONNX)
   - Automatically downloads the model on first use; no API Key configuration required
   - Suitable for local development and testing
2. `local` (Local model)
   - Uses custom sentence-transformers models
   - Requires installing the `sentence-transformers` library
   - Configurable model name and device (CPU/GPU)
3. `api` (API service)
   - Uses an OpenAI-compatible Embedding API (such as DashScope, OpenAI, etc.)
   - Requires API Key and model name configuration
   - Suitable for production environments
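The selection between these three types can be sketched in plain Python. This is an illustrative sketch only: the function name, return strings, and validation here are hypothetical and do not reproduce the demo's actual classes.

```python
# Hypothetical sketch of how an app might pick an embedding backend based on
# EMBEDDING_FUNCTION_TYPE. The backend descriptions are placeholders, not the
# demo's real objects.
def pick_embedding_backend(env: dict) -> str:
    kind = env.get("EMBEDDING_FUNCTION_TYPE", "default")
    if kind == "default":
        # Built-in ONNX model: no API key, downloaded on first use.
        return "pyseekdb DefaultEmbeddingFunction (ONNX, 384-dim)"
    if kind == "local":
        # Local sentence-transformers model, configurable via env var.
        model = env.get("SENTENCE_TRANSFORMERS_MODEL_NAME", "all-mpnet-base-v2")
        return f"sentence-transformers model {model}"
    if kind == "api":
        # API mode needs a key, mirroring the "Required Condition" column below.
        if not env.get("EMBEDDING_API_KEY"):
            raise ValueError("EMBEDDING_API_KEY is required when EMBEDDING_FUNCTION_TYPE=api")
        return f"OpenAI-compatible API model {env.get('EMBEDDING_MODEL_NAME', 'text-embedding-v4')}"
    raise ValueError(f"unknown embedding type: {kind}")

print(pick_embedding_backend({}))
print(pick_embedding_backend({"EMBEDDING_FUNCTION_TYPE": "local"}))
```

Note how the `api` branch fails fast when the key is missing, which matches the configuration rules described in this section.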
The following example uses Qwen (with the `api` type):

```bash
# Embedding Function type: api, local, default
EMBEDDING_FUNCTION_TYPE=api

# LLM configuration (for generating answers)
OPENAI_API_KEY=sk-your-dashscope-key
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENAI_MODEL_NAME=qwen-plus

# Embedding API configuration (required only when EMBEDDING_FUNCTION_TYPE=api)
EMBEDDING_API_KEY=sk-your-dashscope-key
EMBEDDING_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
EMBEDDING_MODEL_NAME=text-embedding-v4

# Local model configuration (required only when EMBEDDING_FUNCTION_TYPE=local)
SENTENCE_TRANSFORMERS_MODEL_NAME=all-mpnet-base-v2
SENTENCE_TRANSFORMERS_DEVICE=cpu

# seekdb configuration
SEEKDB_DIR=./data/seekdb_rag
SEEKDB_NAME=test
COLLECTION_NAME=embeddings
```

Environment Variable Reference:
| Variable Name | Description | Default/Example Value | Required Condition |
|---|---|---|---|
| EMBEDDING_FUNCTION_TYPE | Embedding function type | default (options: api, local, default) | Required |
| OPENAI_API_KEY | LLM API Key (supports OpenAI, Qwen, etc.) | Must be set | Required (for generating answers) |
| OPENAI_BASE_URL | LLM API base URL | https://dashscope.aliyuncs.com/compatible-mode/v1 | Optional |
| OPENAI_MODEL_NAME | Language model name | qwen-plus | Optional |
| EMBEDDING_API_KEY | Embedding API Key | - | Required when EMBEDDING_FUNCTION_TYPE=api |
| EMBEDDING_BASE_URL | Embedding API base URL | https://dashscope.aliyuncs.com/compatible-mode/v1 | Optional when EMBEDDING_FUNCTION_TYPE=api |
| EMBEDDING_MODEL_NAME | Embedding model name | text-embedding-v4 | Required when EMBEDDING_FUNCTION_TYPE=api |
| SENTENCE_TRANSFORMERS_MODEL_NAME | Local model name | all-mpnet-base-v2 | Optional when EMBEDDING_FUNCTION_TYPE=local |
| SENTENCE_TRANSFORMERS_DEVICE | Device to run on | cpu | Optional when EMBEDDING_FUNCTION_TYPE=local |
| SEEKDB_DIR | seekdb database directory | ./data/seekdb_rag | Optional |
| SEEKDB_NAME | Database name | test | Optional |
| COLLECTION_NAME | Collection name | embeddings | Optional |
Tip:
- If using the `default` type, only configure `EMBEDDING_FUNCTION_TYPE=default` and the LLM-related settings
- If using the `api` type, the Embedding API variables must also be configured
- If using the `local` type, install the `sentence-transformers` library and optionally configure the model name
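Reading this configuration with the defaults from the reference table can be sketched as follows. The variable names match the `.env` template above; the `Config` dataclass and `load_config` helper are hypothetical, introduced only for illustration.

```python
import os
from dataclasses import dataclass

# Illustrative sketch: load the demo's seekdb settings with the defaults
# listed in the reference table. Not the demo's actual config code.
@dataclass
class Config:
    embedding_type: str
    seekdb_dir: str
    seekdb_name: str
    collection_name: str

def load_config(env=os.environ) -> Config:
    return Config(
        embedding_type=env.get("EMBEDDING_FUNCTION_TYPE", "default"),
        seekdb_dir=env.get("SEEKDB_DIR", "./data/seekdb_rag"),
        seekdb_name=env.get("SEEKDB_NAME", "test"),
        collection_name=env.get("COLLECTION_NAME", "embeddings"),
    )

cfg = load_config({})
print(cfg.collection_name)  # embeddings
```

Every variable marked Optional in the table falls back to its default, so an empty environment still yields a usable configuration.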
We use pyseekdb's SDK documentation as an example. You can also use your own Markdown documents or directory.
Import Data:

Run the data import script:

```bash
# Import a single file
uv run python seekdb_insert.py ../../README.md

# Or import all markdown files from a directory
uv run python seekdb_insert.py path/to/your_dir
```

Import Instructions:
During this step, the system will perform the following operations:
- Read the specified Markdown file or all Markdown files in the directory
- Split documents into text chunks by headers (using the `#` separator)
- Select the appropriate embedding function based on `EMBEDDING_FUNCTION_TYPE` configured in `.env`:
  - `default`: Uses pyseekdb's built-in `DefaultEmbeddingFunction` (automatically downloads the model on first use)
  - `local`: Uses a custom sentence-transformers model
  - `api`: Uses the configured Embedding API service
- Automatically generate text embedding vectors
- Store embedding vectors in seekdb database
- Automatically skip failed document chunks to ensure batch processing stability
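The header-based splitting step above can be sketched in a few lines. This is a simplified illustration of the described behavior; the actual `seekdb_insert.py` may differ in details such as header levels or chunk size limits.

```python
# Illustrative sketch: split a Markdown document into chunks at lines that
# start with '#', mirroring the header-based splitting described above.
def split_by_headers(markdown: str) -> list[str]:
    chunks, current = [], []
    for line in markdown.splitlines():
        # A new header closes the previous chunk (if any) and starts a new one.
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

doc = "# Intro\nhello\n# Usage\nrun it\nmore text"
for chunk in split_by_headers(doc):
    print(repr(chunk))
```

Each resulting chunk (a header plus its body) is then embedded and stored as one row in the collection.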
Launch the application via Streamlit:

```bash
uv run streamlit run seekdb_app.py
```

After launching, you can access the RAG interface in your browser to query your data.
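Conceptually, the interface runs a retrieve-then-generate loop: fetch the most relevant chunks, assemble them into a prompt, and send that to the LLM. The sketch below only illustrates this flow; `retrieve` (with its toy word-overlap scoring) and `build_prompt` are stand-ins, not the demo's real functions, which use vector search in seekdb.

```python
# Conceptual RAG loop sketch. The scoring here is a toy word-overlap ranking
# standing in for the real vector search against seekdb.
def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    words = set(question.lower().split())
    scored = sorted(chunks, key=lambda c: -len(words & set(c.lower().split())))
    return scored[:top_k]

def build_prompt(question: str, context: list[str]) -> str:
    # Ground the LLM in the retrieved chunks only.
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

chunks = ["seekdb stores embeddings", "streamlit renders the UI", "uv manages deps"]
question = "how are embeddings stored"
prompt = build_prompt(question, retrieve(question, chunks))
print(prompt.splitlines()[0])
```

In the real app, the assembled prompt is sent to the model configured via `OPENAI_MODEL_NAME`, and the answer is rendered in the Streamlit page.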
Tip: When using the `uv` package manager, prefix commands with `uv run` to ensure the correct Python environment and dependencies are used.