This is a lightweight vector database and Retrieval-Augmented Generation (RAG) system built with Python for R users. It is designed to generate high-quality, context-grounded responses to challenging LLM prompts using your local R technical documentation.
This system:
- Ingests R documentation (`.md`, `.R`, `.Rmd`, `.qmd`) from a single directory
- Creates vector embeddings of the content and stores them in a `chromadb` vector database
- Provides a query interface to ask questions about the R packages
- Uses an LLM to generate responses based on retrieved context
Requirements:
- Python 3.8+
- Required Python packages:
- langchain
- langchain-community
- langchain-anthropic (for Claude 3.7 Sonnet integration)
- langchain-text-splitters
- chromadb
- sentence-transformers
- argparse (included in the Python standard library)
- glob (included in the Python standard library)
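Based on the list above, a `requirements.txt` for this project would resemble the following (a reconstruction for reference, not necessarily the repository's exact file):

```text
langchain
langchain-community
langchain-anthropic
langchain-text-splitters
chromadb
sentence-transformers
```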
Installation:
- Clone this repository:

```bash
git clone https://github.com/JavOrraca/Vector-DB-and-RAG-Maker.git
cd Vector-DB-and-RAG-Maker
```

- Install the required dependencies:

```bash
pip install -r requirements.txt
```

- Set your Anthropic API key:

```bash
export ANTHROPIC_API_KEY=your-api-key
```
First, ingest all your R-related files from a single directory:
```bash
python src/main.py ingest --content-dir ./data --output-dir ./chroma_db
```
This command will:
- Find all `.md`, `.R`, `.Rmd`, and `.qmd` files in the specified directory
- Process them appropriately based on file type
- Store them in a single unified Chroma vector database
After ingestion, you can query the system:
```bash
# Interactive mode
python src/main.py query --db-path ./chroma_db/r_knowledge_base

# Single query mode
python src/main.py query --db-path ./chroma_db/r_knowledge_base --question "How do I use dplyr's filter function?"
```
The system can ingest and process the following file types:
- Markdown (`.md`) - Documentation, READMEs, etc.
- R Files (`.R`) - R source code files
- R Markdown (`.Rmd`) - Mixed R code and markdown
- Quarto (`.qmd`) - Next-gen technical publishing framework (essentially the successor to `.Rmd`)
All files are processed appropriately based on their type and structure. If you need support for additional file types, please reach out to Javier.
The ingestion pipeline:
- Recursively finds all supported files in the specified directory
- Processes each file type appropriately:
  - Splits markdown files by headers and then into chunks
  - Splits R code files into chunks
  - Handles `.Rmd` and `.qmd` files intelligently, attempting to parse them as markdown first
- Creates vector embeddings for each chunk
- Stores the embeddings in a unified Chroma vector database
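To make the pipeline concrete, here is a condensed sketch using the LangChain and Chroma APIs from the requirements list. Function names and file paths here are illustrative, not the repository's actual code:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

def chunk_markdown(text: str):
    # Split by headers first so each chunk keeps its section context
    header_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
    )
    sections = header_splitter.split_text(text)

    # Then split each section into overlapping chunks
    chunker = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    return chunker.split_documents(sections)

# Embed the chunks and persist them in a unified Chroma database
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
db = Chroma.from_documents(
    documents=chunk_markdown(open("./data/example.md").read()),
    embedding=embeddings,
    persist_directory="./chroma_db/r_knowledge_base",
)
```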
The retrieval system:
- Takes a user question
- Searches the vector database for relevant context
- Combines the results
- Sends the most relevant context to an LLM
- Returns the LLM's response
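A minimal sketch of this flow, again with illustrative names rather than the repository's exact implementation (the model ID shown is Claude 3.7 Sonnet's public identifier at the time of writing):

```python
from langchain_anthropic import ChatAnthropic
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Reopen the persisted database with the same embedding model used at ingestion
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
db = Chroma(
    persist_directory="./chroma_db/r_knowledge_base",
    embedding_function=embeddings,
)

def answer(question: str) -> str:
    # Retrieve the chunks most similar to the question
    docs = db.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)

    # Send the question plus the retrieved context to the LLM
    llm = ChatAnthropic(model="claude-3-7-sonnet-20250219")
    prompt = (
        f"Answer the question using only this context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm.invoke(prompt).content
```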
By default, the system uses the `sentence-transformers/all-MiniLM-L6-v2` model for embeddings. You can modify this in the code to use other models.
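Swapping in a different sentence-transformers model is a one-line change; the model below is just one common alternative:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings

# all-mpnet-base-v2 trades embedding speed for better retrieval quality
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
```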
The system is configured to use Anthropic's Claude 3.7 Sonnet, but you can modify it to use other LLMs supported by LangChain.
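For example, switching to an OpenAI model would look roughly like this (note that `langchain-openai` is not in the requirements list above and would need to be installed separately, along with an `OPENAI_API_KEY`):

```python
from langchain_openai import ChatOpenAI

# Any chat model supported by LangChain can stand in for Claude here
llm = ChatOpenAI(model="gpt-4o")
```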
You can adjust the chunking parameters in the code to better suit your needs:
- `chunk_size`: The size of each text chunk
- `chunk_overlap`: The amount of overlap between consecutive chunks
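Assuming the splitter shown in the ingestion sketch above, the adjustment is straightforward; larger chunks keep more context per embedding, while more overlap reduces the chance of splitting an idea across chunk boundaries:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=300)
```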