This is a lightweight vector database and Retrieval-Augmented Generation (RAG) system built with Python for R users. It is designed to generate high-quality, context-grounded responses to challenging LLM prompts using your local R technical documentation.
This system:
- Ingests R documentation (`.md`, `.R`, `.Rmd`, `.qmd`) from a single directory
- Creates vector embeddings of the content and stores them in a `chromadb` vector database
- Provides a query interface to ask questions about the R packages
- Uses an LLM to generate responses based on retrieved context
Requirements:
- Python 3.8+
- Required Python packages:
- langchain
- langchain-community
- langchain-anthropic (for Claude 3.7 Sonnet integration)
- langchain-text-splitters
- chromadb
- sentence-transformers
- argparse (included in the Python standard library)
- glob (included in the Python standard library)
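Based on the list above, a `requirements.txt` for this project would resemble the following (a reconstruction for reference, not necessarily the repository's exact file):

```text
langchain
langchain-community
langchain-anthropic
langchain-text-splitters
chromadb
sentence-transformers
```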
Installation:
- Clone this repository:

```bash
git clone https://github.com/JavOrraca/Vector-DB-and-RAG-Maker.git
cd Vector-DB-and-RAG-Maker
```

- Install the required dependencies:

```bash
pip install -r requirements.txt
```

- Set your Anthropic API key:

```bash
export ANTHROPIC_API_KEY=your-api-key
```
First, ingest all your R-related files from a single directory:
```bash
python src/main.py ingest --content-dir ./data --output-dir ./chroma_db
```
This command will:
- Find all `.md`, `.R`, `.Rmd`, and `.qmd` files in the specified directory
- Process them appropriately based on file type
- Store them in a single unified Chroma vector database
After ingestion, you can query the system:
```bash
# Interactive mode
python src/main.py query --db-path ./chroma_db/r_knowledge_base

# Single query mode
python src/main.py query --db-path ./chroma_db/r_knowledge_base --question "How do I use dplyr's filter function?"
```
The system can ingest and process the following file types:
- Markdown (`.md`) - Documentation, READMEs, etc.
- R Files (`.R`) - R source code files
- R Markdown (`.Rmd`) - Mixed R code and markdown
- Quarto (`.qmd`) - Next-gen technical publishing framework (essentially the successor to `.Rmd`)
All files are processed appropriately based on their type and structure. If you need support for additional file types, please reach out to Javier.
The ingestion pipeline:
- Recursively finds all supported files in the specified directory
- Processes each file type appropriately:
  - Splits markdown files by headers and then into chunks
  - Splits R code files into chunks
  - Handles `.Rmd` and `.qmd` files intelligently, attempting to parse them as markdown first
- Creates vector embeddings for each chunk
- Stores the embeddings in a unified Chroma vector database
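To make the pipeline concrete, here is a condensed sketch using the LangChain and Chroma APIs from the requirements list. Function names and file paths here are illustrative, not the repository's actual code:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

def chunk_markdown(text: str):
    # Split by headers first so each chunk keeps its section context
    header_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
    )
    sections = header_splitter.split_text(text)

    # Then split each section into overlapping chunks
    chunker = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    return chunker.split_documents(sections)

# Embed the chunks and persist them in a unified Chroma database
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
db = Chroma.from_documents(
    documents=chunk_markdown(open("./data/example.md").read()),
    embedding=embeddings,
    persist_directory="./chroma_db/r_knowledge_base",
)
```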
The retrieval system:
- Takes a user question
- Searches the vector database for relevant context
- Combines the results
- Sends the most relevant context to an LLM
- Returns the LLM's response
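A minimal sketch of this flow, again with illustrative names rather than the repository's exact implementation (the model ID shown is Claude 3.7 Sonnet's public identifier at the time of writing):

```python
from langchain_anthropic import ChatAnthropic
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Reopen the persisted database with the same embedding model used at ingestion
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
db = Chroma(
    persist_directory="./chroma_db/r_knowledge_base",
    embedding_function=embeddings,
)

def answer(question: str) -> str:
    # Retrieve the chunks most similar to the question
    docs = db.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)

    # Send the question plus the retrieved context to the LLM
    llm = ChatAnthropic(model="claude-3-7-sonnet-20250219")
    prompt = (
        f"Answer the question using only this context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm.invoke(prompt).content
```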
By default, the system uses the `sentence-transformers/all-MiniLM-L6-v2` model for embeddings. You can modify this in the code to use other models.
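Swapping in a different sentence-transformers model is a one-line change; the model below is just one common alternative:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings

# all-mpnet-base-v2 trades embedding speed for better retrieval quality
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
```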
The system is configured to use Anthropic's Claude 3.7 Sonnet, but you can modify it to use other LLMs supported by LangChain.
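For example, switching to an OpenAI model would look roughly like this (note that `langchain-openai` is not in the requirements list above and would need to be installed separately, along with an `OPENAI_API_KEY`):

```python
from langchain_openai import ChatOpenAI

# Any chat model supported by LangChain can stand in for Claude here
llm = ChatOpenAI(model="gpt-4o")
```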
You can adjust the chunking parameters in the code to better suit your needs:
- `chunk_size`: The size of each text chunk
- `chunk_overlap`: The amount of overlap between consecutive chunks
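Assuming the splitter shown in the ingestion sketch above, the adjustment is straightforward; larger chunks keep more context per embedding, while more overlap reduces the chance of splitting an idea across chunk boundaries:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=300)
```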