An AI-powered health and beauty product recommendation chatbot built with a Retrieval Augmented Generation (RAG) pipeline. DermaGPT helps users discover skincare, hair care, and vitamin/supplement products from a curated catalog of ~3,000 products, while also answering general wellness questions.
- Features
- Architecture
- Tech Stack
- Project Structure
- Getting Started
- Environment Variables
- Usage
- Data Pipeline
- RAG Pipeline Details
- Supported Brands
- Docker Deployment
- Limitations
- Product Recommendations - Find skincare, hair care, and vitamin/supplement products with natural language queries
- Smart Filtering - Filter by price range, brand, rating, and product category using conversational queries
- Dual Vector Store - FAISS (local) + Pinecone (cloud) for robust, redundant retrieval
- General Wellness Q&A - Answers health and beauty questions using web search (BraveSearch) or LLM knowledge
- Conversation Memory - Maintains chat context for follow-up questions within a session
- Graceful Fallbacks - Every component has a fallback path (see Fallback Strategy)
              User Query
              │
              ▼
    ┌──────────────────────┐
    │ Query Classification │ ← LLM-based (GPT-4o) with keyword fallback
    │ "product" / "general"│
    └─────────┬────────────┘
              │
      ┌───────┴───────┐
      ▼               ▼
  ┌────────┐    ┌────────────┐
  │Product │    │ General    │
  │Pipeline│    │ Pipeline   │
  └───┬────┘    └─────┬──────┘
      │               │
      ▼               ▼
┌────────────┐ ┌──────────────┐
│Category    │ │BraveSearch   │
│Detection   │ │      or      │
│+ Filter    │ │LLM Knowledge │
│Extraction  │ └──────┬───────┘
└─────┬──────┘        │
      ▼               │
┌──────────────────┐  │
│ Dual Retrieval   │  │
│ FAISS + Pinecone │  │
│ (Ensemble)       │  │
└─────┬────────────┘  │
      ▼               ▼
┌──────────────────────────┐
│ GPT-4o Response Gen      │
│ + Conversation Memory    │
└──────────────────────────┘
| Layer | Technology | Purpose |
|---|---|---|
| LLM | OpenAI GPT-4o | Query classification, response generation |
| Embeddings (primary) | OpenAI text-embedding-3-small (1536 dim) | Document & query vectorization |
| Embeddings (fallback) | sentence-transformers/all-MiniLM-L6-v2 (384 dim) | Free local alternative |
| Vector Store (local) | FAISS | Local similarity search |
| Vector Store (cloud) | Pinecone (serverless, AWS us-east-1) | Cloud-based vector DB |
| RAG Framework | LangChain 0.3.20 | Orchestration, retrievers, chains |
| Web Search | BraveSearch API | General query augmentation |
| Frontend | Streamlit | Chat-based web UI |
| Observability | LangSmith | LangChain tracing & debugging |
| Containerization | Docker + Docker Compose | Deployment |
DermaGPT/
├── main.py # Streamlit app + full RAG pipeline (production)
├── main.ipynb # Development notebook (data preprocessing + RAG prototyping)
├── requirements.txt # Python dependencies (145 packages)
├── Dockerfile # Python 3.13-slim container
├── docker-compose.yml # Service orchestration
├── .env # API keys (not committed)
├── .gitignore
├── assignment_details.pdf # Project requirements spec
│
├── data.csv # Original dataset (~3,000 products, 12 MB)
├── df_skin.csv # Preprocessed skincare products (2,079 rows)
├── df_hair.csv # Preprocessed hair care products (599 rows)
├── df_vits_supp.csv # Preprocessed vitamins/supplements (302 rows)
│
└── faiss_index/ # FAISS vector store (auto-generated, gitignored)
- Python 3.13+
- An OpenAI API key (required)
- A Pinecone API key (optional, falls back to FAISS-only)
- A BraveSearch API key (optional, falls back to LLM knowledge)
# Clone the repository
git clone git@github.com:yash2002vardhan/DermaGPT.git
cd DermaGPT
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Create .env file (see Environment Variables section)
# Run the app
streamlit run main.py

The app will be available at http://localhost:8501.
# Build and run
docker compose up --build
# Access at http://localhost:8501

Create a .env file in the project root:
OPENAI_API_KEY=sk-... # Required - powers GPT-4o and embeddings
PINECONE_API_KEY=pc-... # Optional - enables cloud vector store
BRAVE_API_KEY=BSA... # Optional - enables web search for general queries

| Variable | Required | Impact if Missing |
|---|---|---|
| OPENAI_API_KEY | Yes | App will not function |
| PINECONE_API_KEY | No | Falls back to FAISS-only retrieval |
| BRAVE_API_KEY | No | General queries use LLM knowledge instead of web search |
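The table above can be read as a small feature-resolution step at startup. The following is a minimal sketch of that logic, not code taken from main.py; the function name `resolve_features` and the returned dictionary keys are assumptions made for illustration.

```python
import os


def resolve_features(env=None):
    """Map the three API keys to the features they enable (hypothetical helper)."""
    env = os.environ if env is None else env

    # OPENAI_API_KEY is the only hard requirement.
    if not env.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is required")

    return {
        # Without a Pinecone key, retrieval is FAISS-only.
        "vector_stores": ["faiss", "pinecone"] if env.get("PINECONE_API_KEY") else ["faiss"],
        # Without a Brave key, general queries rely on LLM knowledge.
        "general_answers": "brave_search" if env.get("BRAVE_API_KEY") else "llm_knowledge",
    }
```

Passing a plain dictionary instead of `os.environ` makes the behaviour easy to check without touching the real environment.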
Ask about products using natural language. The system extracts category, price, brand, and rating filters automatically.
"Find me a sunscreen under 500 rupees"
"Hair shampoo for hairfall between 300-800 rs"
"Abbott brand skincare products"
"Vitamin B complex rated above 4 stars"
"Best moisturizer for acne-prone skin"
"How to treat acne naturally?"
"Why do I have dandruff?"
"What vitamins help with hair growth?"
"Tips for glowing skin"
The chatbot maintains conversation context, so follow-up questions like "show me cheaper options" or "any alternatives?" work as expected.
The original dataset (data.csv) contains ~3,000 health and beauty products with fields including product title, vendor, category, price (INR), brand, description, FAQs, ingredients, key benefits, concerns, and review ratings.
- Load & Clean - Rename columns, drop fields with >50% missing values
- Extract Metadata - Parse structured fields (FAQs from JSON, benefits lists, ingredient lists, concern tags)
- Categorize - Split into three category-specific CSVs:
  - df_skin.csv - 2,079 skincare products
  - df_hair.csv - 599 hair care products
  - df_vits_supp.csv - 302 vitamins & supplements
- Chunk & Embed - Text splitting (500 char chunks, 100 char overlap) followed by vectorization
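To make the chunking parameters concrete, here is a simplified fixed-window splitter using the stated 500-character chunks and 100-character overlap. This README does not say which text splitter main.py uses, so treat this as an illustration of the parameters only; a LangChain splitter would typically also try to break on natural boundaries such as paragraphs. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into character windows where neighbours share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    step = chunk_size - overlap  # each new chunk starts 400 characters after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks
```

With these defaults, a 1,200-character description yields three chunks starting at offsets 0, 400, and 800.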
Incoming queries are classified as "product" or "general" using GPT-4o. If the LLM classification fails, a keyword-based fallback triggers on terms like: buy, recommend, price, brand, product, suggest, sunscreen, shampoo, serum, etc.
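A minimal sketch of this two-stage classification, assuming the LLM call is wrapped in a callable that returns a label string. The names `classify_query` and `llm_classify` are hypothetical, and the keyword list contains only the terms named above (the real list is longer).

```python
import re

# Subset of the fallback keywords listed in this README.
PRODUCT_KEYWORDS = ("buy", "recommend", "price", "brand", "product",
                    "suggest", "sunscreen", "shampoo", "serum")
_PRODUCT_RE = re.compile(r"\b(?:" + "|".join(PRODUCT_KEYWORDS) + r")", re.IGNORECASE)


def classify_query(query, llm_classify=None):
    """Return "product" or "general", preferring the LLM label when it is usable."""
    if llm_classify is not None:
        try:
            label = llm_classify(query).strip().lower()
            if label in ("product", "general"):
                return label
        except Exception:
            pass  # any LLM failure falls through to the keyword check
    return "product" if _PRODUCT_RE.search(query) else "general"
```

The pattern matches keyword prefixes, so "recommendations" and "products" also count as product terms.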
For product queries, regex patterns detect the relevant category:
- Skin: acne, pimple, moisturizer, sunscreen, wrinkles, dark spots, etc.
- Hair: hairfall, dandruff, shampoo, conditioner, split ends, frizz, etc.
- Vitamins: biotin, collagen, multivitamin, omega, zinc, iron, etc.
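The category detection described above could look roughly like the sketch below. The keyword lists are limited to the terms named in this README, the category keys reuse the Pinecone namespace names, and `detect_categories` is a hypothetical name rather than the function used in main.py.

```python
import re

CATEGORY_KEYWORDS = {
    "skin": ["acne", "pimple", "moisturizer", "sunscreen", "wrinkle", "dark spot"],
    "hair": ["hairfall", "dandruff", "shampoo", "conditioner", "split end", "frizz"],
    "vitamins_supplements": ["biotin", "collagen", "multivitamin", "omega", "zinc", "iron"],
}

# One compiled pattern per category; \b avoids matching "iron" inside "environment".
CATEGORY_PATTERNS = {
    name: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")", re.IGNORECASE)
    for name, words in CATEGORY_KEYWORDS.items()
}


def detect_categories(query):
    """Return every category whose keywords appear in the query."""
    return [name for name, pattern in CATEGORY_PATTERNS.items() if pattern.search(query)]
```

A query can match more than one category, in which case a caller would presumably search each matching store.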
Natural language filters are parsed via regex:
- Price: "under 500 rupees", "between 300-800 rs", "less than 1000"
- Brand: Matched against 300+ indexed brands
- Rating: "rated above 4", "4.5 stars"
An Ensemble Retriever combines results from both stores with equal weighting:
| Store | Type | Index | Dimensions |
|---|---|---|---|
| FAISS | Local | faiss_index/ directory | 1536 (OpenAI) or 384 (HuggingFace) |
| Pinecone | Cloud (serverless) | clinikally-rag-2 | 1536 |
Pinecone uses three namespaces: skin, hair, vitamins_supplements. Both stores are populated from the preprocessed CSVs on first initialization.
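To illustrate what equal weighting means when merging two ranked result lists, the sketch below applies weighted Reciprocal Rank Fusion, the approach LangChain's EnsembleRetriever is based on. It is a plain-Python reimplementation for explanation only; the product IDs are invented, and `fuse_rankings` is a hypothetical name.

```python
from collections import defaultdict


def fuse_rankings(ranked_lists, weights=None, c=60):
    """Merge ranked ID lists with weighted Reciprocal Rank Fusion."""
    if weights is None:
        # Equal weighting: each store contributes the same share.
        weights = [1 / len(ranked_lists)] * len(ranked_lists)

    scores = defaultdict(float)
    for doc_ids, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(doc_ids, start=1):
            # Higher-ranked documents add more; c dampens the gap between ranks.
            scores[doc_id] += weight / (rank + c)

    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


faiss_hits = ["p1", "p2", "p3"]      # invented IDs, best match first
pinecone_hits = ["p2", "p4", "p1"]
merged = fuse_rankings([faiss_hits, pinecone_hits])
```

Documents returned by both stores accumulate score from each list, so they tend to rise above documents found by only one store.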
- Product queries: Top 8 documents are formatted with title, price (INR), description, and category, then passed to GPT-4o for a numbered recommendation list
- General queries: BraveSearch fetches 5 web results (or LLM knowledge as fallback), combined with a health disclaimer
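The product-formatting step might look something like the following sketch, which renders at most eight retrieved products as a numbered context block for the prompt. The dictionary keys and the `format_products` name are assumptions; the actual documents in main.py are presumably LangChain Document objects with their own metadata layout.

```python
def format_products(docs, limit=8):
    """Render up to `limit` products as numbered context lines for the LLM prompt."""
    lines = []
    for i, doc in enumerate(docs[:limit], start=1):
        lines.append(f"{i}. {doc['title']} | INR {doc['price']} | {doc['category']}")
        lines.append(f"   {doc['description']}")
    return "\n".join(lines)
```

Truncating before formatting is what keeps the response capped at eight recommendations even when retrieval returns more.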
ConversationBufferMemory (LangChain) stores the full chat history per session, enabling contextual follow-ups. Memory is session-scoped via Streamlit's st.session_state.
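The session scoping can be pictured with a plain mapping standing in for st.session_state. This is a simplified stand-in for explanation, not the ConversationBufferMemory API; the key name `chat_history` and the helper names are hypothetical.

```python
def get_history(session_state):
    """Return this session's history list, creating it on first access."""
    if "chat_history" not in session_state:
        session_state["chat_history"] = []
    return session_state["chat_history"]


def record_turn(session_state, user_msg, bot_msg):
    """Append one user/assistant exchange to the session history."""
    history = get_history(session_state)
    history.append(("user", user_msg))
    history.append(("assistant", bot_msg))


def render_history(session_state):
    """Flatten the history into text that can be prepended to the next prompt."""
    return "\n".join(f"{role}: {text}" for role, text in get_history(session_state))
```

Because each browser session gets its own state object, two users never see each other's history, and the history disappears when the session ends.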
| Component | Primary | Fallback |
|---|---|---|
| Query Classification | GPT-4o LLM call | Keyword pattern matching |
| Vector Store | Pinecone (cloud) | FAISS (local) |
| Retrieval | Ensemble (FAISS + Pinecone) | Individual retrievers |
| Web Search | BraveSearch API | LLM general knowledge |
| Embeddings | OpenAI text-embedding-3-small | HuggingFace all-MiniLM-L6-v2 |
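Each row of the table follows the same try-primary-then-fallback shape. A generic sketch of that pattern is shown below; `with_fallback` and the two example functions are hypothetical and only stand in for the real components.

```python
def with_fallback(primary, fallback, *args, **kwargs):
    """Call `primary`; if it raises, call `fallback` with the same arguments."""
    try:
        return primary(*args, **kwargs)
    except Exception:
        return fallback(*args, **kwargs)


def broken_web_search(query):
    # Stand-in for a BraveSearch call that fails (missing key, network error).
    raise ConnectionError("search unavailable")


def llm_knowledge(query):
    # Stand-in for answering from the LLM's general knowledge.
    return f"LLM answer for: {query}"


answer = with_fallback(broken_web_search, llm_knowledge, "Why do I have dandruff?")
```

Catching a broad `Exception` keeps the chatbot responsive, at the cost of hiding the underlying error unless it is also logged.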
The system indexes 300+ brands for brand-specific filtering, including:
- Indian Pharma: Abbott, Cipla, Sun Pharma, Intas, Lupin, Glenmark
- International Dermatology: Galderma, La Roche-Posay, Bioderma, ISDIN, Sesderma
- Consumer Skincare: Neutrogena, Aveeno, CeraVe, Cetaphil, The Ordinary
- K-Beauty: COSRX, Some By Mi, Belif, The Face Shop, KAHI
See the full brand list in the select_retrievers() function in main.py.
- Base image: python:3.13-slim
- Installs build-essential for compiled dependencies
- Exposes port 8501 (Streamlit default)
- Runs streamlit run main.py --server.address=0.0.0.0
- Mounts project directory as volume (enables hot reload during development)
- Loads environment variables from .env
- Restart policy: unless-stopped
# Start
docker compose up --build -d
# View logs
docker compose logs -f app
# Stop
docker compose down

- Product catalog is static - ~3,000 products with potentially outdated pricing; no live inventory integration
- Three categories only - Skincare, hair care, and vitamins/supplements
- Session-only memory - Conversation history is not persisted across sessions (no database backend)
- No authentication - No user accounts or personalized recommendations
- Hardcoded model - GPT-4o with non-configurable temperature/parameters
- Max 8 products per query - Response caps at 8 recommendations
- OpenAI dependency - Core functionality requires an OpenAI API key (paid)