ChromaDB Document Ingestion & Listing Demo

The full tutorial is available on the Upsun Devcenter.

A multi-language demonstration of document ingestion and web-based listing using ChromaDB, OpenAI embeddings, and modern web frameworks. This project showcases identical functionality implemented in both Python and Node.js/TypeScript.

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Markdown      │    │    ChromaDB      │    │   Web Apps      │
│   Documents     │────│   Collections    │────│  Python/Node.js │
│   (.md files)   │    │  (Embeddings)    │    │   (Listing UI)  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                             │
                    ┌────────▼────────┐
                    │   OpenAI API    │
                    │  (Embeddings)   │
                    └─────────────────┘

Features

  • Document Ingestion: Processes markdown files into semantic chunks
  • Vector Embeddings: Uses OpenAI's text-embedding-3-small model
  • ChromaDB Storage: Efficient vector database for similarity search
  • Web Interface: Clean UI showing ingested files and chunk counts
  • Multi-language: Identical functionality in Python and TypeScript
  • Cloud Ready: Configured for Upsun platform deployment
  • Environment Flexible: Supports both local and remote ChromaDB instances

Project Structure

├── python-app/           # Python Flask implementation
│   ├── main.py          # Web server (Flask)
│   ├── ingest.py        # Document processing script
│   ├── pyproject.toml   # Python dependencies
│   ├── data/            # Markdown files
│   └── .env.example     # Environment template
├── nodejs-app/          # Node.js TypeScript implementation  
│   ├── src/
│   │   ├── index.ts     # Web server (Express)
│   │   └── ingest.ts    # Document processing script
│   ├── package.json     # Node.js dependencies
│   ├── tsconfig.json    # TypeScript configuration
│   ├── data/            # Markdown files
│   └── .env.example     # Environment template
├── .upsun/
│   └── config.yaml      # Upsun platform configuration
└── README.md            # This file

Technology Stack

Python App

  • Flask - Web framework
  • ChromaDB - Vector database client
  • OpenAI - Embedding generation
  • UV - Fast Python package manager

Node.js App

  • Express - Web framework
  • ChromaDB - Vector database client
  • OpenAI - Embedding generation
  • TypeScript - Type safety

Infrastructure

  • ChromaDB - Vector database server
  • Upsun - Cloud platform
  • OpenAI API - Embedding service

Environment Configuration

Both applications support flexible ChromaDB connections via environment variables:

# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here

# ChromaDB Configuration (leave empty for local instance)
CHROMA_HOST=                    # e.g., chroma.example.com
CHROMA_PORT=8000               # Default: 8000
CHROMA_SSL=false               # true for HTTPS
CHROMA_AUTH_TOKEN=             # Optional authentication
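
For reference, here is a minimal Python sketch of how these variables could select between a local and a remote ChromaDB client. The client calls follow the chromadb Python API; the helper name and the fallback to a local persistent store are assumptions about this project's setup, not the actual application code.

import os
import chromadb

def get_chroma_client():
    """Build a ChromaDB client from the environment variables above."""
    host = os.getenv("CHROMA_HOST", "").strip()
    if not host:
        # No host configured: fall back to a local persistent store.
        return chromadb.PersistentClient(path="./chroma_data")

    headers = {}
    token = os.getenv("CHROMA_AUTH_TOKEN", "")
    if token:
        headers["Authorization"] = f"Bearer {token}"

    return chromadb.HttpClient(
        host=host,
        port=int(os.getenv("CHROMA_PORT", "8000")),
        ssl=os.getenv("CHROMA_SSL", "false").lower() == "true",
        headers=headers,
    )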

Local Development

Python App

cd python-app

# Install dependencies
uv sync

# Set environment variables
cp .env.example .env
# Edit .env with your OpenAI API key

# Run ingestion (processes markdown files)
uv run python ingest.py

# Start web server
uv run python main.py
# Visit: http://localhost:5000

Node.js App

cd nodejs-app

# Install dependencies
npm ci

# Set environment variables  
cp .env.example .env
# Edit .env with your OpenAI API key

# Build TypeScript
npm run build

# Run ingestion (processes markdown files)
npm run ingest

# Start web server
npm run start
# Visit: http://localhost:3000

Deployment (Upsun)

This project is configured for deployment on the Upsun platform with three services:

  1. ChromaDB Service: Vector database server
  2. Python App: Flask web application
  3. Node.js App: Express web application

Deployment URLs

  • Python App: https://python.{your-domain}/
  • Node.js App: https://nodejs.{your-domain}/

Automatic Ingestion

Both applications automatically run their ingestion scripts during deployment, ensuring fresh data on every deploy.

How It Works

1. Document Processing

  • Reads all .md files from the data/ directory
  • Splits documents into overlapping chunks (1000 words, 200 word overlap)
  • Generates unique IDs using content hashing
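
A rough sketch of this chunking and ID scheme follows; the word counts match the description above, while the function name, hash length, and data directory handling are illustrative rather than the actual ingest.py code.

import hashlib
from pathlib import Path

def chunk_words(text, size=1000, overlap=200):
    """Split text into overlapping word-based chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

for path in Path("data").glob("*.md"):
    text = path.read_text(encoding="utf-8")
    for i, chunk in enumerate(chunk_words(text)):
        # Hash the chunk content so IDs stay stable for unchanged chunks.
        chunk_id = hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]
        print(path.name, i, chunk_id)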

2. Embedding Generation

  • Python: Manual OpenAI API calls with batching
  • Node.js: ChromaDB's built-in OpenAI embedding function
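
For the Python side, batched embedding calls might look roughly like the sketch below. The model name comes from the Features section; the batch size and helper name are assumptions.

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_batched(chunks, batch_size=100):
    """Embed text chunks in batches to keep request sizes manageable."""
    embeddings = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        resp = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        embeddings.extend(item.embedding for item in resp.data)
    return embeddings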

3. Vector Storage

  • Stores chunks with metadata (filename, filepath, chunk index)
  • Collections named python-app and nodejs-app respectively
  • Automatically clears existing data on re-ingestion
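
A hedged sketch of the storage step for the Python app: the collection name and metadata keys match the description above, while the delete-then-create pattern is one plausible way to implement "clears existing data" and the sample record is a placeholder.

import chromadb

client = chromadb.PersistentClient(path="./chroma_data")

# Recreate the collection so each ingestion starts from a clean slate.
try:
    client.delete_collection("python-app")
except Exception:
    pass  # collection may not exist yet

collection = client.create_collection("python-app")

# One record per chunk; real embeddings come from the batching step above.
collection.add(
    ids=["a1b2c3d4e5f6a7b8-0"],
    documents=["First chunk of example.md ..."],
    embeddings=[[0.01] * 1536],  # placeholder vector with text-embedding-3-small's dimension
    metadatas=[{"filename": "example.md", "filepath": "data/example.md", "chunk_index": 0}],
)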

4. Web Interface

  • Lists all processed files with chunk counts
  • Shows collection statistics (total files, total chunks)
  • Handles errors gracefully with helpful messages
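
A minimal Flask sketch of the listing view, assuming the metadata keys described above; the route, response shape, and error message are illustrative and the real main.py may render an HTML template instead of JSON.

from collections import Counter

import chromadb
from flask import Flask, jsonify

app = Flask(__name__)
client = chromadb.PersistentClient(path="./chroma_data")

@app.route("/")
def list_files():
    try:
        collection = client.get_collection("python-app")
    except Exception:
        return jsonify({"error": "Collection not found. Run ingest.py first."}), 500

    # Pull only metadata and count chunks per source file.
    records = collection.get(include=["metadatas"])
    counts = Counter(m["filename"] for m in records["metadatas"])
    return jsonify({
        "total_files": len(counts),
        "total_chunks": sum(counts.values()),
        "files": counts,
    })

if __name__ == "__main__":
    app.run(port=5000)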

Development Commands

Python App

# Type checking and linting (if available)
uv run python -m mypy .

# Run development server with auto-reload
uv run python main.py

Node.js App

# Type checking
npm run type-check

# Development mode with auto-reload
npm run dev

# Development ingestion (uses tsx)
npm run ingest:dev

Sample Data

The data/ directories contain sample markdown files covering various technical topics:

  • Advanced prompting techniques
  • Platform.sh build pipelines
  • Python development with UV
  • PyTorch deployment
  • Configuration as code
  • Environment management

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test both Python and Node.js implementations
  5. Submit a pull request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Troubleshooting

Common Issues

"Collection not found" error:

  • Run the ingestion script first: uv run python ingest.py (Python) or npm run ingest (Node.js)

OpenAI API errors:

  • Verify your OPENAI_API_KEY is set correctly
  • Check your OpenAI account has sufficient credits

ChromaDB connection issues:

  • For local development, ensure no CHROMA_HOST is set
  • For remote instances, verify CHROMA_HOST, CHROMA_PORT, and CHROMA_SSL settings

TypeScript compilation errors:

  • Run npm run build to compile TypeScript before using npm run start
  • Use npm run dev for development with auto-compilation
