DuckDB Spawn API

A FastAPI service that manages project data using DuckDB with dynamic schema support from an ontology server. The service is designed as a modular, self-contained data product following data mesh architecture principles.

Architecture Overview

graph TB
    subgraph "Client Layer"
        Client[API Clients]
    end

    subgraph "API Layer"
        FastAPI[FastAPI Service]
        Cache[Schema Cache]
        ConnMgr[Connection Manager]
    end

    subgraph "Data Layer"
        DuckDB[(DuckDB)]
    end

    subgraph "Schema Layer"
        OntoServer[Ontology Server]
        MockServer[Mock Server]
        SchemaAdapter[Schema Adapter]
    end

    Client --> FastAPI
    FastAPI --> ConnMgr
    ConnMgr --> DuckDB
    FastAPI --> SchemaAdapter
    SchemaAdapter --> |Primary| OntoServer
    SchemaAdapter --> |Fallback| MockServer
    SchemaAdapter --> |Cache| Cache
    Cache --> FastAPI

Deployment Architecture

graph TD
    subgraph Local Environment
        LocalDev[Local Development]
        Docker[Docker Container]
    end

    subgraph Infrastructure
        CI/CD[CI/CD Pipeline]
        Monitoring[Monitoring Tools]
    end

    subgraph Staging Environment
        KoyebStaging[Koyeb Staging]
    end

    subgraph Production Environment
        KoyebProd[Koyeb Production]
    end

    LocalDev --> Docker
    Docker --> CI/CD
    CI/CD --> KoyebStaging
    CI/CD --> KoyebProd
    KoyebStaging --> Monitoring
    KoyebProd --> Monitoring

Data Mesh Architecture Integration

DuckDB Spawn API is designed as a domain-oriented, self-contained data product within a data mesh architecture. It embodies the key principles of data mesh:

  1. Domain Ownership: Encapsulates the project financing data domain, with complete ownership of data storage, processing, and schema evolution
  2. Data as a Product: Provides well-defined APIs, documentation, and SLAs for data consumers
  3. Self-serve Data Platform: Uses infrastructure-as-code, containerization, and automated deployments for self-service capabilities
  4. Federated Computational Governance: Schema definitions from an ontology server implement federated governance while maintaining domain autonomy

Why This Matters

In a traditional data lake/warehouse architecture, this functionality might be implemented as tables in a central database, managed by a separate data team. In our data mesh approach:

  • Domain experts own both the code and data for their domain
  • Data schema evolves independently but adheres to organizational standards via the ontology server
  • The service is deployable independently without complex dependencies
  • Consumers interact with the data through APIs rather than direct database access

Architectural Decisions

Why DuckDB?

DuckDB was selected for several strategic reasons:

  • Analytical Performance: DuckDB excels at analytical queries, which align with project financing data needs
  • Embeddable Nature: No separate database server infrastructure required, simplifying deployment
  • Columnar Storage: Efficient storage and querying for project metrics and time-series data
  • Low Operational Overhead: Fits the data product model where each domain owns its complete stack
  • Schema Flexibility: Easily adaptable to changing schema requirements while maintaining performance

Why Dynamic Schema from Ontology Server?

The ontology server integration provides several benefits:

  • Schema Governance: Centralized schema definitions while maintaining domain autonomy
  • Evolution Control: Schema changes can be coordinated across multiple data products
  • Self-documenting API: API capabilities automatically reflect the current schema
  • Fallback Mechanism: Mock server ensures availability even when the ontology server is down
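
The primary, fallback, and cache paths described above can be sketched as follows. This is a minimal illustration, not the project's actual adapter: the `SchemaAdapter` class, its `get_schema` method, the two fetch callables, and the `ttl_seconds` parameter are all hypothetical names chosen for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple

Schema = Dict[str, Any]


class SchemaAdapter:
    """Hypothetical adapter: try the primary source, fall back to the mock, cache the result."""

    def __init__(
        self,
        fetch_primary: Callable[[str], Schema],
        fetch_mock: Callable[[str], Schema],
        ttl_seconds: float = 300.0,
    ) -> None:
        self._fetch_primary = fetch_primary
        self._fetch_mock = fetch_mock
        self._ttl = ttl_seconds
        # Maps schema name -> (expiry time on the monotonic clock, schema).
        self._cache: Dict[str, Tuple[float, Schema]] = {}

    def get_schema(self, name: str) -> Schema:
        cached = self._cache.get(name)
        if cached is not None and cached[0] > time.monotonic():
            return cached[1]
        try:
            schema = self._fetch_primary(name)
        except Exception:
            # Primary source unavailable: serve the mock definition instead.
            schema = self._fetch_mock(name)
        self._cache[name] = (time.monotonic() + self._ttl, schema)
        return schema
```

One design consequence worth noting: in this sketch a fallback result is cached for the full TTL, so the mock schema keeps being served until the entry expires, even if the ontology server recovers earlier. A shorter TTL for fallback results would be one way to soften that.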

Connection Management Design

The connection pool implementation provides:

  • Thread Safety: Ensures concurrent requests don't conflict when accessing the database
  • Resource Efficiency: Reuses connections to minimize overhead
  • Proper Cleanup: Ensures connections are properly closed to prevent resource leaks
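
As a rough illustration of these three properties, the sketch below guards a single shared connection with a lock, reuses it across requests, and closes it explicitly. It is an assumption-laden example rather than the contents of `connection_manager.py`: the `ConnectionManager` class and its `connect` factory parameter are hypothetical, and the factory is injected so that something like `lambda: duckdb.connect("data_product.db")` could be passed in.

```python
import threading
from contextlib import contextmanager
from typing import Any, Callable, Iterator


class ConnectionManager:
    """Hypothetical sketch: one lazily created connection, serialized by a lock."""

    def __init__(self, connect: Callable[[], Any]) -> None:
        self._connect = connect  # e.g. lambda: duckdb.connect("data_product.db")
        self._lock = threading.RLock()
        self._conn: Any = None

    @contextmanager
    def connection(self) -> Iterator[Any]:
        # The lock is held for the whole `with` block, so only one thread
        # uses the connection at a time.
        with self._lock:
            if self._conn is None:
                self._conn = self._connect()
            yield self._conn

    def close(self) -> None:
        # Close and forget the connection; a later call to connection() reopens it.
        with self._lock:
            if self._conn is not None:
                self._conn.close()
                self._conn = None
```

Serializing all access through one lock is the simplest way to get thread safety, at the cost of concurrency. Whether the project uses this approach, per-thread cursors, or a real pool is not stated in this README.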

Key Features

  • Domain-Oriented Data Product: Complete encapsulation of the project financing data domain
  • Dynamic Schema Management: Database tables are created and updated based on schemas from the ontology server
  • Connection Management: Thread-safe DuckDB connections with proper transaction handling
  • Mock Support: Built-in mock responses for development when the ontology server is unavailable
  • Health Monitoring: System metrics and health checks including ontology server status
  • Containerized Deployment: Self-contained, portable, and consistently deployable
  • Infrastructure as Code: Reproducible infrastructure defined in Pulumi
  • API-First Design: Clean, well-documented API endpoints for all operations

Configuration

Environment variables:

# Ontology Server Configuration
ONTO_SERVER_URL=http://localhost:8001
ONTO_SERVER_TIMEOUT=5
USE_MOCK_ONTO_SERVER=true  # Use mock responses for development

# Database Configuration
DUCKDB_PATH=data_product.db
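
A service might read these variables at startup along the following lines. This is only a sketch: the `Settings` dataclass and `load_settings` function are hypothetical names, the example values above are treated as defaults, and the timeout is assumed to be in seconds.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    onto_server_url: str
    onto_server_timeout: float  # assumed to be seconds
    use_mock_onto_server: bool
    duckdb_path: str


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    return Settings(
        onto_server_url=env.get("ONTO_SERVER_URL", "http://localhost:8001"),
        onto_server_timeout=float(env.get("ONTO_SERVER_TIMEOUT", "5")),
        use_mock_onto_server=_as_bool(env.get("USE_MOCK_ONTO_SERVER", "true")),
        duckdb_path=env.get("DUCKDB_PATH", "data_product.db"),
    )
```

Accepting the environment as a parameter keeps the loader easy to test without modifying `os.environ`.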

API Endpoints

Operations

  • POST /ops/projects: Create a new project
  • GET /ops/projects: List all projects
  • GET /ops/projects/{project_id}: Get project details
  • POST /ops/initialize: Initialize database with schema

Admin

  • POST /admin/tables: Create tables from schema
  • GET /admin/tables: List all tables
  • PUT /admin/tables/{table_name}: Update table schema
  • DELETE /admin/tables/{table_name}: Delete table
  • POST /admin/logging/level: Update logging level
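
To make "create tables from schema" more concrete, the sketch below turns a column mapping into a `CREATE TABLE` statement. The schema shape (column name mapped to SQL type), the helper names, and the example columns are assumptions for illustration; the format actually returned by the ontology server is not documented here.

```python
from typing import Dict


def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_table_sql(table: str, columns: Dict[str, str]) -> str:
    """Build a CREATE TABLE statement from a column-name -> SQL-type mapping.

    Type strings are inserted as-is, so they must come from a trusted schema source.
    """
    if not columns:
        raise ValueError("a table needs at least one column")
    cols = ", ".join(
        f"{quote_ident(col)} {sql_type}" for col, sql_type in columns.items()
    )
    return f"CREATE TABLE IF NOT EXISTS {quote_ident(table)} ({cols})"


# Hypothetical column names, used only to show the output shape.
print(create_table_sql("projects", {"project_id": "VARCHAR", "budget": "DOUBLE"}))
```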

Monitoring

  • GET /monitoring/health: System health status
  • GET /monitoring/metrics/system: System metrics
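
The endpoints listed above can be called from any HTTP client. The sketch below builds requests with the Python standard library; it assumes the service is running locally on uvicorn's default port 8000 and that request bodies are JSON. The `build_request` helper and the payload fields are hypothetical, since the project schema is defined by the ontology server.

```python
import json
from typing import Any, Optional
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: uvicorn's default host and port


def build_request(method: str, path: str, payload: Optional[Any] = None) -> Request:
    """Build a request for one of the API endpoints, JSON-encoding any payload."""
    data = None
    headers = {"Accept": "application/json"}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


# Hypothetical payload; the real fields depend on the current schema.
create_project = build_request("POST", "/ops/projects", {"name": "demo"})
health_check = build_request("GET", "/monitoring/health")
```

With the service running, `urllib.request.urlopen(health_check)` would send the health request.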

Development

  1. Clone the repository

  2. Create a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # Unix
    .venv\Scripts\activate     # Windows
  3. Install dependencies:

    pip install -r requirements.txt
  4. Run the application:

    uvicorn src.main:app --reload

GitHub Workflows

This project uses GitHub Actions for CI/CD pipelines, testing, and deployments. To use these workflows, you'll need to set up the following secrets:

Required Secrets

  • WORKFLOW_PAT: A GitHub Personal Access Token with the repo and workflow scopes. It is used by actions that need repository access, especially cross-repository checkouts.
  • DOCKER_HUB_USERNAME: Your Docker Hub username
  • DOCKER_HUB_ACCESS_TOKEN: Docker Hub access token for pushing images
  • KOYEB_API_TOKEN: API token for Koyeb deployments

To create a Personal Access Token (PAT):

  1. Go to GitHub Settings → Developer Settings → Personal access tokens → Tokens (classic)
  2. Generate a new token with at least the repo and workflow scopes
  3. Add this token as a repository secret named WORKFLOW_PAT

Testing

Run tests with pytest:

pytest

Project Structure

duckdb-spawn/
├── config/
│   ├── __init__.py
│   ├── onto_server.py           # Schema server interface
│   └── mock_onto_responses.py   # Mock responses
├── infrastructure/
│   ├── pulumi/                  # Infrastructure as Code
│   │   ├── __main__.py         # Main Pulumi program
│   │   ├── Pulumi.yaml         # Pulumi project file
│   │   └── Pulumi.dev.yaml     # Development stack configuration
│   ├── docker/
│   │   ├── Dockerfile          # Application container
│   │   └── docker-compose.yml  # Local development setup
│   └── monitoring/
│       ├── prometheus/
│       │   └── prometheus.yml  # Prometheus configuration
│       └── grafana/
│           └── dashboards/     # Grafana dashboard definitions
├── src/
│   ├── __init__.py
│   ├── main.py
│   ├── routes/
│   │   ├── admin.py           # Admin endpoints
│   │   ├── operations.py      # Project operations
│   │   └── monitoring.py      # Health checks
│   ├── database/
│   │   ├── connection_manager.py
│   │   └── schema.py
│   └── utils/
│       ├── logging_config.py
│       └── metrics.py
├── tests/
│   └── test_routes/
├── .github/
│   └── workflows/
│       ├── ci.yml             # CI pipeline
│       └── koyeb-deploy.yml   # Deployment pipeline
└── deployment/
    ├── staging/              # Staging environment configs
    └── production/           # Production environment configs

Infrastructure

Local Development

The project includes a Docker Compose setup for local development:

# Start local development environment
docker-compose -f infrastructure/docker/docker-compose.yml up -d

# View logs
docker-compose -f infrastructure/docker/docker-compose.yml logs -f

Infrastructure as Code

The project uses Pulumi for infrastructure management:

# Initialize Pulumi stack
cd infrastructure/pulumi
pulumi stack init dev

# Deploy infrastructure
pulumi up

# Destroy infrastructure
pulumi destroy

Monitoring Setup

The monitoring stack includes:

  • Prometheus for metrics collection
  • Grafana for visualization
  • Custom dashboards for DuckDB metrics

To deploy the monitoring stack:

cd infrastructure/monitoring
docker-compose up -d

CI/CD Pipeline

The project uses GitHub Actions for CI/CD:

  1. Continuous Integration:

    • Automated testing
    • Code quality checks
    • Container image building
  2. Continuous Deployment:

    • Automated deployment to Koyeb
    • Environment-specific configurations
    • Health check verification
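
A health check verification step could be scripted roughly as below: poll the health endpoint until it answers, and give up after a fixed number of attempts. This is a sketch, not the workflow's actual step. The function names are hypothetical, and treating HTTP 200 as "healthy" is an assumption about `GET /monitoring/health`.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_healthy(
    url: str,
    attempts: int = 10,
    delay: float = 3.0,
    get_status: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200; return False once all attempts are used."""
    for attempt in range(attempts):
        if get_status(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next attempt, but not after the last
    return False
```

In a pipeline, a `False` result would typically fail the deployment job, for example by exiting with a non-zero status.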

Environment Management

The project supports multiple environments:

  • Development: Local development environment
  • Staging: Pre-production testing environment
  • Production: Production environment

Environment-specific configurations are managed through:

  • Environment variables
  • Pulumi stacks
  • Koyeb configurations

Deployment

Koyeb Deployment

  1. Set up Koyeb credentials:

    export KOYEB_TOKEN=your_token
  2. Deploy using GitHub Actions:

    • Push to main branch for staging deployment
    • Create a release for production deployment

Infrastructure Updates

To update the infrastructure:

  1. Modify Pulumi configurations:

    cd infrastructure/pulumi
    # Edit __main__.py or Pulumi.dev.yaml
    pulumi up
  2. Update monitoring configurations:

    cd infrastructure/monitoring
    # Edit prometheus.yml or grafana dashboards
    docker-compose up -d --force-recreate

Troubleshooting Docker Registry Secrets

When deploying to Koyeb with a private Docker registry, ensure:

  1. The secret exists:

    koyeb secret get DOCKER_REPO_SECRET
  2. The Docker registry secret has the correct format:

    koyeb secret create DOCKER_REPO_SECRET \
      --docker-registry-auth=YOUR_USERNAME:YOUR_PASSWORD \
      --docker-registry-server=docker.io \
      --type=registry
  3. The Docker image reference in the deployment command includes the full path:

    docker.io/username/duckdb-spawn:tag
    
  4. The deployment command correctly references the secret:

    --docker-private-registry-secret DOCKER_REPO_SECRET
    

Koyeb CLI Commands

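Koyeb CLI commands and flags can change between releases, so the commands in this README may not match your installed version. To verify the Koyeb CLI installation and see the commands it currently supports: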

koyeb --help

To list available apps:

koyeb app list

To check service status:

koyeb service get -a app-name service-name

License

MIT License

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

Contact

Documentation

Detailed documentation for the DuckDB Spawn project is available in the docs directory:

  • Architecture: Comprehensive explanation of the system architecture and design decisions
  • Roadmap: Future development plans and feature timelines
  • Agentic Research: Research initiative on agentic data products using small language models
  • Sidecar Specification: Technical specification for the agentic sidecar implementation