A FastAPI service that manages project data using DuckDB with dynamic schema support from an ontology server. The service is designed as a modular, self-contained data product following data mesh architecture principles.
```mermaid
graph TB
    subgraph "Client Layer"
        Client[API Clients]
    end

    subgraph "API Layer"
        FastAPI[FastAPI Service]
        Cache[Schema Cache]
        ConnMgr[Connection Manager]
    end

    subgraph "Data Layer"
        DuckDB[(DuckDB)]
    end

    subgraph "Schema Layer"
        OntoServer[Ontology Server]
        MockServer[Mock Server]
        SchemaAdapter[Schema Adapter]
    end

    Client --> FastAPI
    FastAPI --> ConnMgr
    ConnMgr --> DuckDB
    FastAPI --> SchemaAdapter
    SchemaAdapter -->|Primary| OntoServer
    SchemaAdapter -->|Fallback| MockServer
    SchemaAdapter -->|Cache| Cache
    Cache --> FastAPI
```
```mermaid
graph TD
    subgraph "Local Environment"
        LocalDev[Local Development]
        Docker[Docker Container]
    end

    subgraph "Infrastructure"
        CICD[CI/CD Pipeline]
        Monitoring[Monitoring Tools]
    end

    subgraph "Staging Environment"
        KoyebStaging[Koyeb Staging]
    end

    subgraph "Production Environment"
        KoyebProd[Koyeb Production]
    end

    LocalDev --> Docker
    Docker --> CICD
    CICD --> KoyebStaging
    CICD --> KoyebProd
    KoyebStaging --> Monitoring
    KoyebProd --> Monitoring
```
The DuckDB Spawn API is designed as a domain-oriented, self-contained data product within a data mesh architecture. It embodies the key principles of data mesh:
- Domain Ownership: Encapsulates project financing data domain with complete ownership over data storage, processing, and schema evolution
- Data as a Product: Provides well-defined APIs, documentation, and SLAs for data consumers
- Self-serve Data Platform: Uses infrastructure-as-code, containerization, and automated deployments for self-service capabilities
- Federated Computational Governance: Schema definitions from an ontology server implement federated governance while maintaining domain autonomy
In a traditional data lake/warehouse architecture, this functionality might be implemented as tables in a central database, managed by a separate data team. In our data mesh approach:
- Domain experts own both the code and data for their domain
- Data schema evolves independently but adheres to organizational standards via the ontology server
- The service is deployable independently without complex dependencies
- Consumers interact with the data through APIs rather than direct database access
DuckDB was selected for several strategic reasons:
- Analytical Performance: DuckDB excels at analytical queries, which align with project financing data needs
- Embeddable Nature: No separate database server infrastructure required, simplifying deployment
- Columnar Storage: Efficient storage and querying for project metrics and time-series data
- Low Operational Overhead: Fits the data product model where each domain owns its complete stack
- Schema Flexibility: Easily adaptable to changing schema requirements while maintaining performance
The ontology server integration provides several benefits:
- Schema Governance: Centralized schema definitions while maintaining domain autonomy
- Evolution Control: Schema changes can be coordinated across multiple data products
- Self-documenting API: API capabilities automatically reflect the current schema
- Fallback Mechanism: Mock server ensures availability even when the ontology server is down
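One way the primary/fallback/cache flow from the architecture diagram could be sketched is below. This is illustrative only: `SchemaAdapter`, `MOCK_SCHEMA`, and the `fetch_remote` callable are hypothetical names, not the project's actual `onto_server.py` API.

```python
import time

# Stand-in for config/mock_onto_responses.py: a last-resort schema.
MOCK_SCHEMA = {"projects": {"id": "INTEGER", "name": "VARCHAR"}}

class SchemaAdapter:
    """Fetches schemas from the ontology server, with a TTL cache and mock fallback."""

    def __init__(self, fetch_remote, ttl: float = 60.0):
        self._fetch_remote = fetch_remote  # callable () -> dict; may raise on outage
        self._ttl = ttl
        self._cache = None
        self._cached_at = 0.0

    def get_schema(self) -> dict:
        now = time.monotonic()
        # Serve from cache while it is fresh.
        if self._cache is not None and now - self._cached_at < self._ttl:
            return self._cache
        try:
            schema = self._fetch_remote()
        except Exception:
            # Ontology server unreachable: fall back to the mock responses.
            schema = MOCK_SCHEMA
        self._cache = schema
        self._cached_at = now
        return schema
```

The TTL cache keeps repeated requests from hitting the ontology server, while the `except` branch is what keeps the API available during an outage.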
The connection pool implementation provides:
- Thread Safety: Ensures concurrent requests don't conflict when accessing the database
- Resource Efficiency: Reuses connections to minimize overhead
- Proper Cleanup: Ensures connections are properly closed to prevent resource leaks
- Domain-Oriented Data Product: Complete encapsulation of project financing data domain
- Dynamic Schema Management: Database tables are created and updated based on schemas from the ontology server
- Connection Management: Thread-safe DuckDB connections with proper transaction handling
- Mock Support: Built-in mock responses for development when the ontology server is unavailable
- Health Monitoring: System metrics and health checks including ontology server status
- Containerized Deployment: Self-contained, portable, and consistently deployable
- Infrastructure as Code: Reproducible infrastructure defined in Pulumi
- API-First Design: Clean, well-documented API endpoints for all operations
Environment variables:
```env
# Ontology Server Configuration
ONTO_SERVER_URL=http://localhost:8001
ONTO_SERVER_TIMEOUT=5
USE_MOCK_ONTO_SERVER=true  # Use mock responses for development

# Database Configuration
DUCKDB_PATH=data_product.db
```
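These variables could be read with a small stdlib-only helper like the sketch below. `load_config` is a hypothetical name (the project may well use pydantic settings instead); only the variable names and defaults mirror the configuration above:

```python
import os

def load_config(env=os.environ) -> dict:
    """Read service configuration from environment variables, with defaults."""
    return {
        "onto_server_url": env.get("ONTO_SERVER_URL", "http://localhost:8001"),
        "onto_server_timeout": int(env.get("ONTO_SERVER_TIMEOUT", "5")),
        # Any casing of "true" enables the mock ontology server.
        "use_mock_onto_server": env.get("USE_MOCK_ONTO_SERVER", "false").lower() == "true",
        "duckdb_path": env.get("DUCKDB_PATH", "data_product.db"),
    }
```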
- `POST /ops/projects`: Create a new project
- `GET /ops/projects`: List all projects
- `GET /ops/projects/{project_id}`: Get project details
- `POST /ops/initialize`: Initialize database with schema

- `POST /admin/tables`: Create tables from schema
- `GET /admin/tables`: List all tables
- `PUT /admin/tables/{table_name}`: Update table schema
- `DELETE /admin/tables/{table_name}`: Delete table
- `POST /admin/logging/level`: Update logging level

- `GET /monitoring/health`: System health status
- `GET /monitoring/metrics/system`: System metrics
- Clone the repository
- Create a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # Unix
  .venv\Scripts\activate     # Windows
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the application:

  ```bash
  uvicorn src.main:app --reload
  ```
This project uses GitHub Actions for CI/CD pipelines, testing, and deployments. To use these workflows, you'll need to set up the following secrets:

- `WORKFLOW_PAT`: A GitHub Personal Access Token with `repo` and `workflow` scopes. This is used for actions that need to access the repository, especially for cross-repository checkout operations.
- `DOCKER_HUB_USERNAME`: Your Docker Hub username
- `DOCKER_HUB_ACCESS_TOKEN`: Docker Hub access token for pushing images
- `KOYEB_API_TOKEN`: API token for Koyeb deployments

To create a Personal Access Token (PAT):

- Go to GitHub Settings → Developer Settings → Personal access tokens → Tokens (classic)
- Generate a new token with at least the `repo` and `workflow` scopes
- Add this token as a repository secret named `WORKFLOW_PAT`
Run tests with pytest:

```bash
pytest
```
```
duckdb-spawn/
├── config/
│   ├── __init__.py
│   ├── onto_server.py           # Schema server interface
│   └── mock_onto_responses.py   # Mock responses
├── infrastructure/
│   ├── pulumi/                  # Infrastructure as Code
│   │   ├── __main__.py          # Main Pulumi program
│   │   ├── Pulumi.yaml          # Pulumi project file
│   │   └── Pulumi.dev.yaml      # Development stack configuration
│   ├── docker/
│   │   ├── Dockerfile           # Application container
│   │   └── docker-compose.yml   # Local development setup
│   └── monitoring/
│       ├── prometheus/
│       │   └── prometheus.yml   # Prometheus configuration
│       └── grafana/
│           └── dashboards/      # Grafana dashboard definitions
├── src/
│   ├── __init__.py
│   ├── main.py
│   ├── routes/
│   │   ├── admin.py             # Admin endpoints
│   │   ├── operations.py        # Project operations
│   │   └── monitoring.py        # Health checks
│   ├── database/
│   │   ├── connection_manager.py
│   │   └── schema.py
│   └── utils/
│       ├── logging_config.py
│       └── metrics.py
├── tests/
│   └── test_routes/
├── .github/
│   └── workflows/
│       ├── ci.yml               # CI pipeline
│       └── koyeb-deploy.yml     # Deployment pipeline
└── deployment/
    ├── staging/                 # Staging environment configs
    └── production/              # Production environment configs
```
The project includes a Docker Compose setup for local development:
```bash
# Start local development environment
docker-compose -f infrastructure/docker/docker-compose.yml up -d

# View logs
docker-compose -f infrastructure/docker/docker-compose.yml logs -f
```
The project uses Pulumi for infrastructure management:
```bash
# Initialize Pulumi stack
cd infrastructure/pulumi
pulumi stack init dev

# Deploy infrastructure
pulumi up

# Destroy infrastructure
pulumi destroy
```
The monitoring stack includes:
- Prometheus for metrics collection
- Grafana for visualization
- Custom dashboards for DuckDB metrics
To deploy the monitoring stack:
```bash
cd infrastructure/monitoring
docker-compose up -d
```
The project uses GitHub Actions for CI/CD:
- Continuous Integration:
  - Automated testing
  - Code quality checks
  - Container image building
- Continuous Deployment:
  - Automated deployment to Koyeb
  - Environment-specific configurations
  - Health check verification
The project supports multiple environments:
- Development: Local development environment
- Staging: Pre-production testing environment
- Production: Production environment
Environment-specific configurations are managed through:
- Environment variables
- Pulumi stacks
- Koyeb configurations
- Set up Koyeb credentials:

  ```bash
  export KOYEB_TOKEN=your_token
  ```

- Deploy using GitHub Actions:
  - Push to `main` branch for staging deployment
  - Create a release for production deployment
To update the infrastructure:
- Modify Pulumi configurations:

  ```bash
  cd infrastructure/pulumi
  # Edit __main__.py or Pulumi.dev.yaml
  pulumi up
  ```

- Update monitoring configurations:

  ```bash
  cd infrastructure/monitoring
  # Edit prometheus.yml or grafana dashboards
  docker-compose up -d --force-recreate
  ```
When deploying to Koyeb with a private Docker registry, ensure:
- The secret exists:

  ```bash
  koyeb secret get DOCKER_REPO_SECRET
  ```

- The Docker registry secret has the correct format:

  ```bash
  koyeb secret create DOCKER_REPO_SECRET \
    --docker-registry-auth=YOUR_USERNAME:YOUR_PASSWORD \
    --docker-registry-server=docker.io \
    --type=registry
  ```

- The Docker image reference in the deployment command includes the full path:

  ```
  docker.io/username/duckdb-spawn:tag
  ```

- The deployment command correctly references the secret:

  ```
  --docker-private-registry-secret DOCKER_REPO_SECRET
  ```
Note that some Koyeb CLI commands might have changed. To verify Koyeb CLI installation and get help:

```bash
koyeb --help
```

To list available apps:

```bash
koyeb app list
```

To check service status:

```bash
koyeb service get -a app-name service-name
```
MIT License
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
- Author: Jean-Baptiste Dezard
- Email: [email protected]
- Project: GitHub Repository
Detailed documentation for the DuckDB Spawn project is available in the `docs/` directory:
- Architecture: Comprehensive explanation of the system architecture and design decisions
- Roadmap: Future development plans and feature timelines
- Agentic Research: Research initiative on agentic data products using small language models
- Sidecar Specification: Technical specification for the agentic sidecar implementation