This document provides a detailed explanation of the architectural decisions, data mesh principles, and design patterns used in the DuckDB Spawn API.
Data Mesh is an architectural paradigm that aims to distribute data ownership to domain teams, treating data as a product, and implementing federated governance for interoperability. The DuckDB Spawn API is designed as a self-contained data product within a data mesh architecture, embodying its key principles.
DuckDB Spawn encapsulates the project financing domain as a self-contained service that:
- Owns its own data storage (DuckDB database)
- Implements domain-specific business logic (project operations)
- Provides domain-specific APIs (project financing endpoints)
- Manages its own infrastructure (Pulumi IaC, Docker containers)
This approach shifts from centralized data platforms to domain-oriented architectures where domain experts have autonomy over their data products.
The service treats its data as a product by:
- Providing well-documented, versioned APIs
- Implementing comprehensive health checks and monitoring
- Offering schema discovery and introspection
- Including proper error handling and feedback mechanisms
- Ensuring discoverability through standard endpoints
The infrastructure is designed to be self-service:
- Containerization enables consistent deployment across environments
- Infrastructure as Code (Pulumi) allows reproducible deployments
- CI/CD pipelines automate testing and deployment
- Configuration management via environment variables simplifies customization
- Local development environment closely mirrors production
Governance is implemented through:
- Schema definitions from a central ontology server
- Local enforcement of schema constraints
- Standard API patterns across endpoints
- Consistent error handling and response formats
- Monitoring and observability standards
Traditional approaches might use PostgreSQL, MySQL, or a cloud database. We chose DuckDB for:
-
Self-Contained Nature: As an embedded database, DuckDB requires no separate infrastructure, aligning perfectly with the "self-contained data product" principle of data mesh.
-
Analytical Performance: DuckDB is designed for analytical workloads with columnar storage, which matches the project financing domain's needs for reporting and analysis.
-
Low Operational Overhead: No database servers to maintain, backup, or scale, which simplifies domain team ownership.
-
Right-Sized Solution: For project financing data volumes, DuckDB provides sufficient performance without excessive complexity.
-
SQL Compatibility: Teams familiar with SQL can work with DuckDB without significant retraining.
- Limited to a single node (no distributed queries)
- Less suitable for extremely high write throughput
- Backup and replication need custom solutions
-
Federated Governance: Implements governance without centralized control, allowing domains to evolve independently while maintaining organizational standards.
-
Schema Evolution: Enables schema changes without service deployments by fetching updated schemas from the ontology server.
-
Discoverability: Ontology server provides a registry of schemas that can be discovered by other data products.
-
Fallback Mechanism: Mock responses ensure the service remains functional even when disconnected from central governance.
- Schema cache reduces ontology server dependency
- Mock server supports local development
- Schema validation ensures data integrity
- Clear separation of schema definition and enforcement
The connection pool implementation addresses critical needs for a robust data product:
-
Thread Safety: FastAPI's async nature requires thread-safe database access, which our connection pool provides.
-
Resource Efficiency: Reusing connections improves performance and reduces resource usage.
-
Error Handling: Comprehensive error handling ensures database issues don't cascade to service failures.
-
Graceful Shutdown: Proper connection cleanup prevents resource leaks.
Pulumi was chosen for infrastructure management because:
-
Programmatic Control: Using Python for infrastructure definitions aligns with the application codebase.
-
State Management: Pulumi's state management enables reproducible deployments.
-
Multi-Cloud Support: While currently deployed to Koyeb, the infrastructure definitions can be adapted to other providers.
-
Resource Dependencies: Pulumi's dependency management ensures proper resource creation order.
The containerization approach:
-
Multi-Stage Builds: Minimizes container size and improves security.
-
Non-Root User: Running as a non-root user improves security posture.
-
Environment Configuration: Configuration via environment variables enables deployment across environments.
-
Volume Management: Persistent volumes for DuckDB data ensure data durability.
Data products must be observable to be trustworthy. Our approach includes:
-
Health Checks: Comprehensive health endpoint verifies database connectivity and ontology server status.
-
Prometheus Integration: Metrics collection for monitoring service performance.
-
Structured Logging: JSON logging facilitates log aggregation and analysis.
-
Tracing: Request tracing for performance analysis.
The API implements a simplified CQRS pattern:
- Command operations (
POST /ops/projects
): Create or modify data - Query operations (
GET /ops/projects
): Retrieve data
This separation allows for performance optimization and scalability.
The ontology server integration implements a circuit breaker pattern with:
- Timeout configuration to prevent cascade failures
- Fallback to mock responses when the server is unavailable
- Caching to reduce dependency on the ontology server
The service implements several security measures:
-
Input Validation: Pydantic models validate input data.
-
Rate Limiting: Prevents abuse of the API.
-
CORS Configuration: Restricts cross-origin requests to approved domains.
-
Trusted Host Middleware: Prevents host header attacks.
As the data mesh architecture evolves, several enhancements could be considered:
-
Authentication and Authorization: Implementing OAuth or API keys for access control.
-
Data Lineage: Tracking data provenance across the mesh.
-
Schema Evolution: Versioning and compatibility management.
-
Event-Driven Integration: Publishing data change events for other data products.
-
Data Contracts: Formalized agreements between data producers and consumers.
The DuckDB Spawn API demonstrates how a domain-specific data product can be implemented following data mesh principles. By combining the right technologies (DuckDB, FastAPI, Pulumi) with architectural patterns suited to domain ownership, the service achieves both independence and interoperability within the larger organizational data ecosystem.