DataMate Runtime provides core functionality for data processing, operator execution, and data collection, built on Python 3.12+ and the FastAPI framework.
runtime/
├── datamate-python/ # FastAPI backend service (port 18000)
├── python-executor/ # Ray distributed executor
├── ops/ # Operator ecosystem
├── datax/ # DataX data read/write framework
└── deer-flow/ # DeerFlow service
Port: 18000
Core Python service providing:
- Data Synthesis: QA generation, document processing
- Data Annotation: Label Studio integration, auto-annotation
- Data Evaluation: Model evaluation, quality checks
- Data Cleaning: Data cleaning pipelines
- Operator Marketplace: Operator management, upload
- RAG Indexing: Vector indexing, knowledge base management
- Data Collection: Scheduled tasks, data source integration
Technology Stack:
- FastAPI 0.124+
- SQLAlchemy 2.0+ (async)
- Pydantic 2.12+
- PostgreSQL (via asyncpg)
- Milvus (via pymilvus)
- APScheduler (scheduled tasks)
Ray distributed execution framework responsible for:
- Operator Execution: Execute data processing operators
- Task Scheduling: Async task management
- Distributed Computing: Multi-node parallel processing
Technology Stack:
- Ray 2.7.0
- FastAPI (executor API)
- Data-Juicer (data processing)
Operator ecosystem including:
- filter: Data filtering (deduplication, sensitive content, quality filtering)
- mapper: Data transformation (cleaning, normalization)
- slicer: Data slicing (text splitting, slide extraction)
- formatter: Format conversion (PDF → text, slide → JSON)
- llms: LLM operators (quality evaluation, condition checking)
- annotation: Annotation operators (object detection, segmentation)
See: runtime/ops/README.md for operator development guide
DataX data read/write framework supporting multiple data sources:
- Readers: MySQL, PostgreSQL, Oracle, MongoDB, Elasticsearch, HDFS, S3, NFS, GlusterFS, API, etc.
- Writers: Same as above, supports writing to targets
Technology Stack: Java (Maven build)
DeerFlow service (see conf.yaml for configuration).
- Python 3.12+
- Poetry (for datamate-python)
- Ray 2.7.0+ (for python-executor)
cd runtime/datamate-python
poetry install
poetry run uvicorn app.main:app --reload --port 18000cd runtime/python-executor
poetry install
ray start --headapp/
├── core/ # Logging, exception, config
├── db/
│ ├── models/ # SQLAlchemy models
│ └── session.py # Async session
├── module/
│ ├── annotation/ # Label Studio integration
│ ├── collection/ # Data collection
│ ├── cleaning/ # Data cleaning
│ ├── dataset/ # Dataset management
│ ├── evaluation/ # Model evaluation
│ ├── generation/ # QA synthesis
│ ├── operator/ # Operator marketplace
│ ├── rag/ # RAG indexing
│ └── shared/ # Shared schemas
└── main.py # FastAPI entry
- Routes:
APIRouterininterface/*.py - Dependency Injection:
Depends(get_db)for session - Error Handling:
raise BusinessError(ErrorCodes.XXX, context) - Transactions:
async with transaction(db): - Models: Extend
BaseEntity(audit fields auto-filled)
cd runtime/datamate-python
poetry run pytestDATABASE_URL: PostgreSQL connection stringLABEL_STUDIO_BASE_URL: Label Studio URLRAY_ENABLED: Enable Ray executorRAY_ADDRESS: Ray cluster address
- API Docs: http://localhost:18000/redoc
- Operator Guide: See
runtime/ops/README.mdfor operator development