DataMate Runtime

Overview

DataMate Runtime provides core functionality for data processing, operator execution, and data collection, built on Python 3.12+ and the FastAPI framework.

Architecture

runtime/
├── datamate-python/      # FastAPI backend service (port 18000)
├── python-executor/      # Ray distributed executor
├── ops/                 # Operator ecosystem
├── datax/               # DataX data read/write framework
└── deer-flow/            # DeerFlow service

Components

1. datamate-python (FastAPI Backend)

Port: 18000

Core Python service providing:

Data Synthesis: QA generation, document processing
Data Annotation: Label Studio integration, auto-annotation
Data Evaluation: Model evaluation, quality checks
Data Cleaning: Data cleaning pipelines
Operator Marketplace: Operator management, upload
RAG Indexing: Vector indexing, knowledge base management
Data Collection: Scheduled tasks, data source integration

Technology Stack:

FastAPI 0.124+
SQLAlchemy 2.0+ (async)
Pydantic 2.12+
PostgreSQL (via asyncpg)
Milvus (via pymilvus)
APScheduler (scheduled tasks)

2. python-executor (Ray Executor)

Ray distributed execution framework responsible for:

Operator Execution: Execute data processing operators
Task Scheduling: Async task management
Distributed Computing: Multi-node parallel processing

Technology Stack:

Ray 2.7.0
FastAPI (executor API)
Data-Juicer (data processing)

3. ops (Operator Ecosystem)

Operator ecosystem including:

filter: Data filtering (deduplication, sensitive content, quality filtering)
mapper: Data transformation (cleaning, normalization)
slicer: Data slicing (text splitting, slide extraction)
formatter: Format conversion (PDF → text, slide → JSON)
llms: LLM operators (quality evaluation, condition checking)
annotation: Annotation operators (object detection, segmentation)

See: runtime/ops/README.md for operator development guide

4. datax (DataX Framework)

DataX data read/write framework supporting multiple data sources:

Readers: MySQL, PostgreSQL, Oracle, MongoDB, Elasticsearch, HDFS, S3, NFS, GlusterFS, API, etc.
Writers: Same as above, supports writing to targets

Technology Stack: Java (Maven build)

5. deer-flow (DeerFlow Service)

DeerFlow service (see conf.yaml for configuration).

Quick Start

Prerequisites

Python 3.12+
Poetry (for datamate-python)
Ray 2.7.0+ (for python-executor)

Run datamate-python

cd runtime/datamate-python
poetry install
poetry run uvicorn app.main:app --reload --port 18000

Run python-executor

cd runtime/python-executor
poetry install
ray start --head

Development

datamate-python Module Structure

app/
├── core/              # Logging, exception, config
├── db/
│   ├── models/        # SQLAlchemy models
│   └── session.py     # Async session
├── module/
│   ├── annotation/    # Label Studio integration
│   ├── collection/    # Data collection
│   ├── cleaning/      # Data cleaning
│   ├── dataset/       # Dataset management
│   ├── evaluation/    # Model evaluation
│   ├── generation/    # QA synthesis
│   ├── operator/      # Operator marketplace
│   ├── rag/           # RAG indexing
│   └── shared/        # Shared schemas
└── main.py            # FastAPI entry

Code Conventions

Routes: APIRouter in interface/*.py
Dependency Injection: Depends(get_db) for session
Error Handling: raise BusinessError(ErrorCodes.XXX, context)
Transactions: async with transaction(db):
Models: Extend BaseEntity (audit fields auto-filled)

Testing

cd runtime/datamate-python
poetry run pytest

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataMate Runtime

Overview

Architecture

Components

1. datamate-python (FastAPI Backend)

2. python-executor (Ray Executor)

3. ops (Operator Ecosystem)

4. datax (DataX Framework)

5. deer-flow (DeerFlow Service)

Quick Start

Prerequisites

Run datamate-python

Run python-executor

Development

datamate-python Module Structure

Code Conventions

Testing

Configuration

Environment Variables

Documentation

Related Links

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

DataMate Runtime

Overview

Architecture

Components

1. datamate-python (FastAPI Backend)

2. python-executor (Ray Executor)

3. ops (Operator Ecosystem)

4. datax (DataX Framework)

5. deer-flow (DeerFlow Service)

Quick Start

Prerequisites

Run datamate-python

Run python-executor

Development

datamate-python Module Structure

Code Conventions

Testing

Configuration

Environment Variables

Documentation

Related Links