sour6ce/ETLCase
ETLCase Framework Documentation

ETLCase is a high-performance, async-first ETL framework designed to integrate Large Language Models (LLMs) directly into data processing workflows. It supports both high-throughput Batch processing and unbounded Continuous streaming, and provides real-time observability and AI-driven data analysis.

Project Architecture

The framework is divided into three primary layers:

1. The Core Engine (src/engine/)

  • Pipeline: The central orchestrator. It uses a decorator-based API to define tasks and dependencies. It automatically constructs a Directed Acyclic Graph (DAG) and performs a topological sort to ensure tasks run in the correct order.
  • PipelineContext: A thread-safe container passed to every task, holding artifacts, execution history, and the event log.
  • TaskInput: A unified interface that allows tasks to iterate over inputs (async for name, data in inputs) regardless of whether the pipeline is in Batch (one-off) or Continuous (queue-based) mode.
  • Profiling: Every task is monitored for duration, record count, status, and errors. A rolling event log tracks system-level signals.
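The dependency resolution described above can be sketched with the standard library's graphlib; the task names and the graph itself are illustrative, not part of the ETLCase API, which builds the graph from @DEF.task decorators:

```python
from graphlib import TopologicalSorter

# Dependency graph: each task maps to the set of tasks it depends on.
# (Illustrative names; ETLCase derives this from decorator metadata.)
graph = {
    "load_data": set(),
    "clean_data": {"load_data"},
    "join_data": {"load_data"},
    "report": {"clean_data", "join_data"},
}

# static_order() yields tasks with all dependencies first, and raises
# CycleError if the graph is not a DAG.
order = list(TopologicalSorter(graph).static_order())
```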

2. LLM Integration (src/llm/)

  • LLMGateway: A unified gateway for AI interactions. It supports:
    • Text Completion: Standard prompt/response.
    • Vision API: Seamlessly handles image bytes in the prompt.
    • Structured Output: Uses Pydantic models to force the LLM to return valid JSON, which is automatically parsed into Python objects.
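The structured-output flow can be illustrated with a stdlib-only stand-in for the validation step; the real gateway uses Pydantic models, and `parse_llm_json` below is a hypothetical helper, not part of the ETLCase API:

```python
import json
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    price: float

def parse_llm_json(raw: str) -> Product:
    """Validate a raw LLM response against the expected structure.
    (Stand-in for Pydantic's model_validate_json.)"""
    data = json.loads(raw)
    return Product(name=str(data["name"]), price=float(data["price"]))

result = parse_llm_json('{"name": "widget", "price": 9.5}')
```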

3. Monitoring Dashboard (src/app/)

  • Real-time UI: A Streamlit-based dashboard that polls the running engine.
  • Live Profiling: Visualizes the status of every task, showing "Records Out" and real-time durations.
  • AI Analysis History: Displays findings from @DEF.analysis decorators. It maintains a historical list of results, showing the most recent insights at the top.
  • Usage Metrics: Breakdowns of LLM costs and token counts per provider.

LLM Utilities (src/engine/utils/llm.py)

The framework includes optimized utilities for common AI-assisted data tasks:

1. generate_structured_column

This utility generates a new structured column in a Polars DataFrame using an LLM. It is designed for maximum efficiency:

  • Automatic Batching: If the DataFrame contains multiple rows, it dynamically creates a wrapper Pydantic model to request a list of objects from the LLM in a single API call.
  • Schema Enforcement: Uses Pydantic to ensure the LLM output matches your required structure.
  • Integration: Returns a Polars DataFrame with the new column containing structured data (Structs).
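The batching idea can be sketched as follows, assuming a hypothetical prompt builder; the real utility instead wraps rows in a dynamically created Pydantic list model, but the payoff is the same: one API call instead of one per row:

```python
import json

def build_batch_prompt(rows: list[dict], instruction: str) -> str:
    """Pack all rows into one prompt so the LLM returns a single JSON
    list with one object per input row."""
    payload = json.dumps(rows)
    return (
        f"{instruction}\n"
        f"Input rows (JSON list): {payload}\n"
        "Return a JSON list with exactly one object per input row."
    )

prompt = build_batch_prompt(
    [{"text": "great product"}, {"text": "broken on arrival"}],
    "Classify the sentiment of each row.",
)
```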

2. find_anomalies

A powerful diagnostic tool for comparing two datasets (e.g., current chunk vs. historical data):

  • Statistical Analysis: Automatically calculates numerical distributions (mean, std, min, max, etc.) for both datasets.
  • Keyword Scanning: Scans text columns for specific error signals like "error", "fail", or "timeout".
  • AI Summary: Constructs a comprehensive report of statistical shifts and textual keyword matches, then uses the LLM to explain the differences in human-readable terms.
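The statistical and keyword passes can be approximated with the statistics module; the keyword list, metrics, and function names here are illustrative:

```python
import statistics

ERROR_KEYWORDS = ("error", "fail", "timeout")

def numeric_shift(current: list[float], historical: list[float]) -> dict:
    """Compare mean/std between a current chunk and historical data."""
    return {
        "mean_delta": statistics.mean(current) - statistics.mean(historical),
        "std_delta": statistics.stdev(current) - statistics.stdev(historical),
    }

def keyword_hits(texts: list[str]) -> int:
    """Count rows containing any error-signal keyword."""
    return sum(
        any(kw in t.lower() for kw in ERROR_KEYWORDS) for t in texts
    )

shift = numeric_shift([10.0, 12.0, 50.0], [10.0, 11.0, 12.0])
hits = keyword_hits(["ok", "Timeout while connecting", "ERROR 500"])
```

A report built from `shift` and `hits` would then be handed to the LLM for the human-readable summary.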

How to Define a New Pipeline

Pipelines are defined in Python files within the pipelines/ directory.

1. Basic Setup

```python
import polars as pl
from src.engine.pipeline import Pipeline
from src.engine.common import ETLStage

# continuous=True enables asyncio.Queue-based streaming
DEF = Pipeline(continuous=False)
```

2. Defining Tasks

Use the @DEF.task decorator. Tasks are grouped by ETLStage (Extract, Transform, Load).

```python
@DEF.task(stage=ETLStage.Extract, name="load_data")
async def extract(ctx):
    yield pl.read_csv("data.csv")

@DEF.task(stage=ETLStage.Transform, name="clean_data", dependencies=["load_data"])
async def transform(ctx, inputs):
    async for name, df in inputs:
        # data processing logic
        yield df.filter(pl.col("value") > 0)
```

3. Adding AI Analysis

Analyses run in the background as soon as a task emits data. They do not block the main data flow.

```python
@DEF.analysis(task_name="clean_data", name="Anomaly Detector")
async def check_for_weirdness(ctx, current_output, previous_output):
    if current_output.height > 100:
        return "Warning: Unusually large data chunk detected."
    return None  # Returning None hides the result from the UI
```

Capabilities: What You Can Build

1. Vision-Integrated ETL

The framework is optimized for multi-modal data. You can pipe raw image bytes from an Extract task into a Transform task that uses gateway.complete(images=[...]) to describe or OCR the images, then load the resulting metadata into a database.

2. AI-Driven Data Validation

Instead of writing complex regex or validation rules, use @DEF.analysis to let an LLM "watch" the data.

  • Example: In the money pipeline, an LLM analyzes joined datasets to report on the human impact of failed transactions.
  • Filtering: If the LLM finds nothing important, the analysis returns None, keeping the dashboard clutter-free.

3. Optimized AI Batching

Using src.engine.utils.llm.generate_structured_column, you can process entire DataFrames in a single LLM call. The utility automatically batches rows into a structured list prompt, significantly reducing latency and API costs compared to row-by-row processing.

4. Continuous Live Streams

Set continuous=True to build pipelines that never stop.

  • Extract: Polls an API, watches a folder for new files, or listens to a socket.
  • Transform/Load: Process data as it arrives.
  • Observability: The dashboard updates live as every individual chunk passes through the system.
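A minimal sketch of the queue-based streaming mode using plain asyncio; the real engine wraps this pattern in TaskInput, and the task bodies and sentinel convention here are illustrative:

```python
import asyncio

async def extract(queue: asyncio.Queue) -> None:
    # Simulates polling a source: emit three chunks, then a sentinel.
    for chunk in ([1, 2], [3, -1], [4]):
        await queue.put(chunk)
    await queue.put(None)  # sentinel: end of stream

async def transform(queue: asyncio.Queue) -> list[int]:
    results: list[int] = []
    while (chunk := await queue.get()) is not None:
        # Process each chunk as soon as it arrives.
        results.extend(x for x in chunk if x > 0)
    return results

async def main() -> list[int]:
    queue: asyncio.Queue = asyncio.Queue()
    producer = asyncio.create_task(extract(queue))
    results = await transform(queue)
    await producer
    return results

out = asyncio.run(main())
```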

Running the Project

  1. Start the Dashboard:
    uv run streamlit run src/app/dashboard.py
  2. Select a Pipeline: Use the sidebar to choose between money, products, or micro.
  3. Run: Click "Run" to see the engine execute the DAG, track costs, and display AI insights in real-time.
