MegaTran is a two-stage framework that generates robust data transformation code from simple user requests. It bridges the gap between user intent and high-quality code by first strengthening the prompt and then using an optimized code generation process.
Stage 1: Weak2StrongPrompt
- A lightweight LLM interprets the user's intent to create a strong prompt.
- How it works:
- We use a fine-tuned, lightweight LLM that is specifically trained to analyze a user's request. It predicts the correct transformation "operator" (e.g., format, extract) and generates a detailed natural-language description of the task (see the sketch after this list).
- The fine-tuned model is available for download on HuggingFace.
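For illustration, once the vLLM server from the quick-start steps below is running, the Weak2Strong step can be queried through vLLM's OpenAI-compatible API. This is a minimal sketch; the exact prompt format expected by the fine-tuned model is an assumption here (`w2s_prompt_inference.py` is the canonical entry point):

```python
# Minimal sketch: query the fine-tuned lightweight LLM served by vLLM to turn
# a raw user request into an operator + natural-language task description.
# Assumption: the served model accepts the raw request as a single chat message;
# see w2s_prompt_inference.py for the actual inference logic.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def weak_to_strong(user_request: str) -> str:
    resp = client.chat.completions.create(
        model="./assets/models/llama3_lora_sft",  # defaults to the served model path
        messages=[{"role": "user", "content": user_request}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

# e.g. weak_to_strong("input:abc, output:ABC")
# -> 'format(): Convert the string to uppercase'
```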
Stage 2: Prompt2Code
- A powerful LLM generates code based on the strong prompt, guided by two key optimizations for error correction and knowledge retrieval.
- How it works:
- Sanity-Check Reflection: An iterative self-correction mechanism that systematically checks generated code against a pre-defined yet extensible checklist of common errors (e.g., syntax errors, value errors).
- Lazy-RAG: When an error involves an unknown library, the system queries a pre-built vector database containing documentation and code snippets from sources such as GitHub and PyPI. A combined sketch of both mechanisms follows this list.
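To make the interplay concrete, here is a hypothetical sketch of the reflection loop with the Lazy-RAG fallback. The checklist entries and the `llm_fix`/`retrieve_docs` callables are illustrative stand-ins for `framework/reflection.py` and `framework/lazy_rag.py`, not the actual implementation:

```python
# Hypothetical sketch of Sanity-Check Reflection with a Lazy-RAG fallback.
import ast

def syntax_check(code, example_in, example_out):
    """Checklist item: does the code parse at all?"""
    try:
        ast.parse(code)
    except SyntaxError as e:
        return f"SyntaxError: {e}"

def value_check(code, example_in, example_out):
    """Checklist item: does the code reproduce the user's example?"""
    env = {}
    try:
        exec(code, env)                       # assumes the code defines transform()
        result = env["transform"](example_in)
        if result != example_out:
            return f"ValueError: expected {example_out!r}, got {result!r}"
    except Exception as e:                    # ModuleNotFoundError triggers Lazy-RAG below
        return f"{type(e).__name__}: {e}"

CHECKLIST = [syntax_check, value_check]       # pre-defined, but extensible: append new checks

def reflect(code, example_in, example_out, llm_fix, retrieve_docs, max_iters=3):
    for _ in range(max_iters):
        errors = [msg for check in CHECKLIST
                  if (msg := check(code, example_in, example_out))]
        if not errors:
            return code                       # all sanity checks passed
        # Lazy-RAG: only retrieve library docs when an unknown module caused the failure
        docs = retrieve_docs(errors) if any("ModuleNotFoundError" in e for e in errors) else ""
        code = llm_fix(code, errors, docs)    # ask the powerful LLM to repair the code
    return code
```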
- Clone this repo and install dependencies:

```bash
pip install -r requirements.txt
```
- Configure environment variables:

```bash
# Create a .env file with your API keys
OPENAI_API_KEY=your_api_key_here
```
- Start the vLLM server for hosting the fine-tuned model:

```bash
# wait for the HF model download to finish before the server starts
vllm serve \
  --model ./assets/models/llama3_lora_sft \
  --config ./etc/vllm-server.yaml
```
Note: You can use `CUDA_VISIBLE_DEVICES` (e.g., `CUDA_VISIBLE_DEVICES=0 vllm serve ...`) to select which GPU the vLLM server runs on.
- Test Weak2Strong module inference:

```bash
python w2s_prompt_inference.py -q "input:abc, output:ABC"
# Expected output:
# format(): Convert the string to uppercase
```
- [offline, optional] Build the RAG vector database:

```bash
# Build the vector database for code-library retrieval
python scripts/build_vector_db.py \
  --config etc/vec_db.yaml \
  [-q "hijri date to gregorian date"]  # optional: test a single query with -q
```

A pre-built vector database is saved in `assets/rag/code_db`.
- Execute the main pipeline:

```bash
# Test mode (with a smaller dataset)
python run.py \
  --config etc/mega-transform.yaml \
  --exp_name demo \
  --model gpt-4o-mini \
  --testing

# Full dataset run
python run.py \
  --config etc/mega-transform.yaml \
  --exp_name exp-1 \
  --model gpt-4o-mini \
  --dataset_name stackoverflow
```
- Check experiment results as shown in the `demo` folder. Results will include:
  - Code generation results (per task)
  - Full test results (`full_result.csv`)
  - Summary statistics (task-level accuracy, token usage, etc.; see the snippet below)
  - Runtime logs for the current run
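For example, summary statistics can be recomputed from the CSV. This snippet is purely illustrative and assumes hypothetical column names (`task_id`, `passed`); check the actual schema of `full_result.csv` first:

```python
# Hypothetical: compute task-level accuracy from full_result.csv.
# Column names ("task_id", "passed") are assumptions for illustration.
import pandas as pd

df = pd.read_csv("demo/full_result.csv")
task_acc = df.groupby("task_id")["passed"].mean()   # per-task pass rate
print(f"task-level accuracy: {task_acc.mean():.3f}")
```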
```
chat-transform/
├── run.py # Main execution script
├── w2s_prompt_inference.py # Weak2strong prompt inference
├── etc/ # Configuration files
│ ├── mega-transform.yaml # pipeline config
│ ├── code-llm.yaml # baseline Code LLM
│ ├── vllm-server.yaml # vLLM server config
│ └── vec_db.yaml # RAG vector database config
├── framework/ # Core components
│ ├── chat_to_inst.py # Chat to instruction conversion
│ ├── code_generator.py # Code generation
│ ├── lazy_rag.py # Lazy RAG module
│ ├── reflection.py # Sanity-check Reflection module
│ └── prompt_generator.py # Prompt composition
├── util/ # Utility modules
│ ├── analyzer.py # Result analysis and reporting
│ ├── load_data.py # Data loading utilities
│ ├── context_manager.py # Context management
│ └── __init__.py
├── assets/ # Model assets
│ ├── models/ # Fine-tuned models
│ └── rag/ # RAG related files (Vec DB, list of missing packages)
├── scripts/ # Utility scripts
│ ├── build_vector_db.py # Build RAG vector database
│ ├── foundation_model.py # Foundation model baseline
│ └── push_to_hf.py # Push to HF
├── temp/ # Temporary files (on-the-fly generated code)
├── .env # Environment variables
└── requirements.txt # Project dependencies
```
Foundation model baseline; for its source code, refer to the original implementation here:
```bash
# Dataset: benchmark-stackoverflow
python scripts/foundation_model.py --dataset stackoverflow --model gpt-4o-mini

# Dataset: benchmark-BingQuery (semantic)
python scripts/foundation_model.py --dataset bingquery-logs --model gpt-4o-mini
```
Naive code generation baseline:

```bash
# use the code-llm config here
python run.py \
  --config etc/code-llm.yaml \
  --exp_name exp-1 \
  --model gpt-4o-mini \
  --dataset_name stackoverflow
```
If you find this work useful, please cite:
```bibtex
@article{DBLP:journals/pvldb/LiYLFT25,
  author  = {Changlun Li and
             Chenyu Yang and
             Yuyu Luo and
             Ju Fan and
             Nan Tang},
  title   = {Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy,
             Low-Cost, and Explainable Data Transformation},
  journal = {Proc. {VLDB} Endow.},
  volume  = {18},
  number  = {8},
  pages   = {2371--2384},
  year    = {2025}
}
```