MegaTran is a two-stage framework that generates robust data transformation code from simple user requests. It bridges the gap between user intent and high-quality code by first strengthening the prompt and then using an optimized code generation process.
Stage 1: Weak2StrongPrompt
- A lightweight LLM interprets the user's intent to create a strong prompt.
- How it works:
- We use a fine-tuned, lightweight LLM that is specifically trained to analyze a user's request. It predicts the correct transformation "operator" (e.g., format, extract) and generates a detailed natural-language description of the task (see the sketch after this list).
- The fine-tuned model is available for download on HuggingFace.
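For illustration, once the vLLM server from the quick-start steps below is running, the Weak2Strong step can be queried through vLLM's OpenAI-compatible API. This is a minimal sketch; the exact prompt format expected by the fine-tuned model is an assumption here (`w2s_prompt_inference.py` is the canonical entry point):

```python
# Minimal sketch: query the fine-tuned lightweight LLM served by vLLM to turn
# a raw user request into an operator + natural-language task description.
# Assumption: the served model accepts the raw request as a single chat message;
# see w2s_prompt_inference.py for the actual inference logic.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def weak_to_strong(user_request: str) -> str:
    resp = client.chat.completions.create(
        model="./assets/models/llama3_lora_sft",  # defaults to the served model path
        messages=[{"role": "user", "content": user_request}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

# e.g. weak_to_strong("input:abc, output:ABC")
# -> 'format(): Convert the string to uppercase'
```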
Stage 2: Prompt2Code
- A powerful LLM generates code based on the strong prompt, guided by two key optimizations for error correction and knowledge retrieval.
- How it works:
- Sanity-Check Reflection: An iterative self-correction mechanism that systematically checks generated code against a pre-defined yet extensible checklist of common errors (e.g., syntax errors, value errors).
- Lazy-RAG: When an error involves an unknown library, the system queries a pre-built vector database containing documentation and code snippets from sources such as GitHub and PyPI. A combined sketch of both mechanisms follows this list.
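To make the interplay concrete, here is a hypothetical sketch of the reflection loop with the Lazy-RAG fallback. The checklist entries and the `llm_fix`/`retrieve_docs` callables are illustrative stand-ins for `framework/reflection.py` and `framework/lazy_rag.py`, not the actual implementation:

```python
# Hypothetical sketch of Sanity-Check Reflection with a Lazy-RAG fallback.
import ast

def syntax_check(code, example_in, example_out):
    """Checklist item: does the code parse at all?"""
    try:
        ast.parse(code)
    except SyntaxError as e:
        return f"SyntaxError: {e}"

def value_check(code, example_in, example_out):
    """Checklist item: does the code reproduce the user's example?"""
    env = {}
    try:
        exec(code, env)                       # assumes the code defines transform()
        result = env["transform"](example_in)
        if result != example_out:
            return f"ValueError: expected {example_out!r}, got {result!r}"
    except Exception as e:                    # ModuleNotFoundError triggers Lazy-RAG below
        return f"{type(e).__name__}: {e}"

CHECKLIST = [syntax_check, value_check]       # pre-defined, but extensible: append new checks

def reflect(code, example_in, example_out, llm_fix, retrieve_docs, max_iters=3):
    for _ in range(max_iters):
        errors = [msg for check in CHECKLIST
                  if (msg := check(code, example_in, example_out))]
        if not errors:
            return code                       # all sanity checks passed
        # Lazy-RAG: only retrieve library docs when an unknown module caused the failure
        docs = retrieve_docs(errors) if any("ModuleNotFoundError" in e for e in errors) else ""
        code = llm_fix(code, errors, docs)    # ask the powerful LLM to repair the code
    return code
```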
- Clone this repo and install dependencies:

```bash
pip install -r requirements.txt
```
- Configure environment variables:

```bash
# Create a .env file with your API keys
OPENAI_API_KEY=your_api_key_here
```
- Start the vLLM server for hosting the fine-tuned model:

```bash
# wait for the HF model download to finish before the server starts
vllm serve \
  --model ./assets/models/llama3_lora_sft \
  --config ./etc/vllm-server.yaml
```
Note: You can use `CUDA_VISIBLE_DEVICES` (e.g., `CUDA_VISIBLE_DEVICES=0 vllm serve ...`) to select which GPU the vLLM server runs on.
- Test Weak2Strong module inference:

```bash
python w2s_prompt_inference.py -q "input:abc, output:ABC"
# Expected output:
# format(): Convert the string to uppercase
```
- [offline, optional] Build the RAG vector database:

```bash
# Build the vector database for code-library retrieval
python scripts/build_vector_db.py \
  --config etc/vec_db.yaml \
  [-q "hijri date to gregorian date"]  # optional: test a single query with -q
```

A pre-built vector database is saved in `assets/rag/code_db`.
- Execute the main pipeline:

```bash
# Test mode (with a smaller dataset)
python run.py \
  --config etc/mega-transform.yaml \
  --exp_name demo \
  --model gpt-4o-mini \
  --testing

# Full dataset run
python run.py \
  --config etc/mega-transform.yaml \
  --exp_name exp-1 \
  --model gpt-4o-mini \
  --dataset_name stackoverflow
```
- Check experiment results as shown in the `demo` folder. Results will include:
  - Code generation results (per task)
  - Full test results (`full_result.csv`)
  - Summary statistics (task-level accuracy, token usage, etc.; see the snippet below)
  - Runtime logs for the current run
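For example, summary statistics can be recomputed from the CSV. This snippet is purely illustrative and assumes hypothetical column names (`task_id`, `passed`); check the actual schema of `full_result.csv` first:

```python
# Hypothetical: compute task-level accuracy from full_result.csv.
# Column names ("task_id", "passed") are assumptions for illustration.
import pandas as pd

df = pd.read_csv("demo/full_result.csv")
task_acc = df.groupby("task_id")["passed"].mean()   # per-task pass rate
print(f"task-level accuracy: {task_acc.mean():.3f}")
```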
```
chat-transform/
├── run.py # Main execution script
├── w2s_prompt_inference.py # Weak2strong prompt inference
├── etc/ # Configuration files
│ ├── mega-transform.yaml # pipeline config
│ ├── code-llm.yaml # baseline Code LLM
│ ├── vllm-server.yaml # vLLM server config
│ └── vec_db.yaml # RAG vector database config
├── framework/ # Core components
│ ├── chat_to_inst.py # Chat to instruction conversion
│ ├── code_generator.py # Code generation
│ ├── lazy_rag.py # Lazy RAG module
│ ├── reflection.py # Sanity-check Reflection module
│ └── prompt_generator.py # Prompt composition
├── util/ # Utility modules
│ ├── analyzer.py # Result analysis and reporting
│ ├── load_data.py # Data loading utilities
│ ├── context_manager.py # Context management
│ └── __init__.py
├── assets/ # Model assets
│ ├── models/ # Fine-tuned models
│ └── rag/ # RAG related files (Vec DB, list of missing packages)
├── scripts/ # Utility scripts
│ ├── build_vector_db.py # Build RAG vector database
│ ├── foundation_model.py # Foundation model baseline
│ └── push_to_hf.py # Push to HF
├── temp/ # Temporary files (on-the-fly generated code)
├── .env # Environment variables
└── requirements.txt # Project dependencies
```
Foundation model baseline; for its source code, refer to the original implementation here:
```bash
# Dataset: benchmark-stackoverflow
python scripts/foundation_model.py --dataset stackoverflow --model gpt-4o-mini

# Dataset: benchmark-BingQuery (semantic)
python scripts/foundation_model.py --dataset bingquery-logs --model gpt-4o-mini
```
Naive code generation baseline:

```bash
# use the code-llm config here
python run.py \
  --config etc/code-llm.yaml \
  --exp_name exp-1 \
  --model gpt-4o-mini \
  --dataset_name stackoverflow
```
If you find this work useful, please cite:
```bibtex
@article{DBLP:journals/pvldb/LiYLFT25,
  author  = {Changlun Li and
             Chenyu Yang and
             Yuyu Luo and
             Ju Fan and
             Nan Tang},
  title   = {Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy,
             Low-Cost, and Explainable Data Transformation},
  journal = {Proc. {VLDB} Endow.},
  volume  = {18},
  number  = {8},
  pages   = {2371--2384},
  year    = {2025}
}
```