
MegaTran

[VLDB'25] Official repository for the paper "Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy, Low-Cost, and Explainable Data Transformation".

Framework Overview


MegaTran is a two-stage framework that generates robust data transformation code from simple user requests. It bridges the gap between user intent and high-quality code by first strengthening the prompt and then using an optimized code generation process.

Stage 1: Weak2StrongPrompt

  • A lightweight LLM interprets the user's intent to create a strong prompt.
  • How it works:
    • We use a fine-tuned, lightweight LLM that is specifically trained to analyze a user's request. It predicts the correct transformation "operator" (e.g., format, extract) and generates a detailed natural language description of the task.
    • The fine-tuned model is available for download on HuggingFace; a minimal inference sketch follows this list.
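
A minimal sketch of calling the hosted model directly, assuming the vLLM server exposes its OpenAI-compatible API on localhost:8000 and serves the model under the path used in Setup (both assumptions; w2s_prompt_inference.py is the supported entry point):

# Minimal sketch: query the fine-tuned Weak2Strong model through vLLM's
# OpenAI-compatible API. The endpoint, port, and served model name are
# assumptions; w2s_prompt_inference.py is the supported interface.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="./assets/models/llama3_lora_sft",  # assumed served model name
    messages=[{"role": "user", "content": "input:abc, output:ABC"}],
    temperature=0.0,
)
# Expected shape of a strong prompt: "<operator>(): <task description>"
print(response.choices[0].message.content)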

Stage 2: Prompt2Code

  • A powerful LLM generates code based on the strong prompt, guided by two key optimizations for error correction and knowledge retrieval.
  • How it works:
    • Sanity-Check Reflection: An iterative self-correction mechanism that systematically checks the generated code against a pre-defined yet extensible checklist of common errors (e.g., syntax and value errors); see the sketch after this list.
    • Lazy-RAG: When an error involves an unknown library, the framework queries a pre-built vector database containing documentation and code snippets from sources such as GitHub and PyPI.
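
The reflection loop can be pictured with the following minimal sketch; generate_code, repair_code, and the checklist entries are hypothetical helpers standing in for the actual implementation in framework/reflection.py:

# Minimal sketch of Sanity-Check Reflection: iteratively check generated
# code against a checklist and ask the LLM to repair any failures.
# generate_code, repair_code, and checklist are hypothetical helpers;
# see framework/reflection.py for the real implementation.
from typing import Callable, Optional

def reflect(strong_prompt: str,
            generate_code: Callable[[str], str],
            repair_code: Callable[[str, str], str],
            checklist: list[Callable[[str], Optional[str]]],
            max_rounds: int = 3) -> str:
    code = generate_code(strong_prompt)
    for _ in range(max_rounds):
        # Collect an error message from every failed check (e.g. syntax, value).
        errors = [msg for check in checklist if (msg := check(code))]
        if not errors:
            break  # the code passed all sanity checks
        code = repair_code(code, "\n".join(errors))
    return code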

Setup

  1. Clone this repo and install dependencies
pip install -r requirements.txt
  2. Configure environment variables
# Create .env file with your API keys
OPENAI_API_KEY=your_api_key_here
  3. Start the vLLM server to host the fine-tuned model
# wait for the HuggingFace model download to finish before serving
vllm serve \
    --model ./assets/models/llama3_lora_sft \
    --config ./etc/vllm-server.yaml

Note: You can use CUDA_VISIBLE_DEVICES to select the GPU device for the vLLM server.

  4. Test Weak2Strong module inference
python w2s_prompt_inference.py -q "input:abc, output:ABC"

# Expected output: 
# format(): Convert the string to uppercase
  5. [offline, optional] Build the RAG vector database
# Build the vector database for code library retrieval
python scripts/build_vector_db.py \
    --config etc/vec_db.yaml \
    [-q "hijri date to gregorian date"] # test single query by adding this argument

A pre-built vector database is provided in assets/rag/code_db; a minimal query sketch is shown below.
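
A minimal retrieval sketch against a store like this one, assuming (an assumption, not confirmed by the repo) that it is a Chroma database with a collection named code_docs; scripts/build_vector_db.py -q "..." remains the supported query path:

# Minimal retrieval sketch. The Chroma backend and the collection name
# "code_docs" are assumptions; use scripts/build_vector_db.py -q "..."
# for the supported interface.
import chromadb

client = chromadb.PersistentClient(path="assets/rag/code_db")
collection = client.get_collection("code_docs")  # hypothetical name

results = collection.query(
    query_texts=["hijri date to gregorian date"],
    n_results=3,  # top-3 documentation / code snippets
)
for doc in results["documents"][0]:
    print(doc[:200])  # preview each retrieved snippet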

User Guide

  1. Execute the main pipeline
# Test mode (uses a smaller dataset)
python run.py \
    --config etc/mega-transform.yaml \
    --exp_name demo \
    --model gpt-4o-mini \
    --testing

# Full dataset run
python run.py \
    --config etc/mega-transform.yaml \
    --exp_name exp-1 \
    --model gpt-4o-mini \
    --dataset_name stackoverflow
  2. Check experiment results, as shown in the demo folder. Results include (a quick aggregation sketch follows this list):
  • Code Generation Results (per task)
  • Full test results (full_result.csv)
  • Summary statistics (task-level accuracy, token usage, etc.)
  • Runtime logs for the current run
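
For quick ad-hoc inspection, full_result.csv can be aggregated along these lines; the column names task_id, passed, and total_tokens are assumptions about the output schema, and util/analyzer.py is the repo's own reporting path:

# Sketch: aggregate per-task results from full_result.csv.
# The column names (task_id, passed, total_tokens) are assumed, not taken
# from the repo; util/analyzer.py produces the official summary.
import pandas as pd

df = pd.read_csv("demo/full_result.csv")  # path depends on --exp_name

task_acc = df.groupby("task_id")["passed"].mean()  # per-task pass rate
print(f"Task-level accuracy: {task_acc.mean():.2%}")
print(f"Total token usage:   {df['total_tokens'].sum():,}")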

Project Structure

chat-transform/
├── run.py                # Main execution script
├── w2s_prompt_inference.py # Weak2strong prompt inference
├── etc/                  # Configuration files
│   ├── mega-transform.yaml # pipeline config
│   ├── code-llm.yaml       # baseline Code LLM
│   ├── vllm-server.yaml    # vLLM server config
│   └── vec_db.yaml         # RAG vector database config
├── framework/            # Core components
│   ├── chat_to_inst.py   # Chat to instruction conversion
│   ├── code_generator.py # Code generation
│   ├── lazy_rag.py       # Lazy RAG module
│   ├── reflection.py     # Sanity-check Reflection module
│   └── prompt_generator.py # Prompt composition
├── util/                 # Utility modules
│   ├── analyzer.py       # Result analysis and reporting
│   ├── load_data.py      # Data loading utilities
│   ├── context_manager.py # Context management
│   └── __init__.py
├── assets/               # Model assets
│   ├── models/           # Fine-tuned models
│   └── rag/              # RAG related files (Vec DB, list of missing packages)
├── scripts/              # Utility scripts
│   ├── build_vector_db.py # Build RAG vector database
│   ├── foundation_model.py # Foundation model baseline
│   └── push_to_hf.py      # Push to HF
├── temp/                 # Temporary files (on-the-fly generated code)
├── .env                  # Environment variables
└── requirements.txt      # Project dependencies

Baselines

Foundation model baseline; for the source code, refer to the original implementation here:

# Dataset: benchmark-stackoverflow
python scripts/foundation_model.py --dataset stackoverflow --model gpt-4o-mini

# Dataset: benchmark-BingQuery (semantic)
python scripts/foundation_model.py --dataset bingquery-logs --model gpt-4o-mini

Naive code generation baseline:

# use the code-llm config here
python run.py \
    --config etc/code-llm.yaml \
    --exp_name exp-1 \
    --model gpt-4o-mini \
    --dataset_name stackoverflow

Citation

If you find this work useful, please cite:

@article{DBLP:journals/pvldb/LiYLFT25,
  author       = {Changlun Li and
                  Chenyu Yang and
                  Yuyu Luo and
                  Ju Fan and
                  Nan Tang},
  title        = {Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy,
                  Low-Cost, and Explainable Data Transformation},
  journal      = {Proc. {VLDB} Endow.},
  volume       = {18},
  number       = {8},
  pages        = {2371--2384},
  year         = {2025}
}
