MaskSQL is a privacy-preserving framework for LLM-based text-to-SQL that uses schema masking and progressive unmasking to protect sensitive database information while maintaining high query accuracy.
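As a rough illustration of the general idea behind schema masking, the sketch below replaces sensitive identifiers with placeholders before text is sent to a remote model, then restores them in the returned SQL. This is a toy example and not MaskSQL's implementation: the function names, the `MASK_n` placeholder format, and the sample question are all hypothetical, and progressive unmasking is not shown.

```python
import re


def mask_text(text, sensitive_terms):
    """Replace each sensitive term with a placeholder.

    Returns the masked text and a placeholder -> original term mapping.
    The MASK_n placeholder format is an assumption made for this sketch.
    """
    mapping = {}
    masked = text
    # Handle longer terms first so a short term never rewrites part of a longer one.
    ordered = sorted(sensitive_terms, key=len, reverse=True)
    for i, term in enumerate(ordered, start=1):
        placeholder = f"MASK_{i}"
        mapping[placeholder] = term
        pattern = rf"\b{re.escape(term)}\b"
        masked = re.sub(pattern, placeholder, masked, flags=re.IGNORECASE)
    return masked, mapping


def unmask_text(text, mapping):
    """Restore the original terms in text that contains placeholders."""
    for placeholder, term in mapping.items():
        pattern = rf"\b{re.escape(placeholder)}\b"
        # A function replacement avoids interpreting backslashes in the term.
        text = re.sub(pattern, lambda _match, t=term: t, text)
    return text


question = "List the salary of every employee in payroll"
masked_question, mapping = mask_text(question, ["payroll", "salary", "employee"])

# A remote model would only ever see masked_question and masked SQL.
masked_sql = "SELECT MASK_3 FROM MASK_2"
restored_sql = unmask_text(masked_sql, mapping)
```

Here `masked_question` becomes `List the MASK_3 of every MASK_1 in MASK_2`, and `restored_sql` becomes `SELECT salary FROM payroll`.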
- Python 3.11
- uv package manager
Install dependencies and activate the virtual environment:
uv sync --dev
source .venv/bin/activate
Download and extract the dataset:
wget -O data.zip "https://www.dropbox.com/scl/fi/vtraf79vfi1x105veaflk/data.zip?rlkey=7yq6d46aer6h45pdihrc9rht1&st=zdac3rqx&dl=0"
unzip data.zip
Expected directory structure:
data/
├── databases/
├── 1_input.json
└── ...
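Before running the pipeline, it can help to confirm that the archive was extracted where expected. The following is a minimal sketch that checks only the two entries named in the tree above. The helper name `missing_data_paths` is hypothetical, and the elided entries (`...`) are deliberately not checked.

```python
from pathlib import Path


def missing_data_paths(root="data"):
    """Return the expected entries that are absent under `root`.

    Only the entries shown in the directory listing are checked.
    """
    root = Path(root)
    expected = [root / "databases", root / "1_input.json"]
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    missing = missing_data_paths()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```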
Create a .env file from the template:
cp .env.example .env
Required:
- OPENAI_API_KEY: Your OpenRouter API key
Optional:
- LIMIT: Number of dataset entries to process (e.g., LIMIT=10)
- START: Starting index in the dataset (default: 0)
- SLM_MODEL: Small language model ID (e.g., openai/gpt-4.1)
- LLM_MODEL: Large language model ID
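To make the role of each variable concrete, here is a hedged sketch of how such settings could be read and validated. It is not taken from the MaskSQL code base: `load_settings` is a hypothetical helper, it reads from the process environment rather than parsing `.env` itself, and treating an unset `LIMIT` as "process everything" is an assumption.

```python
import os


def load_settings(env=os.environ):
    """Read the documented variables from a mapping (default: process environment)."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required")

    limit = env.get("LIMIT")
    return {
        "api_key": api_key,
        # Assumption: no LIMIT means the whole dataset is processed.
        "limit": int(limit) if limit else None,
        # START defaults to 0, as documented above.
        "start": int(env.get("START", "0")),
        # Model IDs are passed through unchanged; None if unset.
        "slm_model": env.get("SLM_MODEL"),
        "llm_model": env.get("LLM_MODEL"),
    }


# Example with an explicit mapping instead of the real environment.
example = load_settings({"OPENAI_API_KEY": "sk-example", "LIMIT": "10"})
```

Passing a plain dictionary, as in the last line, makes the parsing easy to try without touching real credentials.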
MaskSQL requires RESDSQL for initial schema filtering. Follow the RESDSQL setup instructions to generate the required files.
Execute the MaskSQL pipeline:
python3 main.py --resd
MaskSQL saves intermediate results for reuse. To run from scratch:
./clean.sh data
- MaskSQL Framework - Overview of the framework architecture
- Pipeline Stages - Detailed explanation of each pipeline stage
If you use MaskSQL in your research, please cite our paper:
@article{abedini2025masksql,
title={MaskSQL: Safeguarding Privacy for LLM-Based Text-to-SQL via Abstraction},
author={Abedini, Sepideh and Mohapatra, Shubhankar and Emerson, DB and Shafieinejad, Masoumeh and Cresswell, Jesse C and He, Xi},
journal={arXiv preprint arXiv:2509.23459},
year={2025}
}