
MaskSQL

MaskSQL is a privacy-preserving framework for LLM-based text-to-SQL that uses schema masking and progressive unmasking to protect sensitive database information while maintaining high query accuracy.

Installation and Setup Instructions

Prerequisites

  • Python 3.11
  • uv package manager

Setup

Install dependencies and activate the virtual environment:

uv sync --dev
source .venv/bin/activate

Download Dataset

Download and extract the dataset:

wget -O data.zip "https://www.dropbox.com/scl/fi/vtraf79vfi1x105veaflk/data.zip?rlkey=7yq6d46aer6h45pdihrc9rht1&st=zdac3rqx&dl=0"
unzip data.zip

Expected directory structure:

data/
├── databases/
├── 1_input.json
└── ...
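
Before running the pipeline, it can help to confirm that the archive extracted where expected. The following is a minimal sketch, not part of MaskSQL: the helper name `missing_dataset_paths` is hypothetical, and it checks only the two entries named in the tree above (`databases/` and `1_input.json`), since the remaining contents are not listed.

```python
from pathlib import Path


def missing_dataset_paths(root="data"):
    """Return the expected dataset paths that are absent under `root`.

    Hypothetical helper: only the two entries shown in the README tree
    are checked; other files in the archive are not known here.
    """
    root = Path(root)
    expected = [root / "databases", root / "1_input.json"]
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    missing = missing_dataset_paths()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

Run it from the repository root after `unzip data.zip`; an empty result means both listed entries are present.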

Configure Environment

Create a .env file from the template:

cp .env.example .env

Required:

Optional:

  • LIMIT: Number of dataset entries to process (e.g., LIMIT=10)
  • START: Starting index in the dataset (default: 0)
  • SLM_MODEL: Small language model ID (e.g., openai/gpt-4.1)
  • LLM_MODEL: Large language model ID
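
To illustrate how `START` and `LIMIT` could combine, here is a minimal sketch under stated assumptions. The function `select_entries` is hypothetical and is not taken from the MaskSQL code base. It assumes `START` is a zero-based offset (default 0, as listed above), `LIMIT` is a maximum count, and an unset `LIMIT` means "process everything from `START` onward". The actual pipeline may interpret these variables differently.

```python
import os


def select_entries(entries, env=None):
    """Return the slice of `entries` selected by START and LIMIT.

    Assumed semantics (not confirmed by the MaskSQL source):
    START is a zero-based offset, LIMIT is a maximum count,
    and a missing or empty LIMIT selects all remaining entries.
    """
    env = os.environ if env is None else env
    start = int(env.get("START", "0"))
    limit = env.get("LIMIT")
    if limit is None or limit == "":
        return entries[start:]
    return entries[start:start + int(limit)]


if __name__ == "__main__":
    sample = list(range(20))
    print(select_entries(sample, {"START": "5", "LIMIT": "3"}))
```

Under these assumptions, `START=5` with `LIMIT=3` would select the entries at indices 5, 6, and 7.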

Running MaskSQL

1. Run RESDSQL (Schema Filtering)

MaskSQL requires RESDSQL for initial schema filtering. Follow the RESDSQL setup instructions to generate the required files.

2. Run the Pipeline

Execute the MaskSQL pipeline:

python3 main.py --resd

3. Clean Intermediate Results (Optional)

MaskSQL saves intermediate results so that later runs can reuse them. To discard them and run the pipeline from scratch:

./clean.sh data

Citation

If you use MaskSQL in your research, please cite our paper:

@article{abedini2025masksql,
  title={MaskSQL: Safeguarding Privacy for LLM-Based Text-to-SQL via Abstraction},
  author={Abedini, Sepideh and Mohapatra, Shubhankar and Emerson, DB and Shafieinejad, Masoumeh and Cresswell, Jesse C and He, Xi},
  journal={arXiv preprint arXiv:2509.23459},
  year={2025}
}

Paper: https://arxiv.org/abs/2509.23459
