TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations

cmarkello/TabSTAR

 
 



Welcome to the TabSTAR repository! 👋
You can either use TabSTAR as a package for your own tabular data tasks, or explore the full repository for research purposes, including customized pretraining and replication of paper results.

🚧 The repository is under construction. Found a bug or have a feature request? Please open an issue! 🚧




Package Mode

Use this mode if you want to fit a pretrained TabSTAR model to your own dataset.

Installation

pip install tabstar

Inference Example

Using TabSTAR is as simple as this:

from importlib.resources import files
import pandas as pd
from sklearn.model_selection import train_test_split

from tabstar.tabstar_model import TabSTARClassifier

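# Load the IMDB sample dataset bundled with the package
csv_path = files("tabstar").joinpath("resources", "imdb.csv")
x = pd.read_csv(csv_path)
# Separate the target column from the features
y = x.pop('Genre_is_Drama')
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
# Fit a TabSTAR classifier on the training split
tabstar = TabSTARClassifier()
tabstar.fit(x_train, y_train)
# Optional: save and reload the fitted model, or get predictions directly
# tabstar.save("my_model_path.pkl")
# tabstar = TabSTARClassifier.load("my_model_path.pkl")
# y_pred = tabstar.predict(x_test)
# Evaluate on the held-out split
metric = tabstar.score(X=x_test, y=y_test)
print(f"AUC: {metric:.4f}")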

Below is a template for getting started quickly with TabSTAR on your own data.

from pandas import DataFrame, Series
from sklearn.model_selection import train_test_split

from tabstar.tabstar_model import TabSTARClassifier, TabSTARRegressor

# --- USER-PROVIDED INPUTS ---
x_train = None  # TODO: load your feature DataFrame here
y_train = None  # TODO: load your target Series here
is_cls = None   # TODO: True for classification, False for regression
x_test = None   # TODO Optional: load your test feature DataFrame (or leave as None)
y_test = None   # TODO Optional: load your test target Series (or leave as None)
# -----------------------------

# Sanity checks
assert isinstance(x_train, DataFrame), "x_train should be a pandas DataFrame"
assert isinstance(y_train, Series), "y_train should be a pandas Series"
assert isinstance(is_cls, bool), "is_cls should be a boolean indicating classification or regression"

if x_test is None:
    assert y_test is None, "If x_test is None, y_test must also be None"
    x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size=0.1)

assert isinstance(x_test, DataFrame), "x_test should be a pandas DataFrame"
assert isinstance(y_test, Series), "y_test should be a pandas Series"

tabstar_cls = TabSTARClassifier if is_cls else TabSTARRegressor
tabstar = tabstar_cls()
tabstar.fit(x_train, y_train)
# tabstar.save("my_model_path.pkl")
# tabstar = tabstar_cls.load("my_model_path.pkl")  # load with the same class used for fitting
y_pred = tabstar.predict(x_test)
# metric = tabstar.score(X=x_test, y=y_test)
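
The template expects `x_train` as a pandas DataFrame and `y_train` as a pandas Series. A minimal sketch of one way to produce them from a CSV follows. The helper name `split_features_target`, the toy columns, and the inline CSV text are hypothetical and not part of the TabSTAR package.

```python
import io

import pandas as pd


def split_features_target(df: pd.DataFrame, target_col: str):
    """Return (features DataFrame, target Series) without mutating df."""
    if target_col not in df.columns:
        raise KeyError(f"Target column {target_col!r} not found")
    x = df.drop(columns=[target_col])
    y = df[target_col]
    return x, y


# Toy data standing in for your own file (hypothetical columns).
csv_text = "title,year,is_drama\nA,1999,1\nB,2005,0\nC,2010,1\n"
df = pd.read_csv(io.StringIO(csv_text))  # for a real file: pd.read_csv("your_data.csv")

x_train, y_train = split_features_target(df, "is_drama")
is_cls = True  # set explicitly: the template does not infer the task type
```

The resulting `x_train`, `y_train`, and `is_cls` satisfy the template's sanity checks and can replace the `None` placeholders.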

Research Mode

Use this mode when you want to pretrain, finetune, or evaluate TabSTAR on benchmarks. It assumes you are actively working on model development, experimenting with different datasets, or comparing against other methods.

Installation

After cloning the repo, run:

source init.sh

This will install all necessary dependencies, set up your environment, and download any example data needed to get started.

Benchmark Evaluation

To evaluate TabSTAR on public datasets from the paper's benchmark, run:

python tabstar_paper/do_benchmark.py --model=tabstar --dataset_id=<DATASET_ID>

This script covers the 400 datasets in the paper, both for TabSTAR and for the other baselines presented there. The --dataset_id argument should be one of the dataset values listed in tabstar/datasets/all_datasets.py.
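
To run the benchmark over several datasets in sequence, the command can be wrapped in a small driver. The sketch below uses only the two flags shown above. The helper name `benchmark_command` is hypothetical, and the ID list holds only the single ID that appears in this README (46655), so replace it with values from tabstar/datasets/all_datasets.py. The `subprocess.run` call is commented out so the snippet can be inspected without launching a run.

```python
import shlex
import subprocess


def benchmark_command(dataset_id: int, model: str = "tabstar") -> list[str]:
    """Build the do_benchmark.py invocation for one dataset."""
    return [
        "python",
        "tabstar_paper/do_benchmark.py",
        f"--model={model}",
        f"--dataset_id={dataset_id}",
    ]


# Placeholder list: fill in IDs taken from tabstar/datasets/all_datasets.py.
dataset_ids = [46655]

for dataset_id in dataset_ids:
    cmd = benchmark_command(dataset_id)
    print(shlex.join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment when running inside the cloned repo
```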

Pretraining

To pretrain TabSTAR on a specified number of datasets:

python tabstar_paper/do_pretrain.py --n_datasets=256

--n_datasets determines how many datasets to use for pretraining. You can reduce this number for quick debugging, but note this will harm downstream performance.

Finetuning

Once pretraining finishes, note the printed <PRETRAINED_EXP> identifier. Then run:

python tabstar_paper/do_finetune.py --pretrain_exp=<PRETRAINED_EXP> --dataset_id=46655

--dataset_id is the ID of the downstream task you want to evaluate on. Only the 400 datasets in the paper are supported.
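
The pretraining and finetuning commands can be assembled in a short script once the identifier is known. This is a sketch under assumptions: the helper names are hypothetical, the format of `<PRETRAINED_EXP>` is not documented here so it is passed through as an opaque string, and `"my_pretrain_exp"` is a made-up stand-in for the value printed by your own pretraining run.

```python
import shlex

PLACEHOLDER = "<PRETRAINED_EXP>"


def pretrain_command(n_datasets: int = 256) -> list[str]:
    """Build the do_pretrain.py invocation."""
    if n_datasets < 1:
        raise ValueError("n_datasets must be a positive integer")
    return ["python", "tabstar_paper/do_pretrain.py", f"--n_datasets={n_datasets}"]


def finetune_command(pretrain_exp: str, dataset_id: int) -> list[str]:
    """Build the do_finetune.py invocation, rejecting an unfilled placeholder."""
    if not pretrain_exp or pretrain_exp == PLACEHOLDER:
        raise ValueError("Replace <PRETRAINED_EXP> with the identifier printed by pretraining")
    return [
        "python",
        "tabstar_paper/do_finetune.py",
        f"--pretrain_exp={pretrain_exp}",
        f"--dataset_id={dataset_id}",
    ]


print(shlex.join(pretrain_command()))
print(shlex.join(finetune_command("my_pretrain_exp", 46655)))
```

Rejecting the literal placeholder guards against pasting the command from this README without substituting the real identifier.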

Citation

If you use TabSTAR in your research, please cite:

@article{arazi2025tabstarf,
  title   = {TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations},
  author  = {Alan Arazi and Eilam Shapira and Roi Reichart},
  journal = {arXiv preprint arXiv:2505.18125},
  year    = {2025},
}

License

MIT

