Welcome to the TabSTAR repository! 👋
You can either use TabSTAR as a package for your own tabular data tasks, or explore the full repository for research purposes, including customized pretraining and replication of paper results.
🚧 The repository is under construction. Found a bug or have a feature request? Please open an issue! 🚧
- Paper: TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations
- Project Website: TabSTAR
Use this mode if you want to fit a pretrained TabSTAR model to your own dataset.
pip install tabstar
Using TabSTAR is as simple as this:
from importlib.resources import files
import pandas as pd
from sklearn.model_selection import train_test_split
from tabstar.tabstar_model import TabSTARClassifier
csv_path = files("tabstar").joinpath("resources", "imdb.csv")
x = pd.read_csv(csv_path)
y = x.pop('Genre_is_Drama')
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
tabstar = TabSTARClassifier()
tabstar.fit(x_train, y_train)
# tabstar.save("my_model_path.pkl")
# tabstar = TabSTARClassifier.load("my_model_path.pkl")
# y_pred = tabstar.predict(x_test)
metric = tabstar.score(X=x_test, y=y_test)
print(f"AUC: {metric:.4f}")
Below is a template you can use to quickly get started with TabSTAR on your own data.
from pandas import DataFrame, Series
from sklearn.model_selection import train_test_split
from tabstar.tabstar_model import TabSTARClassifier, TabSTARRegressor
# --- USER-PROVIDED INPUTS ---
x_train = None # TODO: load your feature DataFrame here
y_train = None # TODO: load your target Series here
is_cls = None # TODO: True for classification, False for regression
x_test = None # TODO Optional: load your test feature DataFrame (or leave as None)
y_test = None # TODO Optional: load your test target Series (or leave as None)
# -----------------------------
# Sanity checks
assert isinstance(x_train, DataFrame), "x_train should be a pandas DataFrame"
assert isinstance(y_train, Series), "y_train should be a pandas Series"
assert isinstance(is_cls, bool), "is_cls should be a boolean indicating classification or regression"
if x_test is None:
    assert y_test is None, "If x_test is None, y_test must also be None"
    x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size=0.1)
assert isinstance(x_test, DataFrame), "x_test should be a pandas DataFrame"
assert isinstance(y_test, Series), "y_test should be a pandas Series"
tabstar_cls = TabSTARClassifier if is_cls else TabSTARRegressor
tabstar = tabstar_cls()
tabstar.fit(x_train, y_train)
# tabstar.save("my_model_path.pkl")
# tabstar = tabstar_cls.load("my_model_path.pkl")
y_pred = tabstar.predict(x_test)
# metric = tabstar.score(X=x_test, y=y_test)
Use this section when you want to pretrain, finetune, or evaluate TabSTAR on benchmarks. It assumes you are actively working on model development, experimenting with different datasets, or comparing against other methods.
After cloning the repo, run:
source init.sh
This will install all necessary dependencies, set up your environment, and download any example data needed to get started.
If you want to evaluate TabSTAR on public datasets belonging to the paper's benchmark:
python tabstar_paper/do_benchmark.py --model=tabstar --dataset_id=<DATASET_ID>
This script can run over the 400 datasets in the paper, for both TabSTAR and the other baselines presented there.
The --dataset_id argument should be one of the dataset values listed in tabstar/datasets/all_datasets.py.
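To benchmark several datasets in a row, the command above can be generated per dataset ID. The following is only a minimal sketch: the helper names `build_benchmark_command` and `benchmark_commands` are hypothetical and not part of the repository, and the script path and flags are copied from the command shown above. It only builds the command strings; running them is left to you.

```python
import shlex

# Script path and flags are taken from the benchmark command in this README.
BENCHMARK_SCRIPT = "tabstar_paper/do_benchmark.py"


def build_benchmark_command(dataset_id, model="tabstar"):
    """Return the argument list for a single benchmark run (hypothetical helper)."""
    return [
        "python",
        BENCHMARK_SCRIPT,
        f"--model={model}",
        f"--dataset_id={dataset_id}",
    ]


def benchmark_commands(dataset_ids, model="tabstar"):
    """Return one shell-ready command string per dataset ID, in input order."""
    return [shlex.join(build_benchmark_command(d, model)) for d in dataset_ids]


if __name__ == "__main__":
    # 46655 is the example dataset ID used elsewhere in this README;
    # replace the list with IDs taken from tabstar/datasets/all_datasets.py.
    for command in benchmark_commands([46655]):
        print(command)
```

Each resulting string can be pasted into a shell, or the argument list from `build_benchmark_command` can be passed to `subprocess.run` from the repository root.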
To pretrain TabSTAR on a specified number of datasets:
python tabstar_paper/do_pretrain.py --n_datasets=256
--n_datasets determines how many datasets to use for pretraining. You can reduce this number for quick debugging, but note this will harm downstream performance.
Once pretraining finishes, note the printed <PRETRAINED_EXP> identifier. Then run:
python tabstar_paper/do_finetune.py --pretrain_exp=<PRETRAINED_EXP> --dataset_id=46655
--dataset_id is the ID of the downstream task you want to evaluate on. Only the 400 datasets in the paper are supported.
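The pretrain-then-finetune workflow can also be written down as a pair of command builders, which makes the dependency between the two steps explicit. This is a sketch under stated assumptions: `pretrain_command` and `finetune_command` are hypothetical helpers, the identifier `my_pretrain_exp` is a made-up stand-in for whatever pretraining prints, and the script paths and flags are copied from the commands above.

```python
import shlex

PLACEHOLDER = "<PRETRAINED_EXP>"  # literal placeholder used in this README


def pretrain_command(n_datasets=256):
    """Argument list for the pretraining step (hypothetical helper)."""
    return ["python", "tabstar_paper/do_pretrain.py", f"--n_datasets={n_datasets}"]


def finetune_command(pretrain_exp, dataset_id):
    """Argument list for finetuning a pretrained experiment on one dataset."""
    if not pretrain_exp or pretrain_exp == PLACEHOLDER:
        # The identifier is only known after pretraining has finished.
        raise ValueError("pass the identifier printed at the end of pretraining")
    return [
        "python",
        "tabstar_paper/do_finetune.py",
        f"--pretrain_exp={pretrain_exp}",
        f"--dataset_id={dataset_id}",
    ]


if __name__ == "__main__":
    # Step 1: pretrain (a smaller n_datasets is only suitable for debugging).
    print(shlex.join(pretrain_command(n_datasets=256)))
    # Step 2: finetune; "my_pretrain_exp" is a made-up identifier for illustration.
    print(shlex.join(finetune_command("my_pretrain_exp", 46655)))
```

Rejecting the unreplaced placeholder is a small guard against launching a finetuning run before the pretraining identifier has been filled in.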
If you use TabSTAR in your research, please cite:
@article{arazi2025tabstarf,
title = {TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations},
author = {Alan Arazi and Eilam Shapira and Roi Reichart},
journal = {arXiv preprint arXiv:2505.18125},
year = {2025},
}
License: MIT

