Foundry is a package for forging interpretable predictive modeling pipelines with a sklearn-style API. It includes:

- A `Glm` class with a PyTorch backend. This class is highly extensible, supporting (almost) any distribution in PyTorch's `distributions` module.
- A `preprocessing` module that includes helpful classes like `DataFrameTransformer` and `InteractionFeatures`.
- An `evaluation` module with tools for interpreting any sklearn-API model via `MarginalEffects`.
You should use Foundry to augment your workflows if any of the following are true:
- You are attempting to model a target that is 'weird': for example, highly skewed data, binomial count-data, censored or truncated data, etc.
- You need some help battling annoying aspects of feature-engineering: for example, you want an expressive way of specifying interaction-terms in your model; or perhaps you just want consistent support for getting feature-names despite being stuck on Python 3.7.
- You want to interpret your model: for example, perform statistical inference on its parameters, or understand the direction and functional-form of its predictors.
`foundry` can be installed with pip:

```bash
pip install git+https://github.com/strongio/foundry.git#egg=foundry
```
Let's walk through a quick example:

```python
# data:
from foundry.data import get_click_data

# preprocessing:
from foundry.preprocessing import DataFrameTransformer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PowerTransformer
from sklearn.pipeline import make_pipeline

# glm:
from foundry.glm import Glm

# evaluation:
from foundry.evaluation import MarginalEffects
```
Here's a dataset of user pageviews and clicks for a domain with lots of pages:
```python
df_train, df_val = get_click_data()
df_train
```
| | attributed_source | user_agent_platform | page_id | page_market | page_feat1 | page_feat2 | page_feat3 | num_clicks | num_views |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | Windows | 7 | b | 0.0 | 0.0 | 35.0 | 0.0 | 32.0 |
| 1 | 8 | Windows | 7 | b | 0.0 | 1.0 | 0.0 | 0.0 | 14.0 |
| 2 | 8 | Windows | 7 | a | 0.0 | 0.0 | 5.0 | 0.0 | 8.0 |
| 3 | 8 | Windows | 7 | a | 0.0 | 0.0 | 9.0 | 0.0 | 7.0 |
| 4 | 8 | Windows | 7 | a | 0.0 | 0.0 | 20.0 | 0.0 | 40.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 423188 | 1 | Android | 95 | f | 0.0 | 0.0 | 25.0 | 0.0 | 1.0 |
| 423189 | 10 | Android | 26 | a | 0.0 | 2.0 | 7.0 | 15.0 | 860.0 |
| 423190 | 10 | Android | 32 | a | 0.0 | 0.0 | 36.0 | 37.0 | 651.0 |
| 423191 | 0 | Other | 10 | b | 0.0 | 0.0 | 26.0 | 0.0 | 1.0 |
| 423192 | 0 | Other | 31 | a | 0.0 | 1.0 | 34.0 | 0.0 | 1.0 |

423193 rows × 9 columns
We'd like to build a model that lets us predict future click-rates for different pages (`page_id`), page-attributes (e.g. market), and user-attributes (e.g. platform), and also learn about each of these features -- e.g. perform statistical inference on model-coefficients ("are users with missing user-agent data significantly worse than average?").

Unfortunately, these data don't fit neatly into the typical regression/classification divide: each observation captures counts of clicks and counts of pageviews. Our target is the click-rate (clicks/views) and our sample-weight is the pageviews.

One workaround would be to expand our dataset so that each row indicates `is_click` (True/False) -- then we could use a standard classification algorithm:
```python
df_train_expanded, df_val_expanded = get_click_data(expanded=True)
df_train_expanded
```
| | attributed_source | user_agent_platform | page_id | page_market | page_feat1 | page_feat2 | page_feat3 | is_click |
|---|---|---|---|---|---|---|---|---|
| 0 | 8 | Windows | 67 | b | 0.0 | 1.0 | 0.0 | False |
| 1 | 8 | Windows | 67 | b | 0.0 | 1.0 | 0.0 | False |
| 2 | 8 | Windows | 67 | b | 0.0 | 1.0 | 0.0 | False |
| 3 | 8 | Windows | 67 | b | 0.0 | 1.0 | 0.0 | False |
| 4 | 8 | Windows | 67 | b | 0.0 | 1.0 | 0.0 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7760666 | 7 | OSX | 61 | c | 3.0 | 1.0 | 12.0 | False |
| 7760667 | 7 | OSX | 61 | c | 3.0 | 1.0 | 12.0 | False |
| 7760668 | 7 | OSX | 61 | c | 3.0 | 1.0 | 12.0 | False |
| 7760669 | 7 | OSX | 61 | c | 3.0 | 1.0 | 12.0 | False |
| 7760670 | 7 | OSX | 61 | c | 3.0 | 1.0 | 12.0 | False |

7760671 rows × 8 columns
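For reference, the kind of expansion that `get_click_data(expanded=True)` performs can be sketched in plain pandas (a minimal sketch with toy data; the real helper's implementation may differ):

```python
import pandas as pd

# Toy aggregated data: two pages with click and view counts.
df = pd.DataFrame({
    "page_id": [1, 2],
    "num_clicks": [1, 0],
    "num_views": [3, 2],
})

# Repeat each aggregated row once per pageview...
expanded = df.loc[df.index.repeat(df["num_views"])].copy()

# ...then flag the first `num_clicks` repeats of each row as clicks:
expanded["is_click"] = expanded.groupby(level=0).cumcount() < expanded["num_clicks"]
expanded = expanded.drop(columns=["num_clicks", "num_views"]).reset_index(drop=True)
```

The expanded frame has one row per pageview (5 here), with `is_click` True for exactly `num_clicks` of each page's rows.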
But this is hugely inefficient: our dataset of ~400K rows explodes to almost 8MM.
Within `foundry`, we have the `Glm`, which supports binomial data directly:

```python
Glm('binomial', penalty=10_000)
```

```
Glm(family='binomial', penalty=10000)
```
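Fitting on the aggregated counts is statistically equivalent to fitting on the expanded rows: for any candidate click probability, the binomial log-likelihood of an aggregated row differs from the summed Bernoulli log-likelihoods of its expanded rows only by a combinatorial constant that doesn't depend on the probability, so both yield the same coefficient estimates. A quick numeric check with toy numbers (standard library only):

```python
from math import comb, log

p = 0.3            # candidate click probability
clicks, views = 4, 10

# Log-likelihood of one aggregated (clicks-out-of-views) binomial row:
ll_binomial = log(comb(views, clicks)) + clicks * log(p) + (views - clicks) * log(1 - p)

# Summed log-likelihood of the same data expanded into 10 Bernoulli rows:
ll_bernoulli = clicks * log(p) + (views - clicks) * log(1 - p)

# The gap is log(C(10, 4)): a constant with no dependence on p,
# so it cannot change which p (or which coefficients) maximize the likelihood.
gap = ll_binomial - ll_bernoulli
```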
Let's set up a sklearn model pipeline using this Glm. We'll use `foundry`'s `DataFrameTransformer` to support passing feature-names to the `Glm` (newer versions of sklearn support this via the `set_output()` API).
```python
preproc = DataFrameTransformer([
    (
        'one_hot',
        make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder()),
        ['attributed_source', 'user_agent_platform', 'page_id', 'page_market']
    ),
    (
        'power',
        PowerTransformer(),
        ['page_feat1', 'page_feat2', 'page_feat3']
    )
])
```
```python
glm = make_pipeline(
    preproc,
    Glm('binomial', penalty=1_000)
).fit(
    X=df_train,
    y={
        'value': df_train['num_clicks'],
        'total_count': df_train['num_views']
    },
)
```
```
Epoch 8; Loss 0.3183; Convergence 0.0003131/0.001: 42%|█████▊ | 5/12 [00:00<00:00, 10.99it/s]
Estimating laplace coefs... (you can safely keyboard-interrupt to cancel)
Epoch 8; Loss 0.3183; Convergence 0.0003131/0.001: 42%|█████▊ | 5/12 [00:07<00:10, 1.55s/it]
```
By default, the `Glm` will estimate not just the parameters of our model, but also the uncertainty associated with them. We can access a dataframe of these with the `coef_dataframe_` attribute:
```python
df_coefs = glm[-1].coef_dataframe_
df_coefs
```
| | name | estimate | se |
|---|---|---|---|
| 0 | probs__one_hot__attributed_source_0 | 0.000042 | 0.031622 |
| 1 | probs__one_hot__attributed_source_1 | -0.003277 | 0.031578 |
| 2 | probs__one_hot__attributed_source_2 | -0.058870 | 0.030623 |
| 3 | probs__one_hot__attributed_source_3 | -0.485669 | 0.024011 |
| 4 | probs__one_hot__attributed_source_4 | -0.663989 | 0.016975 |
| ... | ... | ... | ... |
| 141 | probs__one_hot__page_market_z | 0.353556 | 0.025317 |
| 142 | probs__power__page_feat1 | 0.213486 | 0.002241 |
| 143 | probs__power__page_feat2 | 0.724601 | 0.004021 |
| 144 | probs__power__page_feat3 | 0.913425 | 0.004974 |
| 145 | probs__bias | -5.166077 | 0.022824 |

146 rows × 3 columns
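Since `coef_dataframe_` carries both point estimates and standard errors, rough 95% Wald confidence intervals are a couple of lines of pandas away (the rows below are illustrative stand-ins mimicking that structure, not real model output):

```python
import pandas as pd

# A few coefficients mimicking the structure of `coef_dataframe_`:
df_coefs = pd.DataFrame({
    "name": ["probs__power__page_feat3", "probs__bias"],
    "estimate": [0.913425, -5.166077],
    "se": [0.004974, 0.022824],
})

# 95% Wald intervals from the point estimate and standard error:
df_coefs["lower"] = df_coefs["estimate"] - 1.96 * df_coefs["se"]
df_coefs["upper"] = df_coefs["estimate"] + 1.96 * df_coefs["se"]

# A coefficient is "significant" at roughly the 5% level if its
# interval excludes zero:
df_coefs["significant"] = (df_coefs["lower"] > 0) | (df_coefs["upper"] < 0)
```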
Using this, it's easy to plot our model-coefficients:

```python
df_coefs[['param', 'trans', 'term']] = df_coefs['name'].str.split('__', n=3, expand=True)
df_coefs[df_coefs['name'].str.contains('page_feat')].plot('term', 'estimate', kind='bar', yerr='se')
df_coefs[df_coefs['name'].str.contains('user_agent_platform')].plot('term', 'estimate', kind='bar', yerr='se')
```
```
<AxesSubplot:xlabel='term'>
```
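A question like "are users with missing user-agent data significantly worse than average?" can be answered directly from an estimate and its standard error via a normal approximation (the numbers here are made up for illustration):

```python
import math

# Hypothetical coefficient for a "missing user-agent" indicator:
estimate, se = -0.12, 0.03

# Two-sided Wald test against zero, using the normal CDF via erf:
z = estimate / se
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

With |z| = 4 the p-value is far below 0.05, so such a coefficient would be judged significantly different from zero.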
Model-coefficients are limited because they only give us a single number, and for non-linear models (like our binomial GLM) this doesn't tell the whole story. For example, how could we translate the importance of `page_feat3` into understandable terms? This only gets more difficult if our model includes interaction-terms.
To aid in this, there is `MarginalEffects`, a tool for plotting our model-predictions as a function of each predictor:
```python
glm_me = MarginalEffects(glm)
glm_me.fit(
    X=df_val_expanded,
    y=df_val_expanded['is_click'],
    vary_features=['page_feat3']
).plot()
```
```
<ggplot: (8777751556441)>
```
Here we see how this predictor's impact on click-rates varies due to floor effects.
As a bonus, we plotted the actual values alongside the predictions, and we can see potential room for improvement in our model: it looks like very high values of this predictor have especially high click-rates, so an extra step in feature-engineering that captures this discontinuity may be warranted.
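The floor effect falls out of the model's link function, and the core idea behind a marginal-effects plot can be sketched by sweeping one feature over a grid while scoring with the model; here a hand-rolled logistic curve stands in for the fitted pipeline (the coefficients are made up for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in for a fitted model's predicted click probability:
def predict_rate(df):
    return 1 / (1 + np.exp(-(-2.0 + 0.8 * df["page_feat3"])))

# Sweep the feature of interest over a grid (in a real marginal-effects
# computation, the other features would be held at representative values):
grid = pd.DataFrame({"page_feat3": np.linspace(0, 10, 50)})
grid["pred_rate"] = predict_rate(grid)
```

Because predictions pass through a sigmoid, the curve flattens near 0 and 1: the same coefficient implies a large effect mid-range and a small one near the floor or ceiling, which is exactly what the coefficient alone could not show.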