Final project in the 'Tabular Data Science' course by Dr. Amit Somech at Bar-Ilan University.
There are two notebooks in the `notebooks/` directory. The notebook `eda_example.ipynb` is an example of how to use our `ConceptDriftsFinder` tool as part of the EDA process. The notebook `automatic_feature_engineeirng.ipynb` uses a simple heuristic to automatically apply our `ConceptDriftsFinder` tool to feature engineering.
We test it on four different datasets, described below and in the accompanying PDF.
In the Exploratory Data Analysis (EDA) process, `ConceptDriftsFinder` can be used to automatically find concept drifts. The `find_concept_drifts` function receives a list of transactions and returns a list of `ConceptDriftResult` objects.
Let's say we suspect that the column `OverallQual` induces a concept drift. We can run:

```python
ConceptDriftsFinder().find_concept_drifts(transactions, concept_column="OverallQual", target_column="SalePrice")
```
Here is a sample of the output in a table format:
| | left_hand_side | right_hand_side | confidence_before | confidence_after | support_before | support_after | lift_before | lift_after | concept_cutoff | concept_column |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'BldgType': '1Fam'} | {'SalePrice': 1} | 1 | 0.185229 | 1 | 0.155359 | 1 | 0.941887 | 2.8 | OverallQual |
| 1 | {'FullBath': 1} | {'SalePrice': 1} | 1 | 0.374718 | 0.6 | 0.163225 | 1 | 1.90544 | 2.8 | OverallQual |
| 2 | {'GrLivArea': 1} | {'SalePrice': 1} | 1 | 0.53 | 1 | 0.104228 | 1 | 2.69505 | 2.8 | OverallQual |
| 3 | {'YearBuilt': 1} | {'SalePrice': 1} | 1 | 0.509615 | 0.8 | 0.104228 | 1 | 2.59139 | 2.8 | OverallQual |
We can see that if `OverallQual < 2.8`, then `BldgType = 1Fam` becomes a more important indication of `SalePrice`. This helps us better understand our dataset.
See the section below about preprocessing your data before using `ConceptDriftsFinder`.
The number of drifts found can be controlled with the following parameters: `min_confidence: float`, `min_support: float`, and `diff_threshold: float` (a threshold on the difference between lift values).
A `pandas.DataFrame` object can be converted to transactions using the helper function `convert_df_to_transactions`.
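For example, here is a minimal end-to-end sketch. It assumes `convert_df_to_transactions` takes the preprocessed DataFrame and that the thresholds are keyword arguments of `find_concept_drifts`; the exact signatures and import path may differ:

```python
import pandas as pd

# Hypothetical import path; use the module where the tool actually lives.
from concept_drifts import ConceptDriftsFinder, convert_df_to_transactions

df = pd.read_csv("house_prices.csv")  # hypothetical input file

# Convert the (already preprocessed) DataFrame into transactions.
transactions = convert_df_to_transactions(df)

# Raising the thresholds yields fewer, stronger drifts. Where exactly these
# keyword arguments are passed is an assumption.
drifts = ConceptDriftsFinder().find_concept_drifts(
    transactions,
    concept_column="OverallQual",
    target_column="SalePrice",
    min_confidence=0.8,
    min_support=0.1,
    diff_threshold=0.5,
)
for drift in drifts:  # each item is a ConceptDriftResult
    print(drift)
```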
When working with association rules, we can't use numerical values, only categorical or ordinal ones. `preprocessing.py` contains code to convert numerical values to ordinals, as well as data-cleaning code such as filling N/A values with column averages or dropping them completely.
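For illustration, this kind of conversion can be sketched with plain pandas (this is not the actual code in `preprocessing.py`, only the general idea):

```python
import pandas as pd

def to_ordinal(df: pd.DataFrame, column: str, bins: int = 5) -> pd.DataFrame:
    """Replace a numerical column with ordinal bin labels (0..bins-1)."""
    out = df.copy()
    # Fill missing values with the column average before binning.
    out[column] = out[column].fillna(out[column].mean())
    # Quantile-based binning keeps the bins roughly equally populated.
    out[column] = pd.qcut(out[column], q=bins, labels=False, duplicates="drop")
    return out
```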
`CutoffValuesFinder` is an internal class and shouldn't be used directly. It classifies each concept as discrete or continuous and, based on that, decides which concept values `ConceptDriftsFinder` should try as cutoffs.
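A rough sketch of this kind of logic (the real class is internal, so the names and thresholds here are illustrative):

```python
import pandas as pd

def candidate_cutoffs(values: pd.Series, max_discrete: int = 10) -> list:
    """Illustrative only: choose cutoff candidates for a concept column."""
    unique_values = values.dropna().unique()
    if len(unique_values) <= max_discrete:
        # Discrete concept: try every observed value as a cutoff.
        return sorted(unique_values)
    # Continuous concept: try a few quantiles instead of every value.
    return values.quantile([0.2, 0.4, 0.6, 0.8]).tolist()
```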
To make it easy to start using this library for feature engineering for machine learning models, we created `ConceptEngineering`. The idea is to automatically take the found concept drifts into account by changing the weights of the features in the dataset.
Let's look at the following row from our dataset (shortened for readability):
| Id | 1101 |
|---|---|
| YearBuilt | 1 |
| FullBath | 1 |
| OverallQual | 2 |
| BldgType_1Fam | 1 |
| BldgType_2fmCon | 0 |
| BldgType_Duplex | 0 |
If we continue with our example above, we know that if `OverallQual < 2.8`, then `BldgType = 1Fam` becomes a more important indication of `SalePrice = 1`. We can help the model use this information by increasing the weight of `BldgType_1Fam` whenever `OverallQual < 2.8`, which can be especially helpful when using models such as `LogisticRegression` that have a single weight per feature.
We can run:
```python
# Preprocess the dataset and split it into features and target.
df_prep, train_params = preprocess_dataset(df)
X, y = split_X_y(df_prep, columns, train_params, one_hot_columns, target_column)

# Find association rules and reweight the features accordingly.
concept_engineering = ConceptEngineering()
X = concept_engineering.fit_transform(X, df, target_column, one_hot_columns)
```
Now our new `X` DataFrame will have reweighted values based on all of the found rules, for example:
| Id | 1101 |
|---|---|
| YearBuilt | 1.37681 |
| FullBath | 1.2922 |
| OverallQual | 2 |
| BldgType_1Fam | 0.94262 |
| BldgType_2fmCon | 0 |
| BldgType_Duplex | 0 |
To calculate the change to a value, we use the difference between the rule's lift values before and after the concept cutoff.
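As a rough sketch of the idea (the exact scaling inside `ConceptEngineering` is internal, and its normalization may differ):

```python
# Illustrative only: nudge a matching feature's value in proportion to the
# rule's lift difference. ConceptEngineering's actual formula is internal
# and may normalize the difference differently.
def reweight(value: float, lift_before: float, lift_after: float,
             damping: float = 0.25) -> float:
    # A rule whose lift grows across the cutoff pushes the value up;
    # a shrinking lift pushes it down.
    return value * (1 + damping * (lift_after - lift_before))
```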
The one-vs-rest logistic regression model is also an internal class and shouldn't be used directly. It is a simple scikit-learn `LogisticRegression` implementation with one-vs-rest (OvR) classification, i.e., one model for each label.
The only difference in this implementation (compared to using `multi_class="ovr"`) is that we can inject different dataset changes for each label's classifier (called `label_to_transformation`). This is necessary because concept rules are per label, and we want to activate a label's rules only when classifying that label.
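A minimal sketch of such a wrapper (everything here except the `label_to_transformation` idea is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class PerLabelOvR:
    """Illustrative one-vs-rest wrapper: one binary LogisticRegression per
    label, with an optional per-label transformation applied to X first."""

    def __init__(self, label_to_transformation=None):
        self.label_to_transformation = label_to_transformation or {}
        self.models = {}

    def fit(self, X, y):
        for label in np.unique(y):
            transform = self.label_to_transformation.get(label, lambda X_: X_)
            model = LogisticRegression()
            # Binary problem: this label vs. the rest, on transformed features.
            model.fit(transform(X), (y == label).astype(int))
            self.models[label] = model
        return self

    def predict(self, X):
        labels = list(self.models)
        # Score each label with its own transformation, then pick the best.
        scores = np.column_stack([
            self.models[label].predict_proba(
                self.label_to_transformation.get(label, lambda X_: X_)(X)
            )[:, 1]
            for label in labels
        ])
        return np.array(labels)[np.argmax(scores, axis=1)]
```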
To test our library, we used four datasets: