Notes for time-shifting data #15

Open
emilycantrell opened this issue May 1, 2024 · 7 comments


emilycantrell commented May 1, 2024

  • Make sure to add a year that indicates the actual year the data came from. Update: since we are only time-shifting from one time period right now (2018-2020), I created a binary indicator of whether the data is time-shifted.
  • Make sure year variables are shifted correctly.
    • So far I applied shifting to the following age/year variables: birthyear_bg, age_bg, cf20m029. If we add other year-related variables to our model, we can adjust them on a case-by-case basis (see the sketch below for the general pattern).
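A minimal sketch of this kind of adjustment, assuming the data lives in a data frame with the columns named above; the function name and the size of the offset are placeholders, not the repo's actual code:

```r
# Sketch only: which variables are calendar years (shift them) vs. ages
# (recompute or leave as-is), and the size of the offset, are assumptions
# to verify against the codebook.
library(dplyr)

time_shift <- function(df, shift_years) {
  df %>%
    mutate(
      birthyear_bg = birthyear_bg + shift_years,  # calendar-year variable
      cf20m029     = cf20m029 + shift_years,      # another year-valued variable
      # age_bg is relative to the survey year; depending on the convention,
      # it may stay as-is or need recomputing from the shifted birth year
      time_shifted = 1L                           # binary indicator for shifted rows
    )
}

# Non-shifted (2018-2020) rows would get time_shifted = 0L.
```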

emilycantrell commented May 3, 2024

  • Make sure to restrict the data to only people of the right age
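A one-line filter along these lines would do it; the 18-45 bounds here are placeholders, not the range actually used:

```r
library(dplyr)
# Hypothetical age bounds; use whatever range the outcome definition requires.
data <- filter(data, between(age_bg, 18, 45))
```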


emilycantrell commented May 3, 2024

  • Make sure to remove any references to specific ID numbers that are manually written in the outcome calculation code before posting to the repo

emilycantrell commented:

  • Consider whether to do anything special for features that ask about specific political figures, e.g., "What do you think of Jan Peter Balkenende?" (a question about a specific person is usually asked in only a small number of waves)


emilycantrell commented May 8, 2024

  • Adjust time-shifted income or money-related variables for inflation.
    • Update: I applied an inflation adjustment to nettohh_f_2020 since we are using it in our model. We can apply the adjustment to any other money-related variables we add to our model on a case-by-case basis (see the sketch below).
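A sketch of a simple CPI-ratio adjustment; the index values and years below are placeholders, not the figures actually used:

```r
# Placeholder CPI index values; substitute the real Dutch CPI for the years involved.
cpi <- c("2017" = 98.6, "2020" = 105.4)

adjust_for_inflation <- function(amount, from_year, to_year = "2020") {
  amount * cpi[[to_year]] / cpi[[from_year]]
}

# e.g., express a time-shifted household income in 2020 euros:
# df$nettohh_f_2020 <- adjust_for_inflation(df$nettohh_f_2020, "2017")
```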

HanzhangRen commented:

There is an additional feature among the ones I handpicked that requires special attention: cf20m026 is the partner's birth year and should be time-shifted in the same way as respondent birth years and cohabitation start years. I implemented a piece of code that does that.
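Presumably along the lines of the earlier sketch (same assumed offset, not the actual implementation):

```r
library(dplyr)
# cf20m026 (partner's birth year) moves with the shift, like birthyear_bg
df <- mutate(df, cf20m026 = cf20m026 + shift_years)
```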


HanzhangRen commented May 13, 2024

@emilycantrell TL;DR: After including time-shifted data, the maximum cross-validation F1 score increased only slightly, from 0.769 to 0.773.

I compared the cross-validation results of the current version of the code to those from the following two files (R scripts converted to .txt):

training_no_shift.txt
submission_no_shift.txt

The differences between the "no shift" code and our current code are as follows:

  1. The "no shift" code does not have the time-shifted data.
  2. The "no shift" code has fewer predictors, because I felt that, given the data limitations, it was best to combine or delete some categories in some categorical variables.
  3. The hyperparameter grids are slightly different because I constructed each of them somewhat independently in a largely intuition-based, iterative process, but I don't think this changes things by much.

Here are the top pipelines from the "no shift" code:
[screenshot]

Here are the top pipelines from the current code:
[screenshot]

As you can see, after including time-shifted data, the maximum cross-validation F1 score increased only slightly, from 0.769 to 0.773, and there is significant uncertainty surrounding both estimates.

In fact, I am pleasantly surprised by these results, because I half expected the F1 scores to go down significantly now that we are training with out-of-sample data. It is quite impressive that even though our training data is now less representative of our test data in many ways, we still achieve similar, if not better, predictions.

Perhaps this means that our new model is more robust to moving from one dataset to another, and we will see better performance on the holdout set even though the time shift does not show much improvement in cross-validation on the training set.

Another interesting observation is that with time-shifted data, the winning pipeline seems to involve only tree stumps, i.e., trees with a single split and a depth of 1. This is surprising to me because stumps cannot capture interactions between variables (though multiple stumps can capture nonlinearities by making splits in different places). I changed the seed from 0 to 1 and then to 2 and ran the code again, and stumps still won for both seeds. This makes me a little concerned that something is wrong in the code, but perhaps our dataset is simply too small for interaction effects to be useful.
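One way to check the stump claim directly, assuming the fitted booster is accessible as an xgb.Booster object (here called bst, a placeholder name):

```r
library(xgboost)
library(data.table)

# One row per node across all boosted trees in the ensemble.
trees <- xgb.model.dt.tree(model = bst)

# A depth-1 tree ("stump") has exactly 3 nodes: 1 split and 2 leaves.
node_counts <- trees[, .N, by = Tree]
all(node_counts$N == 3)  # TRUE => every tree is a stump
```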

emilycantrell commented:

@HanzhangRen Thank you for this side-by-side comparison!! Here are some reactions:

  • My first reaction was disappointment that adding so much more data made almost no difference! I'm still a bit disappointed, but your statement about being pleasantly surprised that at least the out-of-distribution data didn't make things worse is fair.
  • Open question: Why is our F1 score here so much higher than what we scored on the leaderboard? Previously I thought we might be getting a higher F1 score via cross-validation because of household-based leakage in the cross-validation. Now I suppose the most likely reason for the discrepancy between the cross-validated F1 score and the leaderboard F1 score is winner's curse. (But based on the screenshot, I don't think this is a "Theranos winner" situation, just a more run-of-the-mill winner's curse.) I'll keep thinking about this, because I'm still wondering if something else explains the F1 score discrepancy.
  • The use of only tree stumps is surprising to me. I need to think on this and read more about XGBoost.
  • I wonder whether the variable about when they expect to have kids is a super important stump. Does XGBoost give an easy way to see feature importance? If you're not sure off the top of your head, I'll look into this later this week.
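(For reference, the xgboost R package does provide this; a minimal sketch, again assuming the fitted booster is available as bst:)

```r
library(xgboost)

imp <- xgb.importance(model = bst)  # data.table: Feature, Gain, Cover, Frequency
head(imp, 10)                       # rows are sorted by Gain, so the top rows are the most important features
xgb.plot.importance(imp, top_n = 15)
```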
