Notes for time-shifting data #15

Open
emilycantrell opened this issue May 1, 2024 · 7 comments


emilycantrell commented May 1, 2024

  • Make sure to add a year that indicates the actual year the data came from. Update: since we are only time-shifting from one time period right now (2018-2020), I created a binary indicator of whether the data is time-shifted.
  • Make sure year variables are shifted correctly.
    • So far I applied shifting to the following age/year variables: birthyear_bg, age_bg, cf20m029. If we add other year-related variables to our model, we can adjust them on a case-by-case basis (see the sketch below for the general pattern).
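A minimal sketch of this kind of adjustment, assuming the data lives in a data frame with the columns named above; the function name and the size of the offset are placeholders, not the repo's actual code:

```r
# Sketch only: which variables are calendar years (shift them) vs. ages
# (recompute or leave as-is), and the size of the offset, are assumptions
# to verify against the codebook.
library(dplyr)

time_shift <- function(df, shift_years) {
  df %>%
    mutate(
      birthyear_bg = birthyear_bg + shift_years,  # calendar-year variable
      cf20m029     = cf20m029 + shift_years,      # another year-valued variable
      # age_bg is relative to the survey year; depending on the convention,
      # it may stay as-is or need recomputing from the shifted birth year
      time_shifted = 1L                           # binary indicator for shifted rows
    )
}

# Non-shifted (2018-2020) rows would get time_shifted = 0L.
```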

emilycantrell commented May 3, 2024

  • Make sure to restrict the data to only people of the right age
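A one-line filter along these lines would do it; the 18-45 bounds here are placeholders, not the range actually used:

```r
library(dplyr)
# Hypothetical age bounds; use whatever range the outcome definition requires.
data <- filter(data, between(age_bg, 18, 45))
```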


emilycantrell commented May 3, 2024

  • Make sure to remove any references to specific ID numbers that are manually written in the outcome calculation code before posting to the repo

emilycantrell commented:

  • Consider whether to do anything special for features that ask about specific political figures, e.g., "What do you think of Jan Peter Balkenende?" (a question about a specific person is usually asked in only a small number of waves)


emilycantrell commented May 8, 2024

  • Adjust time-shifted income or money-related variables for inflation.
    • Update: I applied an inflation adjustment to nettohh_f_2020 since we are using it in our model. We can apply the adjustment to any other money-related variables we add to our model on a case-by-case basis (see the sketch below).
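A sketch of a simple CPI-ratio adjustment; the index values and years below are placeholders, not the figures actually used:

```r
# Placeholder CPI index values; substitute the real Dutch CPI for the years involved.
cpi <- c("2017" = 98.6, "2020" = 105.4)

adjust_for_inflation <- function(amount, from_year, to_year = "2020") {
  amount * cpi[[to_year]] / cpi[[from_year]]
}

# e.g., express a time-shifted household income in 2020 euros:
# df$nettohh_f_2020 <- adjust_for_inflation(df$nettohh_f_2020, "2017")
```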

HanzhangRen commented:

There is an additional feature among the ones I handpicked that requires special attention: cf20m026 is the partner's birth year and should be time-shifted in the same way as respondent birth years and cohabitation start years. I implemented a piece of code that does that.
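Presumably along the lines of the earlier sketch (same assumed offset, not the actual implementation):

```r
library(dplyr)
# cf20m026 (partner's birth year) moves with the shift, like birthyear_bg
df <- mutate(df, cf20m026 = cf20m026 + shift_years)
```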


HanzhangRen commented May 13, 2024

@emilycantrell TL;DR: After including time-shifted data, the maximum cross-validation F1 score increased only slightly, from 0.769 to 0.773.

I compared the cross-validation results of the current version of the code to those from the following two files (R scripts converted to .txt):

training_no_shift.txt
submission_no_shift.txt

The differences between the "no shift" code and our current code are as follows:

  1. The "no shift" code does not have the time-shifted data.
  2. The "no shift" code has fewer predictors, because I felt that, given the data limitations, it was best to combine or delete some categories in some categorical variables.
  3. The hyperparameter grids are slightly different because I constructed each of them somewhat independently in a largely intuition-based, iterative process, but I don't think this changes things by much.

Here are the top pipelines from the "no shift" code:
[screenshot]

Here are the top pipelines from the current code:
[screenshot]

As you can see, after including time-shifted data, the maximum cross-validation F1 score increased only slightly, from 0.769 to 0.773, and there is significant uncertainty surrounding both estimates.

In fact, I am pleasantly surprised by these results, because I half expected the F1 scores to go down significantly now that we are training with out-of-sample data. It is quite impressive that even though our training data is now less representative of our test data in many ways, we still achieve similar, if not better, predictions.

Perhaps this means that our new model is more robust to moving from one dataset to another, and we will see better performance on the holdout set even though the time shift does not show much improvement in cross-validation on the training set.

Another interesting observation is that with time-shifted data, the winning pipeline seems to involve only tree stumps, i.e., trees with a single split and a depth of 1. This is surprising to me because stumps cannot capture interactions between variables (though multiple stumps can capture nonlinearities by making splits in different places). I changed the seed from 0 to 1 and then to 2 and ran the code again, and stumps still won for both seeds. This makes me a little concerned that something is wrong in the code, but perhaps our dataset is simply too small for interaction effects to be useful.
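One way to check the stump claim directly, assuming the fitted booster is accessible as an xgb.Booster object (here called bst, a placeholder name):

```r
library(xgboost)
library(data.table)

# One row per node across all boosted trees in the ensemble.
trees <- xgb.model.dt.tree(model = bst)

# A depth-1 tree ("stump") has exactly 3 nodes: 1 split and 2 leaves.
node_counts <- trees[, .N, by = Tree]
all(node_counts$N == 3)  # TRUE => every tree is a stump
```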

emilycantrell commented:

@HanzhangRen Thank you for this side-by-side comparison!! Here are some reactions:

  • My first reaction was disappointment that adding so much more data made almost no difference! I'm still a bit disappointed, but your statement about being pleasantly surprised that at least the out-of-distribution data didn't make things worse is fair.
  • Open question: Why is our F1 score here so much higher than what we scored on the leaderboard? Previously I thought we might be getting a higher F1 score via cross-validation because of household-based leakage in the cross-validation. Now I suppose the most likely reason for the discrepancy between the cross-validated F1 score and the leaderboard F1 score is winner's curse. (But based on the screenshot, I don't think this is a "Theranos winner" situation, just a more run-of-the-mill winner's curse.) I'll keep thinking about this, because I'm still wondering if something else explains the F1 score discrepancy.
  • The use of only tree stumps is surprising to me. I need to think on this and read more about XGBoost.
  • I wonder whether the variable about when they expect to have kids is a super important stump. Does XGBoost give an easy way to see feature importance? If you're not sure off the top of your head, I'll look into this later this week.
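(For reference, the xgboost R package does provide this; a minimal sketch, again assuming the fitted booster is available as bst:)

```r
library(xgboost)

imp <- xgb.importance(model = bst)  # data.table: Feature, Gain, Cover, Frequency
head(imp, 10)                       # rows are sorted by Gain, so the top rows are the most important features
xgb.plot.importance(imp, top_n = 15)
```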
