Look-Ahead Bias in Generated Features #1074
Comments
Hi @Nasser-Alkhulaifi - you are correct, look-ahead bias is not good. tsfresh comes with a toolkit for managing forecasting datasets (https://tsfresh.readthedocs.io/en/latest/text/forecasting.html), which allows you to define which data should be taken into account when calculating the features. I do not know how you used tsfresh, but if you use the methods documented in the link, you should not get any look-ahead bias (because tsfresh simply cannot see the more recent data).
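The idea behind such a rolling approach can be sketched without tsfresh at all. The following is a plain-Python illustration, not tsfresh's implementation: the helper `rolling_windows` is hypothetical, and its `max_timeshift` argument only echoes the parameter name mentioned in this thread (the exact semantics in tsfresh may differ). Each window ends at time step t and is paired with the value at t + 1 as its target, so no feature can see the value it is meant to predict.

```python
from statistics import mean


def rolling_windows(values, max_timeshift=None):
    """Yield (end_index, window, target) triples.

    window holds the values up to and including end_index, capped at
    max_timeshift + 1 elements; target is the value one step later.
    Hypothetical helper for illustration only.
    """
    for end in range(len(values) - 1):
        start = 0 if max_timeshift is None else max(0, end - max_timeshift)
        yield end, values[start:end + 1], values[end + 1]


series = [10, 12, 11, 15, 14]  # made-up example data

# One feature row per window; features use only the window, never the target.
rows = [
    {"end": end, "mean": mean(window), "last": window[-1], "target": target}
    for end, window, target in rolling_windows(series, max_timeshift=2)
]

for row in rows:
    print(row)
```

Because the target always lies one step beyond the end of its window, a feature such as `last` can never coincide with the target by construction.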
Hi @nils-braun, apologies for the delayed response, and thanks for sharing this. I understand the effectiveness of rolling windows in preventing look-ahead bias, but the need to manually specify parameters such as max_timeshift seems to contradict the goal of automated feature extraction. Requiring users to determine these parameters themselves introduces a level of complexity and user intervention that may not align with the ease of use and automation that tsfresh aims to provide. Do you see what I mean, or am I missing something here? So I'm wondering: would it be possible to incorporate a more dynamic approach within tsfresh that determines these parameters automatically? Thank you again for this great package!
Yes, I think I understand (although it is different from your first post, because this is about UX and not about look-ahead bias, but that does not mean it is less important! So maybe my first answer was not relevant to your question). I do think that the defaults of the methods are chosen in a way that lets most users leave them unchanged. Happy to learn more if you think this is not the case. Maybe there is a misunderstanding in how to use the function, so let me give some details.
Maybe I misunderstood, but as a summary I do think that you can use PS: if you are asking why we do not apply the |
Apologies for the delayed response @nils-braun, and thank you for the detailed explanation - really appreciate it. I'll have a look and get back to you if I have any further questions or comments.
Hi,
I've noticed that some of the generated features exhibit look-ahead bias, which is critical and must be avoided in machine learning regression problems. Specifically, the features in X_train contain the exact values found in the same row of y_train, which leads to data leakage.
Example:
In the attached screenshot, you can see that X_train (features) includes values that are present in the same row of y_train. This creates look-ahead bias. Such features (e.g., lags or rolling-window statistics) should be shifted so that only data available at forecasting time is used for prediction.
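A minimal sketch of the shifting described above, in plain Python with made-up numbers (the series and the helper name `lag` are illustrative assumptions, not values from the screenshot). It contrasts a same-row feature, which is identical to the target, with a feature shifted by one step.

```python
def lag(values, k=1):
    """Return values shifted forward by k steps, padded with None at the start."""
    return [None] * k + values[:len(values) - k]


y = [3.0, 5.0, 4.0, 6.0]      # target at each time step (made-up data)

leaky_feature = list(y)       # same-row value: identical to the target
safe_feature = lag(y, k=1)    # only the previous step's value

# Row-by-row check: does the feature equal the target it should predict?
leaks = [f == t for f, t in zip(leaky_feature, y)]
safe = [f == t for f, t in zip(safe_feature, y)]

print(leaks)
print(safe)
```

A check like this on X_train against y_train is a quick way to spot columns that reproduce the target in the same row.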
Questions:
Why does this look-ahead bias exist in the generated features?
Am I using the tool incorrectly?
Is there a specific setting or method I am missing to avoid this issue?
Thank you.