When we build a classification model, we usually start by splitting the data into training and test sets with the Data Sampler widget. We then send both the training data and the test data to Test and Score and configure the widget to use them.
I understand that scaling and outlier removal should be performed only on the training dataset to avoid data leakage. But what about imputation of missing values, and Select Columns, where you may choose to ignore certain independent variables? Can those steps precede the Data Sampler widget, or will that introduce data leakage? None of this is addressed in the tutorials or the widget help information. Below is a workflow that works, but is it correct? Alternatively, the second image below makes more sense to me: the data are divided by the Data Sampler, and then the training data and the test data each receive the same imputation and feature reduction, applied separately.
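To make the leakage concern concrete, here is a minimal sketch in plain Python (not Orange code) of the usual convention: the imputation values are learned from the training portion only and then reused on the test portion. The function names and the toy data are hypothetical and only serve as an illustration.

```python
from statistics import mean

def fit_mean_imputer(rows):
    """Learn one fill value per column from the given (training) rows."""
    columns = list(zip(*rows))
    return [mean(v for v in col if v is not None) for col in columns]

def apply_imputer(rows, fill_values):
    """Replace missing entries (None) with previously learned fill values."""
    return [[fill if v is None else v for v, fill in zip(row, fill_values)]
            for row in rows]

# Hypothetical toy data: two numeric features, None marks a missing value.
train = [[1.0, 10.0], [3.0, None], [None, 30.0]]
test = [[None, 100.0], [5.0, None]]

fill_values = fit_mean_imputer(train)             # statistics come from train only
train_imputed = apply_imputer(train, fill_values)
test_imputed = apply_imputer(test, fill_values)   # test reuses the train statistics
```

If the fill values were instead computed on the full dataset before the split, the test rows would influence what the model sees during training, which is the leakage the question is about.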
According to the ebook Introduction to Data Mining by Zupan and Demšar, preprocessing should occur at the cross-validation phase, so the Preprocess widget is connected to Test and Score as shown below. They did not show how to run both training and test data, so I inserted the Data Sampler widget as shown in the third image.
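As I understand the idea of preprocessing at the cross-validation phase, the preprocessing is re-fitted inside every fold, so the held-out part never contributes to the statistics. The sketch below shows this in plain Python with a simple standardization step; it is not how Orange implements it, and the helper names and data are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn mean and standard deviation from the training fold only."""
    mu, sigma = mean(values), pstdev(values)
    return mu, sigma or 1.0  # guard against a zero standard deviation

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_size = n // k
    for i in range(k):
        start = i * fold_size
        stop = n if i == k - 1 else start + fold_size
        val = list(range(start, stop))
        train = [j for j in range(n) if j < start or j >= stop]
        yield train, val

# Hypothetical single-feature dataset.
data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

scaled_folds = []
for train_idx, val_idx in k_fold_indices(len(data), k=3):
    mu, sigma = fit_scaler([data[j] for j in train_idx])    # fit inside the fold
    scaled_val = [(data[j] - mu) / sigma for j in val_idx]  # apply to held-out part
    scaled_folds.append((mu, scaled_val))
```

Each fold ends up with its own mean and standard deviation, which is the difference from preprocessing the whole table once before evaluation.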