Preventing Data Leakage #5576

rehoyt · 2021-08-28T21:22:16Z

When we begin building a classification model, we usually start by splitting the data into training and test sets with the Data Sampler widget. We then send both the training data and the test data to Test and Score and configure the widget to evaluate on the test data.

I understand that scaling and outlier removal should be performed using only the training data to avoid data leakage, but what about imputation of missing values, and Select Columns, where you may choose to ignore certain independent variables? Can those steps precede the Data Sampler widget, or will that introduce data leakage? None of this is addressed in the tutorials or the widget help. Below is a workflow that works, but is it correct? Alternatively, the second image below makes more sense to me: the data are divided by the Data Sampler, and then the training data and the test data each receive the same imputation and feature reduction, applied separately.
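To make the imputation concern concrete, here is a minimal sketch in plain Python (not Orange widgets). It assumes mean imputation and a toy data set; the function names `fit_mean_imputer` and `apply_imputer` are made up for illustration. The point is only where the column means come from: the training rows alone, or all rows before the split.

```python
from statistics import mean


def fit_mean_imputer(rows):
    """Learn one mean per column, ignoring missing values (None)."""
    columns = list(zip(*rows))
    return [mean(v for v in col if v is not None) for col in columns]


def apply_imputer(rows, means):
    """Replace each missing value with the previously learned column mean."""
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]


# Hypothetical toy data: two numeric features, None marks a missing value.
train = [[1.0, 10.0], [3.0, None], [None, 30.0]]
test = [[None, 50.0], [100.0, None]]

# Leak-free: the statistics come from the training rows only,
# and the same statistics are then reused on the test rows.
train_means = fit_mean_imputer(train)
train_imputed = apply_imputer(train, train_means)
test_imputed = apply_imputer(test, train_means)

# Leaky: statistics computed before the split have also seen the test rows.
leaky_means = fit_mean_imputer(train + test)

print(train_means)  # means learned from training rows only
print(leaky_means)  # means shifted by the test rows
```

In this sketch the two sets of means differ because the leaky version has looked at the test rows. By contrast, dropping a column by name uses no statistics from the data, so the same manual choice gives the same result on either side of the split; a data-driven selection (for example, ranking features against the target) would need to be fitted on training rows only.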

According to the ebook Introduction to Data Mining by Zupan and Demsar, preprocessing should occur at the cross-validation phase, so the Preprocess widget is connected to Test and Score as shown below. They did not show how to run both training and test data, so I inserted the Data Sampler widget as shown in the third image.
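As a rough illustration of what "preprocessing at the cross-validation phase" means in general, here is a plain-Python sketch. It assumes standardization as the preprocessing step and contiguous folds; the helper names are hypothetical, and it makes no claim about how Test and Score is implemented internally. The scaling parameters are re-learned from the training part of every fold and then reused on the held-out part.

```python
from statistics import mean, pstdev


def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def fit_scaler(values):
    """Learn mean and standard deviation from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 for a constant column
    return mu, sigma


def scale(values, mu, sigma):
    """Standardize values with parameters learned elsewhere."""
    return [(v - mu) / sigma for v in values]


# Hypothetical single-feature data set.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

fold_params = []
for train_idx, test_idx in k_fold_indices(len(feature), k=3):
    train_vals = [feature[i] for i in train_idx]
    test_vals = [feature[i] for i in test_idx]
    mu, sigma = fit_scaler(train_vals)          # fitted on the training fold only
    train_scaled = scale(train_vals, mu, sigma)
    test_scaled = scale(test_vals, mu, sigma)   # reuses the training-fold parameters
    fold_params.append((mu, sigma))
    # A learner would be trained on train_scaled and scored on test_scaled here.

print(fold_params)
```

Each fold ends up with its own scaling parameters, which is the difference from scaling the whole table once before any splitting.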

Replies: 0 comments
