Observed behavior
Hi, there are bugs in classification-and-pca-lab.ipynb for Lab 6 in the do_classify and classify_from_dataframe methods. When the testing data is standardized, the testing data's own mean and standard deviation are used. This is incorrect for several reasons, such as:
- No statistics computed from the testing data should feed into preprocessing or prediction, as that is a form of data snooping. The testing dataset is contaminated as a result.
- The training and testing sets are not put through the same transformation, so the standardized variable does not mean the same thing in both sets.
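To illustrate the second point, here is a minimal sketch on made-up toy data (the frames `train` and `test` and the column `x` are hypothetical, not taken from the notebook). It assumes pandas is installed and shows that the same raw value gets a different standardized value depending on whose statistics are used:

```python
import pandas as pd

# Hypothetical toy data: one feature, with the test set shifted upward.
train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
test = pd.DataFrame({"x": [5.0, 6.0, 7.0]})

# Training set standardized with its own statistics (this part is fine).
train_z = (train - train.mean()) / train.std()

# Test set standardized with its OWN statistics (the behavior reported above).
test_z_own = (test - test.mean()) / test.std()

# Test set standardized with the TRAINING statistics.
test_z_train = (test - train.mean()) / train.std()

# The raw value 5.0 appears in both sets (last training row, first test row).
z_in_train = train_z["x"].iloc[4]
z_in_test_own = test_z_own["x"].iloc[0]
z_in_test_train = test_z_train["x"].iloc[0]

print(z_in_train, z_in_test_own, z_in_test_train)
```

With the test set's own statistics, the raw value 5.0 lands below the mean in the test set but above the mean in the training set. With the training statistics it maps to the same standardized value in both.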
Expected behavior
The training data mean and standard deviation should be used for standardizing the testing data like so:
dftest = (subdf.iloc[itest] - subdf.iloc[itrain].mean()) / subdf.iloc[itrain].std()
Xte = (subdf.iloc[itest] - subdf.iloc[itrain].mean()) / subdf.iloc[itrain].std()
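For anyone who wants to try this outside the notebook, here is a rough self-contained sketch of the same idea. The helper name `standardize_with_train_stats` and the random data are my own assumptions for illustration. Only `subdf`, `itrain`, `itest` and `Xte` mirror names from the lab code, and this is not the notebook's actual implementation:

```python
import numpy as np
import pandas as pd


def standardize_with_train_stats(subdf, itrain, itest):
    """Standardize both splits using statistics from the training rows only.

    Hypothetical helper for illustration; not part of the lab notebook.
    """
    train = subdf.iloc[itrain]
    mu = train.mean()
    sigma = train.std()  # pandas default is the sample std (ddof=1)
    Xtr = (train - mu) / sigma
    Xte = (subdf.iloc[itest] - mu) / sigma  # reuse the training mu and sigma
    return Xtr, Xte


# Made-up data standing in for the lab's dataframe.
rng = np.random.default_rng(0)
subdf = pd.DataFrame(
    rng.normal(loc=10.0, scale=3.0, size=(20, 2)), columns=["a", "b"]
)
itrain = np.arange(0, 15)   # positional indices of the training rows
itest = np.arange(15, 20)   # positional indices of the testing rows

Xtr, Xte = standardize_with_train_stats(subdf, itrain, itest)
```

The standardized training set has mean 0 and standard deviation 1 per column by construction. The standardized testing set generally will not, and that is expected, since it is measured on the training set's scale.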
I think this was mentioned in one of the earlier lectures and here are some more references: