Data Snooping with Test Set Inputs Intuition
Lecture 17 gives an example where test data is used to calculate the means for pre-processing the training data. It is stated that doing so biases the results, such that performance is inflated when the model is evaluated on the test set.
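To make sure I'm describing the same setup, here is a minimal sketch in NumPy of what I understand the snooping to be (the data, shapes, and names are mine for illustration, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inputs; the shapes and distributions are made up for illustration.
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Snooped pre-processing: the normalization statistics are computed on
# train + test combined, so the test inputs leak into the transform.
X_all = np.vstack([X_train, X_test])
mu_snooped, sd_snooped = X_all.mean(axis=0), X_all.std(axis=0)
X_train_snooped = (X_train - mu_snooped) / sd_snooped
X_test_snooped = (X_test - mu_snooped) / sd_snooped

# Clean pre-processing: the statistics come from the training set only,
# and the same fixed transform is then applied to the test set.
mu_train, sd_train = X_train.mean(axis=0), X_train.std(axis=0)
X_train_clean = (X_train - mu_train) / sd_train
X_test_clean = (X_test - mu_train) / sd_train
```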
It makes sense to me that test data should not be used at all for learning the parameters of a model, including the parameters for pre-processing. After all, when a model is used in production, the pre-processing parameters have to already exist and can't be a function of online data.
However, I am having a difficult time understanding the intuition behind the example from Lecture 17. Why does using test data to calculate the means for normalizing the inputs improve the performance when the model is tested? It is clearer to me why the test scores would be inflated if, say, the test labels were somehow incorporated into the training process (for example, by doing feature selection prior to splitting the data; a sketch of what I mean is below).
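For contrast, here is the kind of label snooping I find easier to understand, again as a hypothetical NumPy sketch (pure-noise data so the inflation is easy to see; none of this is from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)

# Pure-noise data: with no real signal, any "informative" features found
# using all of the labels are a fluke, which makes the leak easy to see.
X = rng.normal(size=(100, 1000))
y = rng.integers(0, 2, size=100)

# Snooped: select the features most correlated with y on the FULL dataset,
# and only then split. The test labels have influenced the selection.
corr_with_y = np.abs(np.corrcoef(X.T, y)[-1, :-1])
top_features = np.argsort(corr_with_y)[-10:]
X_selected = X[:, top_features]
X_fit, X_eval = X_selected[:70], X_selected[70:]
y_fit, y_eval = y[:70], y[70:]
# A classifier trained on (X_fit, y_fit) will now look better on
# (X_eval, y_eval) than it deserves to, even though the data is noise.

# Clean alternative: split first, select features on the training fold
# only, and apply that fixed selection to the held-out fold.
```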
Thanks,
Dan