11-03-2014, 08:12 PM
daniel0
Data Snooping with Test Set Inputs Intuition

Lecture 17 gives an example where test data is used to compute the means for pre-processing the training data. It is indicated that doing so biases the results, inflating the measured performance when the model is evaluated on the test set.

It makes sense to me that test data should not be used at all for learning parameters of a model, including parameters for pre-processing. After all, when a model is used in production, the pre-processing parameters have to already exist, and can't be a function of online data.

However, I am having a difficult time understanding the intuition behind the example from Lecture 17. Why does using test data to calculate the means used for normalization improve the measured performance when the model is tested on the test set? It is clearer to me why test scores would be inflated if, say, the test labels were somehow incorporated into the training process (for example, by doing feature selection prior to splitting the data).
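For concreteness, here is a minimal sketch (my own, not from the lecture) of the two pre-processing regimes being contrasted: the "snooping" version computes normalization parameters from all the data, train and test combined, while the clean version computes them from the training set only and then applies them to the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=3.0, size=(100, 3))
X_train, X_test = X[:80], X[80:]

# Snooping version: mean/std computed from ALL data, including the test set
mu_all, sigma_all = X.mean(axis=0), X.std(axis=0)
X_train_snoop = (X_train - mu_all) / sigma_all
X_test_snoop = (X_test - mu_all) / sigma_all

# Clean version: mean/std computed from the training set only,
# then reused unchanged on the test set (as would happen in production)
mu_tr, sigma_tr = X_train.mean(axis=0), X_train.std(axis=0)
X_train_clean = (X_train - mu_tr) / sigma_tr
X_test_clean = (X_test - mu_tr) / sigma_tr
```

In the snooping version, the test points have already influenced the scale and location of every training input before learning starts, which is exactly the leakage the lecture warns about, even though no test labels are ever touched.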
