#2
11-03-2014, 09:19 PM
daniel0 (Junior Member)
Re: Data Snooping with Test Set Inputs Intuition

I can think of an exaggerated example where having access to the test inputs could bias a trained model toward performing well on the test data: during training, training observations that lie near test inputs could be given extra weight, so the model is steered toward doing well in exactly the region the test set occupies (as in the sketch below).
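Something along these lines is what I have in mind; the data, the inverse-distance weighting scheme, and the use of logistic regression are all just my own made-up choices for illustration:

[code]
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))                      # placeholder training inputs
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # placeholder labels
X_test = rng.normal(size=(30, 2))                        # test inputs (no test labels used)

# Upweight training points that lie close to some test input,
# so the fit concentrates on the region the test set occupies.
d_nearest = cdist(X_train, X_test).min(axis=1)
weights = 1.0 / (d_nearest + 1e-6)

clf = LogisticRegression().fit(X_train, y_train, sample_weight=weights)
[/code]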

Any intuition for the original example would still be appreciated, though. That is, an intuitive reason for an accuracy bias (a positive one in the Lecture 17 example) when the training data are normalized using means and variances computed from both the training and test data; the two procedures are contrasted in the sketch below. As the data set grows, it seems the issue should decrease in severity, since the training and test means and variances would, with high probability, become closer as n grows.
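To be concrete, here is a minimal sketch (Python/NumPy, with made-up data) of the two procedures I am asking about: computing the normalization parameters from the training set alone versus from the combined training and test inputs:

[code]
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=2.0, scale=3.0, size=(50, 4))   # placeholder training inputs
X_test = rng.normal(loc=2.0, scale=3.0, size=(20, 4))    # placeholder test inputs

# Correct: normalization parameters from the training set only;
# the test set is transformed with the training statistics.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_ok = (X_train - mu) / sigma
X_test_ok = (X_test - mu) / sigma

# Snooped: parameters computed from train + test combined.
X_all = np.vstack([X_train, X_test])
mu_all = X_all.mean(axis=0)
sigma_all = X_all.std(axis=0)
X_train_snoop = (X_train - mu_all) / sigma_all
X_test_snoop = (X_test - mu_all) / sigma_all
[/code]

In the second version the test inputs influence mu_all and sigma_all, so the normalized training data already carry some information about where the test points lie, even though no test labels ever enter the picture.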

As I mentioned, it is clearer to me why this is problematic when labels from the test set are used during training.

I have heard the same warning about dimensionality reduction (I am not referring to feature selection, where test labels are used and where I do understand the consequences intuitively). The warning there is the same: when doing PCA (or some other unsupervised dimensionality reduction), fit the pre-processing on the training data only and then use those fitted parameters to reduce the dimensions of the test data at evaluation time (sketched below). I also have a hard time seeing intuitively why doing otherwise would bias results one way or the other.
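And for the PCA case, a minimal sketch (Python with scikit-learn, placeholder data) of the two procedures:

[code]
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 4))   # placeholder training inputs
X_test = rng.normal(size=(20, 4))    # placeholder test inputs

# Correct: fit the projection on the training inputs only,
# then apply that same projection to the test inputs.
pca = PCA(n_components=2)
Z_train = pca.fit_transform(X_train)
Z_test = pca.transform(X_test)

# Snooped: fit PCA on train + test together, so the principal
# directions themselves are influenced by where the test inputs lie.
pca_snoop = PCA(n_components=2).fit(np.vstack([X_train, X_test]))
Z_train_snoop = pca_snoop.transform(X_train)
Z_test_snoop = pca_snoop.transform(X_test)
[/code]

In the snooped version the principal directions depend on the test inputs, which seems like the same flavor of leakage, though I still don't have a crisp intuition for the direction of the resulting bias.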