06-07-2012, 04:45 PM
magdon
Join Date: Aug 2009
Location: Troy, NY, USA.
Posts: 597
Default Re: Cross validation and data snooping

Unfortunately, you have to be very careful even here. If you process the data by looking only at the x-values, you have still snooped. Take the following simple example: suppose I dimensionally reduce the data from d to 2 dimensions by constructing 'unsupervised' features - features that just try to represent the \mathbf{x}'s. In general you will construct 2-d features that represent the training data fairly well, and they may easily misrepresent the inputs in the test data - for example, when you apply your 'training-learned' normalization to the test data, it may project a test point to zero and produce a nonsense prediction with high error. On the other hand, if you include the test data when constructing your dimensionality reduction, then the prediction on the test point will be different, because it takes into account information from the test set - yes, the x-value of a test point is information about the test set.
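A minimal sketch of this leak, using synthetic data and a simple mean/std normalization (the data, shapes, and shift are illustrative assumptions, not anything from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(100, 5))
X_test = rng.normal(loc=3.0, scale=1.0, size=(10, 5))  # test inputs drawn from a shifted region

# Legitimate: the normalization parameters come from the training x's only.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
Z_test_clean = (X_test - mu) / sigma

# Snooping: the test x's influence the normalization itself,
# even though no y-values were ever touched.
X_all = np.vstack([X_train, X_test])
mu_all, sigma_all = X_all.mean(axis=0), X_all.std(axis=0)
Z_test_snooped = (X_test - mu_all) / sigma_all

# The two versions of the "same" test features differ: the snooped
# pipeline has leaked information about the test inputs into the features
# the predictor will see.
print(np.allclose(Z_test_clean, Z_test_snooped))  # prints False
```

Any predictor trained on the snooped features is evaluated on test points it has already, indirectly, seen, so the test error is no longer an unbiased estimate.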

You are not allowed to even see the test set before constructing your predictor using *only* your training set, if you want your test performance to be an unbiased estimate of your true out-of-sample performance.

Originally Posted by dudefromdayton
Professor Magdon-Ismail's answer is great in the most general case, but I think you'll also want to look at what you want to do with machine learning.

If you're doing binary classification, and you define your data model with normalized outputs, such as +1 for peach marbles and -1 for plum marbles, and you encode your y's in this fashion, you haven't snooped.

And then if you normalize your x's for the same problem, but you only use these inputs for scaling and don't touch the y's, you still haven't snooped.

Where I see potential danger in scaling the input -- and it's not related to snooping -- is that if you don't scale isotropically across the input components, you may change their relative sensitivity. If you then feed the result into an RBF kernel, the kernel's response to those separate components will change. So I'd add that caution, although I don't think it's snooping in the conventional sense. But it can be a data-dependent effect, so again, I'd be alert to it. I haven't noticed any mention of this possibility in any of our RBF lectures, notes, etc.
Have faith in probability