LFD Book Forum (http://book.caltech.edu/bookforum/index.php)
-   Chapter 5 - Three Learning Principles (http://book.caltech.edu/bookforum/forumdisplay.php?f=112)
-   -   Cross validation and data snooping (http://book.caltech.edu/bookforum/showthread.php?t=616)

marcello 06-05-2012 02:47 AM

Cross validation and data snooping
 
To avoid data snooping, would it be better to leave the cross-validation subset out when we normalize the data?
I would guess the cv set is affected by snooping the same way the test data is, right?

So for 10-fold cv, would it be better to compute the scaling from the 9/10 used for training, and then apply that same scaling to the 1/10 held out for validation?
Would the results still be comparable in that case, given that the 10 different splits would each use a different scaling?

magdon 06-05-2012 04:28 AM

Re: Cross validation and data snooping
 
Yes. Otherwise you have data snooped.
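In scikit-learn terms, one way to sketch this per-fold scaling is with a Pipeline, which refits the scaler on the 9/10 training portion of each fold automatically (the dataset and parameters below are made up for illustration, not from the thread):

```python
# Sketch: per-fold scaling via a Pipeline, so the scaler never sees
# the held-out 1/10 of each fold (synthetic data for illustration).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=12, random_state=0)

# WRONG: scaling the full dataset up front leaks the validation-fold
# statistics into training -- that is the data snooping in question.
# X_scaled = StandardScaler().fit_transform(X)

# RIGHT: the pipeline fits StandardScaler inside each training fold only,
# then applies that fold's scaling to the held-out fold.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean())
```

Each of the 10 folds does end up with its own scaling, as marcello suspected, but that is the price of keeping the cv estimate honest.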

Quote:

Originally Posted by marcello (Post 2789)
To avoid data snooping, would it be better to leave the cross-validation subset out when we normalize the data?
I would guess the cv set is affected by snooping the same way the test data is, right?

So for 10-fold cv, would it be better to compute the scaling from the 9/10 used for training, and then apply that same scaling to the 1/10 held out for validation?
Would the results still be comparable in that case, given that the 10 different splits would each use a different scaling?


dudefromdayton 06-05-2012 04:49 AM

Re: Cross validation and data snooping
 
Professor Magdon-Ismail's answer is great in the most general case, but I think you'll also want to look at what you want to do with machine learning.

If you're doing binary classification, and you define your data model with normalized outputs, such as +1 for peach marbles and -1 for plum marbles, and you encode your y's in this fashion, you haven't snooped.

And then if you normalize your x's for the same problem, but you only use these inputs for scaling and don't touch the y's, you still haven't snooped.

Where I see potential danger in scaling the input (and it's not related to snooping) is that if you don't scale isotropically across the input components, you may change their relative sensitivity. If you're going into an RBF kernel after that, the kernel's response to these separate components will be changed. So I'd add that caution, although I don't think it's snooping in the conventional sense. But it can be a data-dependent effect, so again, I'd be alert to it. I haven't noticed any mention of this possibility in any of our RBF lectures, notes, etc.
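To see the effect concretely, here is a small sketch (numbers are hypothetical) of how per-component scaling changes an RBF kernel's value, because it changes the relative distances between points:

```python
# Sketch: anisotropic scaling changes RBF kernel values, because the
# kernel depends on Euclidean distance in the (re)scaled input space.
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian RBF kernel exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Two points whose second component has a much larger spread.
x1 = np.array([1.0, 100.0])
x2 = np.array([2.0, 150.0])

# Unscaled: the second component dominates the squared distance.
k_raw = rbf(x1, x2)

# Anisotropic scaling: divide each component by its own (hypothetical) scale.
s = np.array([1.0, 50.0])
k_scaled = rbf(x1 / s, x2 / s)

print(k_raw, k_scaled)  # the two kernel values differ substantially
```

After scaling, both components contribute comparably to the distance, so the kernel "sees" the first component again; before scaling it was effectively ignored.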

marcello 06-05-2012 08:51 AM

Re: Cross validation and data snooping
 
Thanks for the answers.

So suppose you are dealing with a classification problem and you're planning to use an SVM with RBF kernels: would your best shot be not normalizing the data at all?

If I got it right, when you do normalize the data (just the X's, of course), you'd better scale every "fold" separately, leaving out the cv set, but then you are somehow changing the problem.

Another question comes to mind: could aggregation be used to overcome this problem? Maybe, to avoid being excessively misled, you could use an SVM with scaled data, an SVM with the original inputs, and another classifier, and choose the answer that gets the most votes?
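That aggregation idea could be sketched in scikit-learn with a VotingClassifier; the third classifier and all parameters here are arbitrary choices for illustration:

```python
# Sketch of the voting idea: SVM on scaled inputs, SVM on raw inputs,
# and a third classifier, combined by majority vote (synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=12, random_state=0)

vote = VotingClassifier([
    ("svm_scaled", make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
    ("svm_raw", SVC(kernel="rbf")),
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
])
vote.fit(X, y)
print(vote.score(X, y))
```

Note this addresses the scaling sensitivity, not the snooping issue: any scalers inside the ensemble still have to be fit on training data only.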

Thanks again

marcello 06-05-2012 03:20 PM

Re: Cross validation and data snooping
 
One thing that makes me really suspicious about scaling all the features non-isotropically is the behaviour of the SVM classifiers in the Python scikit-learn package.
If I don't scale the data, cross-validation is basically useless, because only a couple of distinct error values are generated (how likely is that?!).

I wonder if it is a package bug, or if there is some countermeasure that should be taken.

markweitzman 06-05-2012 05:33 PM

Re: Cross validation and data snooping
 
I think with the data set we had, that is quite likely. Remember there were only a finite number of points; given the values of Ein and Eout, there were not that many points on which the measures could differ, so there should only be a few distinct cv values. I am writing from memory here, as the problem was a few weeks ago, but my results with the scikit package were consistent with other packages, including cvxopt and direct use of libsvm.
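The discreteness is easy to see: with N validation points in a fold, a 0/1 classification error can only be k/N for some integer k, so a small fold admits very few distinct error values (the N below is hypothetical):

```python
# With N validation points per fold, the fold's classification error is
# (# mistakes)/N, so it can only take N+1 discrete values.
N = 10  # hypothetical fold size
possible_fold_errors = [k / N for k in range(N + 1)]
print(possible_fold_errors)  # 0.0, 0.1, ..., 1.0
```

So seeing only a couple of repeated cv values on a small data set is expected behaviour, not a package bug.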

marcello 06-06-2012 05:10 AM

Re: Cross validation and data snooping
 
No, actually I'm testing on another database, 5000+ records and a dozen features:
without scaling, Ecv and Eout are around 50%;
with scaling, the best I could get is an Eout of 16%.

Still not satisfying, but the difference is impressive.

magdon 06-07-2012 04:45 PM

Re: Cross validation and data snooping
 
Unfortunately, you have to be very careful even here. If you process the data by looking only at the x-values, you have still snooped. Take the following simple example: I am going to reduce the data from d to 2 dimensions by constructing 'unsupervised' features. These features just try to represent the \mathbf{x}'s. In general you will construct 2-d features that represent the training data fairly well, but they may easily ignore the inputs in the test data: for example, when you apply your 'training-learned' normalization to the test data, it may project the test data to zero and result in a nonsensical prediction with high error. On the other hand, if you include the test data when constructing your dimensionality reduction, then the prediction on the test point will be different, taking into account information from the test set (yes, the x-values are information about the test set).

You are not allowed even to see the test set before constructing your predictor using *only* your training set, if you want your test performance to be an unbiased estimate of your true out-of-sample performance.
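A minimal sketch of this point using PCA as the 'unsupervised' dimensionality reduction (PCA stands in for the generic feature construction described above; data and split are synthetic):

```python
# Sketch: even 'unsupervised' preprocessing snoops if it sees the test x's.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Honest: fit the 2-d projection on the training x's only, then apply it.
pca_honest = PCA(n_components=2).fit(X_tr)
Z_te_honest = pca_honest.transform(X_te)

# Snooped: fitting on train + test x's lets test-set information shape
# the features, even though no y-values were touched.
pca_snooped = PCA(n_components=2).fit(np.vstack([X_tr, X_te]))
Z_te_snooped = pca_snooped.transform(X_te)

# The same test points generally land in different places under the two
# projections, so the downstream test error is no longer unbiased.
print(np.allclose(Z_te_honest, Z_te_snooped))
```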


Quote:

Originally Posted by dudefromdayton (Post 2792)
Professor Magdon-Ismail's answer is great in the most general case, but I think you'll also want to look at what you want to do with machine learning.

If you're doing binary classification, and you define your data model with normalized outputs, such as +1 for peach marbles and -1 for plum marbles, and you encode your y's in this fashion, you haven't snooped.

And then if you normalize your x's for the same problem, but you only use these inputs for scaling and don't touch the y's, you still haven't snooped.

Where I see potential danger in scaling the input (and it's not related to snooping) is that if you don't scale isotropically across the input components, you may change their relative sensitivity. If you're going into an RBF kernel after that, the kernel's response to these separate components will be changed. So I'd add that caution, although I don't think it's snooping in the conventional sense. But it can be a data-dependent effect, so again, I'd be alert to it. I haven't noticed any mention of this possibility in any of our RBF lectures, notes, etc.



