LFD Book Forum  

Book Feedback - Learning From Data > Chapter 5 - Three Learning Principles

#1   06-05-2012, 03:47 AM
marcello (Member; Join Date: Apr 2012; Posts: 35)
Cross validation and data snooping

To avoid data snooping, would it be better to leave the cross-validation subset out when we normalize the data? I guess the cv set would be affected the same way the test data is, right?

So, for 10-fold cv, would it be better to compute the scaling from the 9/10 used for training, and then apply that same scaling to the 1/10 held out for validation? Would the results be comparable in that case, given 10 different scalings for the 10 different splits?
#2   06-05-2012, 05:28 AM
magdon (RPI; Join Date: Aug 2009; Location: Troy, NY, USA; Posts: 592)
Re: Cross validation and data snooping

Yes. Otherwise you have data snooped.
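
For concreteness, here is a minimal sketch of that fold-by-fold procedure, assuming scikit-learn's KFold and StandardScaler (which come up later in this thread); the synthetic data set and the RBF SVM are just placeholders:

Code:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=12, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_errors = []
for train_idx, val_idx in kf.split(X):
    # Learn the normalization from the 9/10 training portion only...
    scaler = StandardScaler().fit(X[train_idx])
    X_train = scaler.transform(X[train_idx])
    # ...then apply that SAME scaling to the held-out 1/10, so the
    # validation fold never influences the normalization.
    X_val = scaler.transform(X[val_idx])

    clf = SVC(kernel='rbf').fit(X_train, y[train_idx])
    fold_errors.append(np.mean(clf.predict(X_val) != y[val_idx]))

E_cv = np.mean(fold_errors)  # one scaling per split, as asked above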

Quote:
Originally Posted by marcello (post #1 above)
__________________
Have faith in probability
#3   06-05-2012, 05:49 AM
dudefromdayton (Invited Guest; Join Date: Apr 2012; Posts: 140)
Re: Cross validation and data snooping

Professor Magdon-Ismail's answer is great in the most general case, but I think you'll also want to look at what you want to do with machine learning.

If you're doing binary classification, and you define your data model with normalized outputs, such as +1 for peach marbles and -1 for plum marbles, and you encode your y's in this fashion, you haven't snooped.

And then if you normalize your x's for the same problem, but you only use these inputs for scaling and don't touch the y's, you still haven't snooped.

Where I see potential danger in scaling the input (and it's not related to snooping) is that if you don't scale isotropically across the input components, you may change the relative sensitivity among those components. If you're going into an RBF kernel after that, the kernel's response to those separate components will be changed. So I'd add that caution, although I don't think it's snooping in the conventional sense. But it can be a data-dependent effect, so again, I'd be alert to it. I haven't noticed any mention of this possibility in any of our RBF lectures, notes, etc.
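
A tiny numerical illustration of that kernel effect (plain numpy; the points and scale factors are made up):

Code:
import numpy as np

def rbf(x1, x2, gamma=1.0):
    # Gaussian RBF kernel value for one pair of points
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

x1 = np.array([1.0, 1.0])
x2 = np.array([2.0, 3.0])

print(rbf(x1, x2))              # original inputs

# Isotropic scaling shrinks all distances uniformly, which is
# equivalent to changing gamma; the geometry is preserved.
print(rbf(0.5 * x1, 0.5 * x2))

# Anisotropic scaling changes the RELATIVE weight of the two
# components in the distance, so the kernel responds differently.
s = np.array([1.0, 0.1])        # shrink only the second feature
print(rbf(s * x1, s * x2))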
#4   06-05-2012, 09:51 AM
marcello (Member; Join Date: Apr 2012; Posts: 35)
Re: Cross validation and data snooping

Thanks for the answers.

So suppose you're dealing with a classification problem and you plan to use an SVM with RBF kernels: would your best shot be not to normalize the data at all?

If I got it right, when you do normalize the data (just the X, of course), you'd better scale every "fold" separately, leaving out the cv set, but then you are somehow changing the problem.

Another question comes to mind: could aggregation be used to overcome this problem? Maybe, to avoid being excessively misled, you could use an SVM with scaled data, an SVM with the original input, and another classifier, and choose the answer that gets the most votes? Something like the sketch below.
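
For instance, a minimal sketch of that majority-vote idea, assuming scikit-learn's VotingClassifier (the third classifier, the synthetic data, and all parameters are arbitrary placeholders):

Code:
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three voters: SVM on scaled inputs, SVM on the raw inputs,
# and a third, unrelated classifier; the majority vote decides.
ensemble = VotingClassifier(
    estimators=[
        ('svm_scaled', make_pipeline(StandardScaler(), SVC(kernel='rbf'))),
        ('svm_raw', SVC(kernel='rbf')),
        ('tree', DecisionTreeClassifier(max_depth=5)),
    ],
    voting='hard',  # 'hard' = majority vote on predicted labels
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))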

Thanks again
#5   06-05-2012, 04:20 PM
marcello (Member; Join Date: Apr 2012; Posts: 35)
Re: Cross validation and data snooping

One thing that makes me really suspicious about scaling all the features non-isotropically is the behaviour of the SVM classifiers in the Python scikit-learn package. If I don't scale the data, cross-validation is basically useless, because only a couple of distinct error values are generated (how likely is that?!).

I wonder if it is a package bug, or if there is some countermeasure that should be taken.
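
For reference, the kind of comparison I mean (a sketch; the synthetic data stand in for the real set). Note that putting the scaler inside the pipeline makes cross_val_score refit it on each training fold, which is exactly the no-snooping procedure discussed above:

Code:
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=12, random_state=0)

# Raw inputs: with badly scaled features the RBF SVM can degenerate
# to near-constant predictions, so the fold scores barely vary.
print(cross_val_score(SVC(kernel='rbf'), X, y, cv=10))

# Scaler inside the pipeline: refit on the 9/10 training portion of
# each fold, then applied to the held-out 1/10.
print(cross_val_score(
    make_pipeline(StandardScaler(), SVC(kernel='rbf')), X, y, cv=10))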
#6   06-05-2012, 06:33 PM
markweitzman (Invited Guest; Join Date: Apr 2012; Location: Las Vegas; Posts: 69)
Re: Cross validation and data snooping

I think with the data set we had, quite likely. Remember there were only a finite number of points; given the values of Ein and Eout, there were not that many points on which the measures could differ, so there should be only a few possible cv values. I am writing from memory here, as the problem was a few weeks ago, but my results with the scikit package were consistent with other packages, including cvxopt and direct use of libsvm.
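
To make the counting explicit: a validation fold with K points can only produce an error that is a multiple of 1/K, i.e. E_{\text{val}} \in \{0, \frac{1}{K}, \frac{2}{K}, \ldots, 1\}, so if the classifier is nearly constant, most folds will land on the same one or two of those values.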
#7   06-06-2012, 06:10 AM
marcello (Member; Join Date: Apr 2012; Posts: 35)
Re: Cross validation and data snooping

No, actually I'm testing on another DB, 5000+ records and a dozen features:
without scaling, Ecv and Eout are around 50%;
with scaling, the best I could get is Eout at 16%.

Still not satisfying, but the difference is impressive.
#8   06-07-2012, 05:45 PM
magdon (RPI; Join Date: Aug 2009; Location: Troy, NY, USA; Posts: 592)
Re: Cross validation and data snooping

Unfortunately, you have to be very careful even here. If you only process the data by looking at the x-values, then you have still snooped. Take the following simple example: I am going to dimensionally reduce the data from d to 2 dimensions by constructing 'unsupervised' features. These features just try to represent the \mathbf{x}'s. In general you will construct 2-d features that represent the training data fairly well, but they may easily ignore the inputs in the test data; for example, when you apply your 'training-learned' normalization to the test data, it may project the test data to zero and result in a nonsense prediction with high error. On the other hand, if you include the test data when constructing your dimensionality reduction, then the prediction on the test point will be different, taking into account information from the test set - yes, the x-value is information about the test set.

You are not allowed even to look at the test set: your predictor must be constructed using *only* your training set, if you want your test performance to be an unbiased estimate of your true out-of-sample performance.
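
A small sketch of the difference, with scikit-learn's PCA standing in for the unsupervised feature construction (the data are random placeholders):

Code:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))   # training inputs
X_test = rng.normal(size=(10, 5))     # test inputs

# Honest protocol: the projection is learned from the training
# inputs only and then merely applied to the test points.
pca_clean = PCA(n_components=2).fit(X_train)
z_clean = pca_clean.transform(X_test)

# Snooped protocol: the test inputs help choose the projection, so
# the test representation already encodes test-set information.
pca_snooped = PCA(n_components=2).fit(np.vstack([X_train, X_test]))
z_snooped = pca_snooped.transform(X_test)

# The two representations generally differ; any error estimated
# with the second one is no longer an unbiased test estimate.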


Quote:
Originally Posted by dudefromdayton (post #3 above)
__________________
Have faith in probability