#1
To avoid data snooping, should we leave out the cross-validation subset when we normalize the data?
I guess the cv set would be affected the same way the test data is, right? So for 10-fold cv, would it be better to compute the scaling by examining only the 9/10 used for training, and then apply that same scaling to the 1/10 left out for cv? Would the results still be comparable in that case, given that you end up with 10 different scalings for the 10 different splits?
#2
Yes. Otherwise you have data snooped.
__________________
Have faith in probability
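To make the per-fold scaling concrete, here is a minimal sketch using scikit-learn (which comes up later in this thread). The data set below is a synthetic stand-in, and the particular SVM settings are only illustrative; the point is that wrapping the scaler and the classifier in one pipeline makes the scaler get re-fit on the 9/10 training part of every fold and merely applied to the held-out 1/10.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; replace with the real X, y.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Because the scaler sits inside the pipeline, each CV split fits it on the
# 9/10 training folds only and then applies that scaling to the 1/10 held-out fold,
# so the held-out data never influences the normalization.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
scores = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold CV accuracy:", scores.mean())
```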
#3
Professor Magdon-Ismail's answer is great in the most general case, but I think you'll also want to look at what you want to do with machine learning.

If you're doing binary classification and you define your data model with normalized outputs, such as +1 for peach marbles and -1 for plum marbles, then encoding your y's in this fashion is not snooping. Likewise, if you normalize your x's for the same problem but use only those inputs for the scaling and don't touch the y's, you still haven't snooped.

Where I do see potential danger in scaling the input (and it's not related to snooping) is that if you don't scale isotropically across the input components, you may change the relative sensitivity among those components. If you feed the result into an RBF kernel, the kernel's response to those separate components will be changed (see the sketch below). So I'd add that caution, although I don't think it's snooping in the conventional sense; it is, however, a data-dependent effect, so I'd be alert to it. I haven't noticed any mention of this possibility in our RBF lectures, notes, etc.
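As an illustration of the caution above (not something from the original post), here is a small sketch showing how anisotropic scaling of the inputs changes an RBF kernel's response; the two points and the scaling factor are arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two points that differ by the same amount in each of the two input components.
a = np.array([[1.0, 0.0]])
b = np.array([[0.0, 1.0]])
print(rbf_kernel(a, b, gamma=1.0))  # the kernel treats both components symmetrically

# Scale only the second component by 10 (anisotropic scaling): the same two points
# now look much farther apart along that axis, and the kernel value collapses.
S = np.diag([1.0, 10.0])
print(rbf_kernel(a @ S, b @ S, gamma=1.0))
```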
#4
Thanks for the answers.
So suppose you are dealing with a classification problem and plan to use an SVM with RBF kernels: would your best shot be not normalizing the data at all? If I understood correctly, when you do normalize the data (just the X, of course), you should scale every "fold" separately, leaving out the cv set, but by doing so you are somehow changing the problem. Another question comes to mind: could aggregation be used to overcome this problem? Maybe, to avoid being excessively misled, you could use an SVM with scaled data, an SVM with the original inputs, and another classifier, and pick the answer that gets the most votes (see the sketch below)? Thanks again.
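One way to try the voting idea described in this post is scikit-learn's VotingClassifier. The third classifier (logistic regression) and the synthetic data below are purely illustrative assumptions, and whether the ensemble actually helps here is an empirical question, not something established in the thread.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Majority vote over: SVM on scaled inputs, SVM on raw inputs, and a third classifier.
vote = VotingClassifier(
    estimators=[
        ("svm_scaled", make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
        ("svm_raw", SVC(kernel="rbf")),
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression())),
    ],
    voting="hard",  # each model casts one vote; the majority label wins
)
print("10-fold CV accuracy of the ensemble:", cross_val_score(vote, X, y, cv=10).mean())
```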
#5
One thing that makes me really suspicious about scaling all features non-isotropically is the behaviour of SVM classifiers in Python's scikit-learn package.
If I don't scale the data, cross-validation is basically useless, because only a couple of distinct values for the error are generated (how likely is that?!). I wonder whether it is a package bug, or whether there is some countermeasure that should be taken.
#6
I think that with the data set we had it is actually quite likely. Remember there were only a finite number of points, and given the values of Ein and Eout there were not many points on which the measures could differ, so there should only be a few possible cv values (see the counting sketch below). I am writing from memory here, as the problem was a few weeks ago, but my results with the scikit-learn package were consistent with other packages, including cvxopt and direct use of libsvm.
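A quick way to see the counting argument: with a finite validation fold, the fold error can only be a multiple of one over the fold size, so an accurate classifier produces just a handful of distinct cv values. The fold size below is an arbitrary illustration, not the one from the homework data set.

```python
# With n_val points in a validation fold, the fold error is restricted to k / n_val.
n_val = 10  # points in one validation fold (an arbitrary illustration)
possible_errors = [k / n_val for k in range(n_val + 1)]
print(possible_errors)  # 0.0, 0.1, ..., 1.0 are the only values a single fold can take
```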
#7
Unfortunately, you have to be very careful even here. If you process the data by looking only at the x-values, then you have still snooped. Take the following simple example: I am going to dimensionally reduce the data from a higher dimension to a lower one using the x-values of all the data, training and test together. The reduction I obtain then depends on the test points, so I have snooped even though I never looked at a y-value. You are not allowed to even see the test set before constructing your predictor using *only* your training set, if you want your test performance to be an unbiased estimate of your true out-of-sample performance.
__________________
Have faith in probability
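A minimal sketch of the rule stated in this post, assuming PCA as a stand-in dimension reducer (the post does not name a specific method) and synthetic data: the reduction is constructed from the training x-values only and merely applied to the test x-values.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Correct: the reduction is fit on the training x-values only...
pca = PCA(n_components=2).fit(X_train)
X_train_red = pca.transform(X_train)
X_test_red = pca.transform(X_test)  # ...and only applied to the test x-values.

# Snooped (what the post warns against): PCA(n_components=2).fit(X) on all the
# x-values, including the test set's, lets the test points influence the reduction,
# so the measured test error is no longer an unbiased estimate of Eout.
```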