#1
When using the RBF-kernel SVM provided by scikit-learn/LIBSVM, it is important that the data be scaled. My question is: how should we scale (or standardize to zero mean and unit variance) the data when using cross-validation?
I have my training data D, and I am dividing it for k-fold cross-validation. Here is the procedure (a code sketch follows below):

1) First divide the data into k-1 "training folds" and one "test fold".
2) Perform a scaling operation on the training data (the k-1 folds). It could be standardization to (0,1).
3) Apply the same scaling operation (based on the same parameters) to the test fold.
4) Train the classifier.
5) CV-test on the test fold.
6) Go back to (1) until every fold has been used as the test fold.

I would like to check the following statement: should we use separate scaling operations for the CV training/test data (first split the data, then scale each data set separately)? Otherwise there is a risk of snooping and a too-optimistic E_cv. I think the Professor mentioned a subtle snooping case due to scaling both training and test data together! The other alternative is to scale the whole data set D and then perform cross-validation ---> snooping.
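A minimal sketch of the procedure above, assuming scikit-learn's KFold, StandardScaler, and SVC; X and y are hypothetical placeholder arrays, not data from the original post:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.randn(100, 5)          # placeholder features
y = np.random.randint(0, 2, 100)     # placeholder labels

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Step 2: fit scaling parameters on the k-1 training folds only
    scaler = StandardScaler().fit(X[train_idx])
    # Step 4: train on the scaled training folds
    clf = SVC(kernel="rbf").fit(scaler.transform(X[train_idx]), y[train_idx])
    # Steps 3 and 5: apply the *same* fitted parameters to the test fold, then CV-test
    scores.append(clf.score(scaler.transform(X[test_idx]), y[test_idx]))

print(np.mean(scores))
```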
#2
If scaling is pre-processing, scaling the whole training set is legitimate, not snooping. The E_cv then estimates the performance of the learning step applied after this fixed pre-processing. On the other hand, if scaling is part of learning, scaling should be done on the sub-training part instead. The E_cv then estimates the performance of the whole procedure of scaling plus learning. There is no right or wrong for the two choices --- just different viewpoints. In my experience, the performance difference between the two choices (on a locked test set) is often rather negligible in practice, and hence we often see people consider scaling as "pre-processing" for its simplicity of implementation. Hope this helps.
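As a sketch of the two viewpoints (my own illustration, assuming scikit-learn; X and y are placeholders as before), the pipeline variant implements "scaling as part of learning" by refitting the scaler on each sub-training part:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.randn(100, 5)          # placeholder features
y = np.random.randint(0, 2, 100)     # placeholder labels

# Viewpoint 1: scaling as pre-processing -- scale the whole training set once,
# then cross-validate the learning step alone.
X_pre = StandardScaler().fit_transform(X)
e_cv_pre = 1 - cross_val_score(SVC(kernel="rbf"), X_pre, y, cv=5).mean()

# Viewpoint 2: scaling as part of learning -- the pipeline refits the scaler
# on the sub-training part inside every fold.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
e_cv_learn = 1 - cross_val_score(pipe, X, y, cv=5).mean()

print(e_cv_pre, e_cv_learn)   # in practice the two estimates are often very close
```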
__________________
When one teaches, two learn.
Tags |
cross-validation, snooping |