09-19-2012, 06:30 AM
Andrs

Cross validation and scaling?

When using the SVM/RBF classifier provided by scikit-learn/LIBSVM, it is important that the data is scaled. My question is how we should scale (or standardize to zero mean and unit variance) the data when using cross-validation.
I have my training data D and I am splitting it with k-fold cross-validation. Here is the procedure:

1) First divide the data into k-1 training folds and one test fold.
2) Fit a scaling operation on the training data (the k-1 folds). It could be standardization to (0,1).
3) Apply the same scaling (with the parameters fitted in step 2) to the test fold.
4) Train the classifier.
5) Evaluate on the test fold (CV test).
6) Go to (1) until every fold has been used as the test fold.
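In code, the loop above looks roughly like this minimal sketch. It assumes a numpy feature matrix X and label vector y are already loaded (hypothetical names) and uses the current scikit-learn module layout, which has changed since 2012:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cv_with_per_fold_scaling(X, y, k=5):
    scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        # Step 2: fit the scaler on the k-1 training folds only.
        scaler = StandardScaler().fit(X[train_idx])
        # Step 3: apply the *same* fitted parameters (mean, std) to the test fold.
        X_train = scaler.transform(X[train_idx])
        X_test = scaler.transform(X[test_idx])
        # Steps 4-5: train on the scaled training folds, evaluate on the test fold.
        clf = SVC(kernel='rbf').fit(X_train, y[train_idx])
        scores.append(clf.score(X_test, y[test_idx]))
    return np.mean(scores)  # the cross-validation estimate E_cv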
I would like to check the following statement:
Should we have separate scaling operations for the CV training/test data (first split the data, then fit the scaling on the training folds and apply it to the test fold)? Otherwise there is a risk of snooping and a too-optimistic E_cv. I think the Professor mentioned a subtle snooping case due to scaling the training and test data together!
The other alternative is to scale the whole data set D first and then perform cross-validation ---> snooping.
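For comparison, here is a sketch of the non-snooping version using a scikit-learn Pipeline, which is cloned and re-fitted inside every CV fold automatically (again assuming X and y as above):

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The pipeline is re-fitted per fold, so StandardScaler only ever sees the
# training folds. The snooping alternative would instead call
# StandardScaler().fit_transform(X) on all of D before splitting.
pipe = Pipeline([('scale', StandardScaler()), ('svm', SVC(kernel='rbf'))])
scores = cross_val_score(pipe, X, y, cv=5)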