LFD Book Forum  

#1
09-19-2012, 07:30 AM
Andrs
Member

Join Date: Jul 2012
Posts: 47

Cross validation and scaling?

When using an SVM with the RBF kernel as provided by scikit-learn/LIBSVM, it is important that the data is scaled. My question is: how should we scale (or standardize to zero mean and unit variance) the data when using cross-validation?
I have my training data D and I am dividing it for k-fold cross-validation. Here is the procedure:

1) First divide the data into k-1 training folds and one test fold.
2) Perform a scaling operation on the training data (the k-1 folds); it could be standardization to zero mean and unit variance.
3) Perform a scaling operation (based on the same parameters) on the test fold.
4) Train the classifier.
5) Compute the CV test error.
6) Go to (1) until every fold has been used as the test fold.
I would like to check the following statement:
Should we scale the CV training and test data separately (first split the data, then scale each set, as sketched below)? Otherwise there is a risk of snooping and a too-optimistic E_cv. I think the Professor mentioned a subtle snooping case due to scaling both training and test data!
The other alternative is to scale the whole data set D and then perform cross validation ---> snooping.
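
In scikit-learn terms, the per-fold procedure above would look roughly like this (a minimal sketch, not an official recipe; the synthetic data, the 5-fold split, and the default RBF settings are placeholders for an actual setup):

[code]
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data D.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Fit the scaler on the k-1 training folds only...
    scaler = StandardScaler().fit(X[train_idx])
    # ...and apply the *same* parameters to the held-out test fold.
    X_train = scaler.transform(X[train_idx])
    X_test = scaler.transform(X[test_idx])
    clf = SVC(kernel="rbf").fit(X_train, y[train_idx])
    scores.append(clf.score(X_test, y[test_idx]))

# E_cv as the average classification error over the folds.
print("E_cv:", 1 - np.mean(scores))
[/code]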
#2
09-19-2012, 02:23 PM
htlin
NTU

Join Date: Aug 2009
Location: Taipei, Taiwan
Posts: 601

Re: Cross validation and scaling?

It is a tricky question, and the bottom line is: Is scaling considered part of the learning procedure, or just "pre-processing"?

If scaling is pre-processing, scaling the whole training set is legitimate, not snooping. The E_{cv} you get would reflect an estimate of the test performance in the special, pre-processed space.

On the other hand, if scaling is part of learning, scaling should be done on the sub-training part instead. The E_{cv} will then be the estimated performance of (scale, train and then test).

There is no right or wrong for the two choices --- just different viewpoints. In my experience, the performance difference between the two choices (on a locked test set) is often rather negligible in practice, and hence we often see people consider scaling as "pre-processing" for its simplicity of implementation.
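
For concreteness, the two choices map onto scikit-learn code roughly as follows (a minimal sketch; the synthetic data stands in for the training set, and the fold count and RBF settings are placeholders):

[code]
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the full training set D.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Choice 1: scaling as pre-processing -- scale all of D once,
# then cross-validate in the pre-processed space.
X_pre = StandardScaler().fit_transform(X)
scores_pre = cross_val_score(SVC(kernel="rbf"), X_pre, y, cv=5)

# Choice 2: scaling as part of learning -- the pipeline refits
# the scaler on the sub-training part of every fold, so the
# validation fold never influences the scaling parameters.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores_learn = cross_val_score(pipe, X, y, cv=5)

print(scores_pre.mean(), scores_learn.mean())
[/code]

In the second variant the scaler is re-estimated inside every fold, which matches the (scale, train, then test) view above.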

Hope this helps.
__________________
When one teaches, two learn.

Tags
cross-validation, snooping
