12-05-2014, 06:24 AM
magdon
RPI
 
Join Date: Aug 2009
Location: Troy, NY, USA.
Posts: 595
Re: Snooping and unsupervised preprocessing

The degree of snooping can depend on the nature of the unsupervised preprocessing, so to be safe you must leave your validation points out even before any unsupervised filtering. It is safest to adhere to the first part of the quote you mentioned from the Hastie book:

“In general, with a multistep modeling procedure, cross-validation must
be applied to the entire sequence of modeling steps. In particular, samples
must be “left out” before any selection or filtering steps are applied.”

Interestingly, the specific example you mentioned about PCA dimension reduction is considered in Problem 9.10 of e-Chapter 9 which is posted on this forum. If you perform the experiment, you will find that this particular form of unsupervised input snooping can significantly bias your LOO-CV estimate of Eout.
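
If you want to try this kind of experiment yourself, here is a minimal sketch. It is not the setup of Problem 9.10 itself: the linear regression learner, the squared error, the synthetic data, and the helper names (pca_fit, pca_transform, loo_cv_error) are all my own assumptions for illustration. The only difference between the two variants is whether the PCA directions are computed from all N inputs (snooped) or only from the N-1 training inputs of each fold.

```python
import numpy as np


def pca_fit(X, k):
    """Return the mean of X and its top-k principal directions (as rows)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]


def pca_transform(X, mean, components):
    """Center X with the given mean and project onto the given directions."""
    return (X - mean) @ components.T


def loo_cv_error(X, y, k, snoop):
    """Leave-one-out squared error of linear regression on k PCA features.

    snoop=True : PCA is fit once on all N inputs, including the held-out one.
    snoop=False: PCA is refit inside each fold on the N-1 training inputs.
    Labels are never used by the PCA step in either case.
    """
    N = X.shape[0]
    if snoop:
        mean_all, comps_all = pca_fit(X, k)
    errors = []
    for i in range(N):
        train = np.arange(N) != i
        if snoop:
            mean, comps = mean_all, comps_all
        else:
            mean, comps = pca_fit(X[train], k)
        Z_train = pca_transform(X[train], mean, comps)
        z_test = pca_transform(X[i:i + 1], mean, comps)
        # Least-squares fit with an intercept column.
        A = np.column_stack([np.ones(Z_train.shape[0]), Z_train])
        w, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.column_stack([np.ones(1), z_test]) @ w
        errors.append((pred[0] - y[i]) ** 2)
    return float(np.mean(errors))


# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(0)
N, d, k = 40, 10, 3
X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(N)

print("LOO-CV with PCA on all inputs (snooped):", loo_cv_error(X, y, k, snoop=True))
print("LOO-CV with PCA inside each fold:       ", loo_cv_error(X, y, k, snoop=False))
```

A single dataset will not tell you much. To see a bias you would repeat this over many random datasets and compare the average of each estimate with the error measured on a large independent test set. Note also that when k equals the input dimension the two variants should agree, since no dimension is discarded and linear regression with an intercept is unaffected by an invertible affine change of inputs; any difference comes from the choice of which dimensions to drop.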


Quote:
Originally Posted by Don Mathis
In a supervised classification setting, is it data-snooping to perform unsupervised preprocessing on your complete dataset (train + validation), if you never look at the class labels during preprocessing?

For example, suppose you perform PCA on your complete dataset (using all datapoints, without looking at the class labels), then discard some dimensions, and then apply a supervised classifier using the new PCA predictors. Does this constitute data snooping?

From my interpretation of your lecture on data snooping, I expect you would call this snooping. However, this position would be in disagreement with the popular textbook by Hastie, Tibshirani & Friedman 2009 (http://statweb.stanford.edu/~tibs/ElemStatLearn/). On page 246 they seem to say that any unsupervised processing is ok:

“In general, with a multistep modeling procedure, cross-validation must
be applied to the entire sequence of modeling steps. In particular, samples
must be “left out” before any selection or filtering steps are applied. There
is one qualification: initial unsupervised screening steps can be done before
samples are left out.”

Could you please comment on this issue and on the quote from Hastie et al?

Thanks!
__________________
Have faith in probability