12-04-2014, 06:46 AM
Don Mathis
Snooping and unsupervised preprocessing

In a supervised classification setting, is it data snooping to perform unsupervised preprocessing on your complete dataset (training + validation) if you never look at the class labels during preprocessing?

For example, suppose you perform PCA on your complete dataset (using all datapoints, without looking at the class labels), then discard some dimensions, and then apply a supervised classifier using the new PCA predictors. Does this constitute data snooping?
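To make the two procedures I am comparing concrete, here is a minimal sketch in Python with NumPy. The synthetic data, the nearest-centroid classifier, and the helper names (`fit_pca`, `apply_pca`, `nearest_centroid_accuracy`) are my own assumptions for illustration; they do not come from the lecture or the textbook. The only difference between the two protocols is which rows are used to fit the PCA.

```python
import numpy as np

def fit_pca(X, k):
    """Return (mean, components) for the top-k principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def apply_pca(X, mean, components):
    """Project X onto the given components after centering with the given mean."""
    return (X - mean) @ components.T

def nearest_centroid_accuracy(Z_train, y_train, Z_val, y_val):
    """Fit a nearest-centroid classifier on the training split, score on validation."""
    classes = np.unique(y_train)
    centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(Z_val[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == y_val).mean())

# Synthetic data: an assumption for illustration only.
rng = np.random.default_rng(0)
n, d, k = 200, 20, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 1.5 * y  # one feature carries the class signal

X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

# Protocol A: PCA fitted on train + validation features (labels never used).
mean_all, comps_all = fit_pca(X, k)
acc_all = nearest_centroid_accuracy(
    apply_pca(X_train, mean_all, comps_all), y_train,
    apply_pca(X_val, mean_all, comps_all), y_val,
)

# Protocol B: PCA fitted on the training split only, then applied to validation.
mean_tr, comps_tr = fit_pca(X_train, k)
acc_train_only = nearest_centroid_accuracy(
    apply_pca(X_train, mean_tr, comps_tr), y_train,
    apply_pca(X_val, mean_tr, comps_tr), y_val,
)

print("PCA fitted on all rows:      validation accuracy =", acc_all)
print("PCA fitted on training rows: validation accuracy =", acc_train_only)
```

A single run like this cannot settle the question either way; it only pins down what "preprocessing on the complete dataset" means operationally, namely that in Protocol A the validation inputs influence the mean and the principal directions used to build the predictors.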

From my interpretation of your lecture on data snooping, I expect you would call this snooping. However, this position would be in disagreement with the popular textbook by Hastie, Tibshirani & Friedman (2009) (http://statweb.stanford.edu/~tibs/ElemStatLearn/). On page 246 they seem to say that any unsupervised processing is OK:

“In general, with a multistep modeling procedure, cross-validation must
be applied to the entire sequence of modeling steps. In particular, samples
must be “left out” before any selection or filtering steps are applied. There
is one qualification: initial unsupervised screening steps can be done before
samples are left out.”

Could you please comment on this issue and on the quote from Hastie et al.?

Thanks!