Snooping and unsupervised preprocessing
In a supervised classification setting, is it data-snooping to perform unsupervised preprocessing on your complete dataset (train + validation), if you never look at the class labels during preprocessing?
For example, suppose you perform PCA on your complete dataset (using all data points, without looking at the class labels), then discard some dimensions, and then apply a supervised classifier to the new PCA predictors. Does this constitute data snooping? From my interpretation of your lecture on data snooping, I expect you would call this snooping. However, that position would disagree with the popular textbook by Hastie, Tibshirani & Friedman (2009, http://statweb.stanford.edu/~tibs/ElemStatLearn/). On page 246 they seem to say that any unsupervised processing is OK:

“In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be “left out” before any selection or filtering steps are applied. There is one qualification: initial unsupervised screening steps can be done before samples are left out.”

Could you please comment on this issue and on the quote from Hastie et al.? Thanks!
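For concreteness, here is a minimal sketch of the two procedures I am asking about (Python with numpy and scikit-learn; the data X, y and all parameter choices are hypothetical, just to make the pipelines explicit):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))    # hypothetical inputs
    y = rng.integers(0, 2, size=200)  # hypothetical class labels (never seen by PCA)

    # Procedure in question: PCA is fit on ALL the data, then cross-validation
    # is run on the reduced inputs, so every validation fold has already
    # influenced the projection.
    Z = PCA(n_components=3).fit_transform(X)
    score_pca_outside = cross_val_score(LogisticRegression(), Z, y, cv=10).mean()

    # Alternative: PCA is refit inside each training fold, so the left-out
    # points never touch the preprocessing step.
    pipe = make_pipeline(PCA(n_components=3), LogisticRegression())
    score_pca_inside = cross_val_score(pipe, X, y, cv=10).mean()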
Re: Snooping and unsupervised preprocessing
The degree of snooping can depend on the nature of the unsupervised preprocessing, so to be safe you must leave your points out even before unsupervised filtering. It is safest to adhere to the first part of the quote you mention from the Hastie book:
“In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be “left out” before any selection or filtering steps are applied.”

Interestingly, the specific example you mention, PCA dimension reduction, is considered in Problem 9.10 of e-Chapter 9, which is posted on this forum. If you perform the experiment, you will find that this particular form of unsupervised input snooping can significantly bias your LOO-CV estimate of Eout.
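As a rough sketch of that experiment (Python with numpy and scikit-learn; the linear regression model with squared error and the number of PCA components k are illustrative choices, not necessarily the problem's exact setup):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneOut

    def loocv_error(X, y, k, snoop):
        # LOO-CV squared error. If snoop is True, PCA is fit once on all the
        # inputs (input snooping); otherwise it is refit inside each fold.
        if snoop:
            Z_all = PCA(n_components=k).fit_transform(X)
        errors = []
        for train, test in LeaveOneOut().split(X):
            if snoop:
                Z_tr, Z_te = Z_all[train], Z_all[test]
            else:
                pca = PCA(n_components=k).fit(X[train])
                Z_tr, Z_te = pca.transform(X[train]), pca.transform(X[test])
            h = LinearRegression().fit(Z_tr, y[train])
            errors.append(((h.predict(Z_te) - y[test]) ** 2).item())
        return np.mean(errors)

Averaging loocv_error with snoop=True and snoop=False over many randomly generated datasets exposes the gap between the two estimates.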
Re: Snooping and unsupervised preprocessing
Thanks to you both for replying.
It seems you disagree a bit about how significant the bias might be. I tried exercise 9.10 and got the following results (with 40,000 repetitions):

LOO-CV:
  PCA outside validation: E1 = 2.041 +- 0.008 (1 std err)
  PCA inside validation:  E2 = 2.530 +- 0.010

That strikes me as a rather large bias, especially considering it's a linear model and we're only omitting one point. I also tried holdout validation:

50% holdout:
  PCA outside validation: E1 = 2.240 +- 0.006 (1 std err)
  PCA inside validation:  E2 = 2.569 +- 0.007

I expected the bias to be larger in this case, but it's actually smaller.

I originally asked this question because I'm interested in preprocessing with more flexible nonparametric unsupervised feature learning (UFL) algorithms, and I wonder whether the bias would be even larger for these. My intuition about why there could be significant bias goes something like this. Generally speaking, a nonparametric UFL algorithm ought to allocate representational capacity to an area of the input space in proportion to how much "statistically significant structure" is present there. Using all the data, there will be a certain amount of such structure. Inside a validation fold, even though the underlying structure is the same, there will be less statistically significant structure, simply because there is insufficient data to reveal it all. So the in-fold UFL will deliver a more impoverished representation of the input than the 'full-data' UFL, and may miss some useful structure. Furthermore, it will do no good to simply tell the in-fold UFL to allocate the same amount of capacity that the 'full' UFL used, because it would not know where to allocate it: there will be many places in the input space with 'almost significant' structure, but some of those will really be just noise. The advantage of the 'full' UFL is that it knows which of those areas have the real structure, and so doesn't waste capacity modeling noise (overfitting).

Ultimately, I want to know whether the bias introduced by running UFL on all the data is "tolerable". I'm still not sure! Hastie et al. seem to think so, but we seem to be coming to the opposite conclusion here. Thanks again!
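The 50% holdout version of the comparison is the same idea with a single split. A sketch of what I mean (again Python/scikit-learn, with the regression model and k as hypothetical choices):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    def holdout_error(X, y, k, snoop, seed=0):
        # 50% holdout squared error. With snooping, PCA also sees the
        # held-out half before the split is made.
        if snoop:
            Z = PCA(n_components=k).fit_transform(X)
            Z_tr, Z_te, y_tr, y_te = train_test_split(
                Z, y, test_size=0.5, random_state=seed)
        else:
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.5, random_state=seed)
            pca = PCA(n_components=k).fit(X_tr)
            Z_tr, Z_te = pca.transform(X_tr), pca.transform(X_te)
        h = LinearRegression().fit(Z_tr, y_tr)
        return np.mean((h.predict(Z_te) - y_te) ** 2)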
Re: Snooping and unsupervised preprocessing
Interesting observations, and yes, when there is input snooping it can be counter-intuitive. Here is one way to interpret your result. Input snooping with LOO-CV lets you peek at the test input. This allows you to focus your learning on improving the prediction on that test input, for example by tailoring your PCA to include the test input.
When you do 50% holdout, you are input-snooping a set of test inputs (half the data), so while the learning can focus on these test inputs, it can do so only on 'average', and cannot excessively input-snoop any particular one. With LOO-CV, you can focus on snooping one test input at a time, which explains why the bias with LOO-CV can be higher.