#1
In a supervised classification setting, is it data-snooping to perform unsupervised preprocessing on your complete dataset (train + validation), if you never look at the class labels during preprocessing?
For example, suppose you perform PCA on your complete dataset (using all data points, without looking at the class labels), then discard some dimensions, and then apply a supervised classifier to the new PCA predictors. Does this constitute data snooping? From my interpretation of your lecture on data snooping, I expect you would call this snooping. However, that position would disagree with the popular textbook by Hastie, Tibshirani & Friedman 2009 (http://statweb.stanford.edu/~tibs/ElemStatLearn/). On page 246 they seem to say that any unsupervised processing is fine:

“In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be “left out” before any selection or filtering steps are applied. There is one qualification: initial unsupervised screening steps can be done before samples are left out.”

Could you please comment on this issue and on the quote from Hastie et al.? Thanks!
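For concreteness, here is a minimal sketch of the two protocols in question (scikit-learn assumed; the data and all names are illustrative, not from any particular example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# (a) "Unsupervised snooping": PCA sees every input, including future
#     validation inputs, before any point is held out. Labels are never used.
Z = PCA(n_components=5).fit_transform(X)
score_a = cross_val_score(LogisticRegression(), Z, y, cv=10).mean()

# (b) No snooping: PCA is a pipeline step, so each CV fold refits it on
#     the training inputs of that fold only.
pipe = make_pipeline(PCA(n_components=5), LogisticRegression())
score_b = cross_val_score(pipe, X, y, cv=10).mean()

print(f"PCA outside CV: {score_a:.3f}   PCA inside CV: {score_b:.3f}")
```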
#2
Strictly speaking, one can construct an unsupervised scheme (model selection based on the data inputs only, without involving the output labels) that ruins the chances of good generalization in subsequent supervised learning; an admittedly extreme example is a scheme that tailors the model to the particular input points in the data set.

Practically speaking, 'reasonable' processing and model selection based on the data inputs does not normally contaminate the data much for subsequent supervised learning, so it is not unreasonable to state, as a heuristic rule, that unsupervised processing of the data does not constitute data snooping. In the financial-data experiment of Example 5.3 (also in Lecture 17 of the LFD course), the normalization step that constituted data snooping involved the outputs as well: it was performed on the daily returns before the data were structured into inputs and outputs, and the outputs were indeed among those daily returns.
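To make the Example 5.3 point concrete, here is a hedged sketch (synthetic returns, not the book's actual experiment) of how normalizing the full return series before forming inputs and outputs leaks output information:

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.standard_normal(1000) * 0.01   # stand-in for daily returns

# Snooping: the normalization statistics are computed over ALL daily
# returns, including the future returns that will become the outputs.
normalized = (returns - returns.mean()) / returns.std()

# Only afterwards is the series structured into inputs and outputs:
# predict tomorrow's return from the previous d days.
d = 20
X = np.array([normalized[t - d:t] for t in range(d, len(normalized))])
y = normalized[d:]   # the outputs were among the normalized returns

# Snoop-free alternative: compute mean/std on the training period only,
# then apply those fixed statistics to all later data.
split = 500
mu, sigma = returns[:split].mean(), returns[:split].std()
clean = (returns - mu) / sigma
```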
__________________
Where everyone thinks alike, no one thinks very much
#3
The degree of snooping can depend on the nature of the unsupervised preprocessing, so to be safe you must leave your points out even before unsupervised filtering. It is safest to adhere to the first part of the quote you mentioned from the Hastie book:

“In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be “left out” before any selection or filtering steps are applied.”

Interestingly, the specific example you mentioned, PCA dimension reduction, is considered in Problem 9.10 of e-Chapter 9, which is posted on this forum. If you perform the experiment, you will find that this particular form of unsupervised input snooping can significantly bias your LOO-CV estimate of Eout.
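For readers without the e-Chapter at hand, here is a sketch of the kind of comparison involved (a toy setup of my own, not necessarily the exact one in Problem 9.10, which also averages over many repetitions):

```python
import numpy as np

def loo_cv_error(X, y, snoop):
    """LOO-CV squared error of a 1-D linear fit on the top PCA feature."""
    N = len(y)
    if snoop:
        # Input snooping: the PCA direction is computed once from ALL
        # inputs, so it has already "seen" each left-out test input.
        v = np.linalg.svd(X - X.mean(axis=0))[2][0]
    errs = []
    for i in range(N):
        tr = np.delete(np.arange(N), i)
        if not snoop:
            # No snooping: the direction is recomputed inside each fold
            # from the N-1 training inputs only.
            v = np.linalg.svd(X[tr] - X[tr].mean(axis=0))[2][0]
        w = np.polyfit(X[tr] @ v, y[tr], 1)            # line with intercept
        errs.append((np.polyval(w, X[i] @ v) - y[i]) ** 2)
    return np.mean(errs)    # gap between the two runs reflects the snooping bias

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 5))
y = rng.standard_normal(10)
print("PCA outside validation:", loo_cv_error(X, y, snoop=True))
print("PCA inside validation: ", loo_cv_error(X, y, snoop=False))
```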
__________________
Have faith in probability
#4
Thanks to you both for replying.
It seems you two disagree a bit about how significant the bias might be. I tried Problem 9.10 and got the following results (40,000 repetitions):

LOO-CV:
- PCA outside validation: E1 = 2.041 ± 0.008 (1 std. err.)
- PCA inside validation: E2 = 2.530 ± 0.010

That strikes me as a rather large bias, especially considering it's a linear model and we're only omitting one point. I also tried holdout validation:

50% holdout:
- PCA outside validation: E1 = 2.240 ± 0.006 (1 std. err.)
- PCA inside validation: E2 = 2.569 ± 0.007

I expected the bias to be larger in this case, but it's actually smaller.

I originally asked this question because I'm interested in preprocessing with more flexible nonparametric unsupervised feature learning (UFL) algorithms, and I wonder if the bias would be even larger for these. My intuition about why there could be significant bias goes something like this. Generally speaking, a nonparametric UFL algorithm ought to allocate representational capacity to an area of the input space in proportion to how much "statistically significant structure" is present there. Using all the data, there will be a certain amount of such structure. But inside a validation fold, even though the underlying structure is the same, there will be less statistically significant structure, simply because there is not enough data to reveal all of it. So the in-fold UFL will deliver a more impoverished representation of the input than the full-data UFL, and may miss some useful structure. Nor would it help to simply tell the in-fold UFL to allocate the same amount of capacity that the full-data UFL used, because it would not know where to allocate it: there will be many places in the input space with "almost significant" structure, and some of those will really be just noise. The advantage of the full-data UFL is that it knows which of those areas have the real structure, and so does not waste capacity modeling noise (overfitting). (One way to test this empirically is sketched below.)

Ultimately, I want to know if the bias introduced by running UFL on all the data is "tolerable". I'm still not sure! Hastie et al. seem to think so, but we seem to be reaching the opposite conclusion here. Thanks again!
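Here is the promised sketch of one way to probe the conjecture, with k-means distance features as a stand-in for a flexible UFL step (scikit-learn assumed; the setup is illustrative and mine, not Problem 9.10's):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

def loo_error(X, y, snoop, k=10):
    """LOO-CV squared error with k-means distance features as the UFL step."""
    if snoop:
        # UFL fit on ALL inputs, including each future left-out point.
        centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_
    # Feature map: distance from each point to each learned center.
    phi = lambda A, C: np.linalg.norm(A[:, None, :] - C[None, :, :], axis=2)
    errs = []
    for tr, te in LeaveOneOut().split(X):
        if not snoop:
            # UFL refit on each fold's training inputs only.
            centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[tr]).cluster_centers_
        model = LinearRegression().fit(phi(X[tr], centers), y[tr])
        pred = model.predict(phi(X[te], centers))[0]
        errs.append((pred - y[te][0]) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 5))
y = rng.standard_normal(30)
print("UFL outside validation:", loo_error(X, y, snoop=True))
print("UFL inside validation: ", loo_error(X, y, snoop=False))
```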
#5
Interesting observations, and yes, when there is input snooping, the results can be counter-intuitive. Here is one way to interpret them. Input snooping with LOO-CV lets you peek at the test input. This allows you to focus your learning on improving the prediction at that test input, for example by tailoring your PCA to include it.

When you do 50% holdout, you are input-snooping a whole set of test inputs (half the data), so while the learning can be focused on those test inputs, it can only be focused on them 'on average', and no single test input can be excessively snooped. With LOO-CV, you can focus on snooping one test input at a time, which explains why the bias with LOO-CV can be higher.
__________________
Have faith in probability