12-19-2014, 07:09 AM
magdon
Re: Snooping and unsupervised preprocessing

Interesting observations, and yes, when there is input snooping, the results can be counter-intuitive. Here is one way to interpret your result. Input snooping with LOO-CV lets you peek at the test input. This allows you to focus your learning on improving the prediction for that test input, for example by tailoring your PCA to include the test input.

When you do 50% holdout, you are input-snooping a whole set of test inputs (half the data). You can still focus the learning on those test inputs, but only on 'average', so you cannot excessively input-snoop any particular one. With LOO-CV, you snoop on one test input at a time, which explains why the bias with LOO-CV can be higher.
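
To make the two procedures concrete, here is a minimal sketch of LOO-CV with PCA done once on all the inputs (snooped) versus PCA redone inside each fold (clean). It is not the setup of exercise 9.10: the Gaussian data, the noise level, the sizes n, d, k, and all function names are assumptions made up for illustration.

```python
import numpy as np

def pca_fit(X, k):
    """Return (mean, top-k principal directions) computed from the rows of X."""
    mu = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k].T

def fit_predict(Z_train, y_train, Z_test):
    """Least-squares linear fit with an intercept; predict on Z_test."""
    A = np.column_stack([np.ones(len(Z_train)), Z_train])
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(Z_test)), Z_test]) @ w

def loo_cv_errors(X, y, k):
    """Return (snooped, clean) LOO-CV estimates of squared error."""
    n = len(X)
    # Snooped: PCA sees every input, including each future test input.
    mu_all, V_all = pca_fit(X, k)
    Z_all = (X - mu_all) @ V_all
    snooped = clean = 0.0
    for i in range(n):
        tr = np.arange(n) != i  # boolean mask: everything except point i
        # PCA outside validation: reuse the features built from all inputs.
        pred = fit_predict(Z_all[tr], y[tr], Z_all[i:i + 1])[0]
        snooped += (pred - y[i]) ** 2
        # PCA inside validation: refit PCA on the n-1 training inputs only.
        mu, V = pca_fit(X[tr], k)
        pred = fit_predict((X[tr] - mu) @ V, y[tr], (X[i:i + 1] - mu) @ V)[0]
        clean += (pred - y[i]) ** 2
    return snooped / n, clean / n

def run(reps=200, n=40, d=5, k=1, seed=0):
    """Average both estimates over many synthetic data sets (assumed model)."""
    rng = np.random.default_rng(seed)
    results = np.empty((reps, 2))
    for r in range(reps):
        X = rng.standard_normal((n, d))
        w = rng.standard_normal(d)
        y = X @ w + 0.5 * rng.standard_normal(n)  # assumed linear target + noise
        results[r] = loo_cv_errors(X, y, k)
    mean = results.mean(axis=0)
    stderr = results.std(axis=0, ddof=1) / np.sqrt(reps)
    return mean, stderr

if __name__ == "__main__":
    mean, stderr = run()
    print("PCA outside validation (snooped): %.3f +- %.3f" % (mean[0], stderr[0]))
    print("PCA inside validation (clean):    %.3f +- %.3f" % (mean[1], stderr[1]))
```

The only difference between the two estimates is whether the left-out input takes part in the PCA. How large the gap comes out depends on the assumed data model, so the numbers from this sketch should not be compared directly with the ones quoted below.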

Originally Posted by Don Mathis:
Thanks to you both for replying.

It seems you disagree a bit about how significant the bias might be?

I tried exercise 9.10 and got the following results (with 40,000 repetitions):

PCA outside validation: E1 = 2.041 +- .008 (1 std err)
PCA inside validation: E2 = 2.530 +- .010

That strikes me as a rather large bias, especially considering it's a linear model and we're only omitting 1 point.
I also tried holdout validation:

PCA outside validation: E1 = 2.240 +- .006 (1 std err)
PCA inside validation: E2 = 2.569 +- .007

I expected the bias to be larger in this case, but it's actually smaller.

I originally asked this question because I'm interested in preprocessing with more flexible nonparametric unsupervised feature learning (UFL) algorithms. I wonder if the bias would be even larger for these. The intuition I have about why there could be significant bias goes something like this:

Generally speaking, a nonparametric UFL algorithm ought to allocate representational capacity to an area of the input space in proportion to how much "statistically significant structure" is present there. Using all the data, there will be a certain amount of such structure. But inside a validation fold, even though the underlying structure is the same, there will be less statistically significant structure simply because there is insufficient data to show it all. So the in-fold UFL will deliver a more impoverished representation of the input than the 'full-data' UFL, and may miss some useful structure. Furthermore, it will do no good to simply tell the in-fold UFL to allocate the same amount of capacity that the 'full' UFL used, because it would not know where to allocate it -- there will be many places in the input that have 'almost significant' structure, but some of those will really be just noise. The advantage of the 'full' UFL is that it knows which of those areas has the real structure, and so doesn't waste capacity modeling noise (overfitting).

Ultimately, I want to know if the bias introduced by running UFL on all the data is "tolerable". I'm still not sure! Hastie et al. seem to think so, but we seem to be coming to the opposite conclusion here.

Thanks again!
Have faith in probability