05-12-2016, 02:24 AM
elyoum
Junior Member
Join Date: May 2016
Posts: 3
Re: Snooping and unsupervised preprocessing

Originally Posted by Don Mathis:
Thanks to you both for replying.

It seems you disagree a bit about how significant the bias might be?

I tried exercise 9.10 and got the following results (with 40,000 repetitions):

PCA outside validation: E1 = 2.041 ± 0.008 (1 std err)
PCA inside validation: E2 = 2.530 ± 0.010

That strikes me as a rather large bias, especially considering it's a linear model and we're only omitting 1 point.
I also tried holdout validation:

PCA outside validation: E1 = 2.240 ± 0.006 (1 std err)
PCA inside validation: E2 = 2.569 ± 0.007

I expected the bias to be larger in this case, but it's actually smaller.
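For anyone who wants to reproduce the flavor of this comparison, here is a minimal numpy sketch of the leave-one-out version. It is not the exact exercise 9.10 setup: the data-generating process, the dimensions, and the choice of projecting onto the top principal component as the single regression feature are all my own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_pc(X):
    """First principal component (direction) of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

def loo_error(X, y, snoop):
    """Leave-one-out squared error of linear regression on the top PC.

    snoop=True  -> PCA is fit once on ALL the data (outside validation).
    snoop=False -> PCA is refit inside each fold (inside validation).
    """
    n = len(y)
    v_all = top_pc(X)                      # the 'snooping' direction
    errs = []
    for i in range(n):
        tr = np.arange(n) != i             # boolean mask: all but point i
        v = v_all if snoop else top_pc(X[tr])
        z_tr = X[tr] @ v                   # 1-D projected training feature
        z_te = X[i] @ v                    # projected held-out point
        # least-squares fit (with intercept) on the 1-D feature
        A = np.column_stack([np.ones(tr.sum()), z_tr])
        w, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
        errs.append((w[0] + w[1] * z_te - y[i]) ** 2)
    return float(np.mean(errs))

# Toy data: the target depends on the inputs through a noisy linear rule.
n, d = 30, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=0.5, size=n)

e_outside = loo_error(X, y, snoop=True)    # PCA outside validation
e_inside  = loo_error(X, y, snoop=False)   # PCA inside validation
print(e_outside, e_inside)
```

On a single dataset the two estimates can land either way; it is only after averaging over many repetitions (like the 40,000 above) that the systematic optimism of fitting PCA outside the validation loop shows up.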

I originally asked this question because I'm interested in preprocessing with more flexible nonparametric unsupervised feature learning (UFL) algorithms. I wonder if the bias would be even larger for these. The intuition I have about why there could be significant bias goes something like this:

Generally speaking, a nonparametric UFL algorithm ought to allocate representational capacity to an area of the input space in proportion to how much "statistically significant structure" is present there. Using all the data, there will be a certain amount of such structure. But inside a validation fold, even though the underlying structure is the same, there will be less statistically significant structure simply because there is insufficient data to show it all. So the in-fold UFL will deliver a more impoverished representation of the input than the 'full-data' UFL, and may miss some useful structure.

Furthermore, it will do no good to simply tell the in-fold UFL to allocate the same amount of capacity that the 'full' UFL used, because it would not know where to allocate it -- there will be many places in the input that have 'almost significant' structure, but some of those will really be just noise. The advantage of the 'full' UFL is that it knows which of those areas has the real structure, and so doesn't waste capacity modeling noise (overfitting).
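As a crude illustration of the capacity argument, here is a hypothetical sketch using k-means as the unsupervised feature learner: with a fixed budget of k centers, the learner fit on a small in-fold sample tends to place its capacity more noisily than the learner fit on the full sample, even though the underlying structure is identical. Everything here (the mixture data, k, the sample sizes) is made up for illustration and is much simpler than the nonparametric setting discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm; returns the k learned centers."""
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(axis=0)
    return C

def distortion(X, C):
    """Mean squared distance from each point to its nearest center."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).mean())

def sample(n):
    """Mixture of 4 well-separated 2-D Gaussians (the 'real structure')."""
    means = np.array([[0, 0], [6, 0], [0, 6], [6, 6]], dtype=float)
    comp = rng.integers(0, 4, size=n)
    return means[comp] + rng.normal(size=(n, 2))

full = sample(400)      # 'all the data'
fold = full[:15]        # a small in-fold training sample
test = sample(1000)     # fresh data to evaluate the representations

C_full = kmeans(full, k=4)   # capacity placed using all the data
C_fold = kmeans(fold, k=4)   # same capacity budget, in-fold data only
d_full = distortion(test, C_full)
d_fold = distortion(test, C_fold)
print(d_full, d_fold)
```

The in-fold centers typically sit farther from the true mixture means, so the in-fold representation is usually worse on fresh data -- which is the variance side of the argument. The "statistical significance" side (knowing *where* capacity is worth spending) would need a learner that chooses its own capacity, which this fixed-k sketch deliberately avoids.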

Ultimately, I want to know whether the bias introduced by running UFL on all the data is "tolerable". I'm still not sure! Hastie et al. seem to think so, but we seem to be coming to the opposite conclusion here.

Thanks again!