LFD Book Forum  

Book Feedback - Learning From Data > Chapter 5 - Three Learning Principles

#1
12-04-2014, 06:46 AM
Don Mathis (Junior Member; Join Date: Dec 2014; Posts: 2)
Snooping and unsupervised preprocessing

In a supervised classification setting, is it data-snooping to perform unsupervised preprocessing on your complete dataset (train + validation), if you never look at the class labels during preprocessing?

For example, suppose you perform PCA on your complete dataset (using all datapoints, without looking at the class labels), then discard some dimensions, and then apply a supervised classifier using the new PCA predictors. Does this constitute data snooping?
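For concreteness, here is a minimal sketch of the two pipelines I have in mind (scikit-learn, with a synthetic placeholder dataset and an arbitrary classifier; only the placement of the PCA step differs):

Code:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# (a) Unsupervised preprocessing on the complete dataset: PCA is fit on all
# inputs (labels never used), then cross-validation sees only the reduced inputs.
X_reduced = PCA(n_components=5).fit_transform(X)
cv_a = cross_val_score(LogisticRegression(max_iter=1000), X_reduced, y, cv=10).mean()

# (b) PCA refit inside each training fold, so held-out points never influence
# the projection they are evaluated on.
pipe = make_pipeline(PCA(n_components=5), LogisticRegression(max_iter=1000))
cv_b = cross_val_score(pipe, X, y, cv=10).mean()

print(cv_a, cv_b)   # is the gap between these two estimates "snooping"?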

From my interpretation of your lecture on data snooping, I expect you would call this snooping. However, that position seems to disagree with the popular textbook by Hastie, Tibshirani & Friedman (2009) (http://statweb.stanford.edu/~tibs/ElemStatLearn/). On page 246 they seem to say that any unsupervised processing is ok:

“In general, with a multistep modeling procedure, cross-validation must
be applied to the entire sequence of modeling steps. In particular, samples
must be “left out” before any selection or filtering steps are applied. There
is one qualification: initial unsupervised screening steps can be done before
samples are left out.”

Could you please comment on this issue and on the quote from Hastie et al.?

Thanks!
#2
12-04-2014, 11:01 AM
yaser (Caltech; Join Date: Aug 2009; Location: Pasadena, California, USA; Posts: 1,474)
Re: Snooping and unsupervised preprocessing

Strictly speaking, one can construct an unsupervised scheme (model selection based on data inputs only, without involving the output labels) that can ruin the chances for good generalization in subsequent use of supervised learning. One example, admittedly extreme, is to take $\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_N$ and select a model that is free to choose the values of the output on this specific set of points, but forces an output of zero outside that set. This model can then achieve $E_{\rm in}=0$ in supervised learning by matching $y_1,y_2,\dots,y_N$, but will obviously perform poorly out of sample since it is forced to output 0 on any other point.
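In code, that extreme construction might look like the following toy sketch (purely illustrative, not from the book):

Code:
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))       # inputs used for the "unsupervised" model choice
y_train = np.sign(rng.normal(size=20))   # +/-1 labels, not looked at yet

# Hypothesis set chosen from the inputs alone: free to take any value on these
# N points, but forced to output 0 everywhere else.
table = {}

def h(x):
    return table.get(tuple(x), 0.0)

# Supervised step: fill the table to match y_1,...,y_N, giving E_in = 0.
for x, y in zip(X_train, y_train):
    table[tuple(x)] = y

E_in = np.mean([h(x) != y for x, y in zip(X_train, y_train)])
X_test, y_test = rng.normal(size=(1000, 3)), np.sign(rng.normal(size=1000))
E_out = np.mean([h(x) != y for x, y in zip(X_test, y_test)])
print(E_in, E_out)    # 0.0 in sample, ~1.0 out of sample (h outputs 0 on unseen points)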

Practically speaking, 'reasonable' processing and model selection based on the data inputs doesn't normally contaminate the data much for subsequent supervised learning, so it is not unreasonable to state a heuristic rule that unsupervised processing of the data does not constitute data snooping.

In the financial data experiment given in Example 5.3 (also in Lecture 17 of the LFD course), the normalization step that constituted data snooping involved the outputs as well, since it was done on the daily returns before structuring the data into inputs and outputs, and the outputs were indeed among the daily returns.
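Schematically (my own sketch, not the code from the book), the contamination in that example arises like this:

Code:
import numpy as np

rng = np.random.default_rng(1)
r = rng.normal(size=500)          # placeholder daily returns r_0, ..., r_T
d = 20                            # length of the input window

# Snooping version: normalize using the mean/std of ALL returns, including the
# future returns that become the outputs y_n below.
r_norm = (r - r.mean()) / r.std()

# Structure the normalized series into (input, output) pairs:
#   x_n = the previous d returns,  y_n = the next day's return.
X = np.array([r_norm[n - d:n] for n in range(d, len(r))])
y = r_norm[d:]

# Any train/test split of (X, y) made at this point is already contaminated:
# the test-day outputs influenced the normalization of every training input.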
__________________
Where everyone thinks alike, no one thinks very much
#3
12-05-2014, 07:24 AM
magdon (RPI; Join Date: Aug 2009; Location: Troy, NY, USA; Posts: 592)
Re: Snooping and unsupervised preprocessing

The degree of snooping can depend on the nature of the unsupervised preprocessing, so to be safe you must leave your points out even before unsupervised filtering. It is safest to adhere to the first part of the quote you mentioned from the Hastie book:

“In general, with a multistep modeling procedure, cross-validation must
be applied to the entire sequence of modeling steps. In particular, samples
must be “left out” before any selection or filtering steps are applied."

Interestingly, the specific example you mentioned about PCA dimension reduction is considered in Problem 9.10 of e-Chapter 9, which is posted on this forum. If you perform the experiment, you will find that this particular form of unsupervised input snooping can significantly bias your LOO-CV estimate of $E_{\rm out}$.
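A rough sketch of that comparison (the data model, dimensions, and number of components below are placeholders of my own, not the problem's actual specification) is:

Code:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def loo_mse(X, y, pca_inside, k=1):
    """Leave-one-out squared error; PCA fit with or without the left-out point."""
    errs = []
    pca_all = PCA(n_components=k).fit(X)              # sees every input, incl. the test point
    for i in range(len(y)):
        train = np.delete(np.arange(len(y)), i)
        pca = PCA(n_components=k).fit(X[train]) if pca_inside else pca_all
        reg = LinearRegression().fit(pca.transform(X[train]), y[train])
        errs.append((reg.predict(pca.transform(X[i:i + 1]))[0] - y[i]) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(2)
e_snoop, e_clean = [], []
for _ in range(200):                                  # a modest number of repetitions
    X = rng.normal(size=(40, 5))
    y = X @ rng.normal(size=5) + rng.normal(size=40)
    e_snoop.append(loo_mse(X, y, pca_inside=False))
    e_clean.append(loo_mse(X, y, pca_inside=True))
print(np.mean(e_snoop), np.mean(e_clean))             # the snooped estimate tends to be lower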


Quote:
Originally Posted by Don Mathis (post #1)
__________________
Have faith in probability
#4
12-11-2014, 12:10 PM
Don Mathis (Junior Member; Join Date: Dec 2014; Posts: 2)
Re: Snooping and unsupervised preprocessing

Thanks to you both for replying.

It seems you disagree a bit about how significant the bias might be?

I tried exercise 9.10 and got the following results (with 40,000 repetitions):

LOOCV:
PCA outside validation: E1 = 2.041 ± 0.008 (1 std err)
PCA inside validation:  E2 = 2.530 ± 0.010

That strikes me as a rather large bias, especially considering it's a linear model and we're only omitting 1 point.
I also tried holdout validation:

50% HOLDOUT:
PCA outside validation: E1 = 2.240 ± 0.006 (1 std err)
PCA inside validation:  E2 = 2.569 ± 0.007

I expected the bias to be larger in this case, but it's actually smaller.

I originally asked this question because I'm interested in preprocessing with more flexible nonparametric unsupervised feature learning (UFL) algorithms. I wonder if the bias would be even larger for these. The intuition I have about why there could be significant bias goes something like this:

Generally speaking, a nonparametric UFL algorithm ought to allocate representational capacity to an area of the input space in proportion to how much "statistically significant structure" is present there. Using all the data, there will be a certain amount of such structure. But inside a validation fold, even though the underlying structure is the same, there will be less statistically significant structure simply because there is insufficient data to show it all. So the in-fold UFL will deliver a more impoverished representation of the input than the 'full-data' UFL, and may miss some useful structure.

Furthermore, it will do no good to simply tell the in-fold UFL to allocate the same amount of capacity that the 'full' UFL used, because it would not know where to allocate it -- there will be many places in the input space that have 'almost significant' structure, but some of those will really be just noise. The advantage of the 'full' UFL is that it knows which of those areas has the real structure, and so doesn't waste capacity modeling noise (overfitting).
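To make that concrete, here is the kind of in-fold vs. full-data comparison I have in mind, with a k-means feature map standing in for the UFL step (all the particulars here are placeholders of my own):

Code:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
k = 10                                     # number of learned "features" (centroids)

def loo_error(X, y, ufl_inside):
    errs = []
    km_all = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)   # UFL on all inputs
    for i in range(len(y)):
        train = np.delete(np.arange(len(y)), i)
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[train]) if ufl_inside else km_all
        Z_tr, z_te = km.transform(X[train]), km.transform(X[i:i + 1])  # distances to centroids
        clf = LogisticRegression(max_iter=1000).fit(Z_tr, y[train])
        errs.append(clf.predict(z_te)[0] != y[i])
    return np.mean(errs)

print(loo_error(X, y, ufl_inside=False), loo_error(X, y, ufl_inside=True))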

Ultimately, I want to know if the bias introduced by running UFL on all the data is "tolerable". I'm still not sure! Hastie et al. seem to think so, but we seem to be coming to the opposite conclusion here.

Thanks again!
#5
12-19-2014, 08:09 AM
magdon (RPI; Join Date: Aug 2009; Location: Troy, NY, USA; Posts: 592)
Re: Snooping and unsupervised preprocessing

Interesting observations, and yes, when there is input snooping, the results can be counter-intuitive. Here is one way to interpret your result. Input snooping with LOO-CV lets you peek at the test input. This allows you to focus your learning to improve the prediction on that test input, for example by tailoring your PCA to include the test input.

When you do 50% holdout, you are input-snooping a set of test inputs (half the data), so while you can focus the learning on these test inputs, you can do so only on 'average', and you will not be able to excessively input-snoop any particular one. With LOO-CV, you can focus the snooping on one test input at a time, which explains why the bias with LOO-CV can be higher.



Quote:
Originally Posted by Don Mathis (post #4)
__________________
Have faith in probability
#6
05-12-2016, 03:24 AM
elyoum (Junior Member; Join Date: May 2016; Posts: 3)
Re: Snooping and unsupervised preprocessing

Quote:
Originally Posted by Don Mathis (post #4)
Interesting