#6 | 11-23-2014, 07:37 AM
magdon (RPI)
Re: Data Snooping with Test Set Inputs Intuition

Your questions are indeed subtle.

It is very important to heed the warning at the bottom of page 9-5.

I highly recommend problem 9.10 as a concrete example of what can go wrong.

The problem that occurs can be illustrated with PCA, which performs a form of dimensionality reduction. PCA identifies an 'optimal' lower-dimensional manifold on which the data sit. If you identify this manifold using the test inputs, then you will (in some sense) be throwing away the least possible amount of the test inputs' information, retaining only the part of each test input that survives in the optimal lower dimension. If, instead, you do the PCA using only the training data, you will construct the lower-dimensional manifold that throws away the least information in your training set. When you then apply this manifold to the test data (for which it was not optimal), you may find that you have thrown away important information in the test inputs, which will hurt your test error.

The golden rule is that to make predictions on your test set, you can *only* use information from your training set. That is the way it is in practice, and that is the way you should evaluate yourself during the learning phase.
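For instance, with input normalization (a common preprocessing step), the golden rule means the normalization parameters must come from the training inputs alone. A minimal numpy sketch, with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# Normalization parameters come from the TRAINING inputs only...
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)

# ...and those same parameters are reused, unchanged, on the test inputs.
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma   # never refit mu, sigma on X_test
```

The test inputs end up only approximately zero-mean, and that is exactly as it should be: the test set has contributed nothing to the pipeline.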

Here is a very simple way to check whether you have data-snooped. Before you do any learning, assume the data has been split into a training set and a test set for you. Run your entire learning process and output your final hypothesis g. Now set all the inputs in your test set to strange values, like 0, and give them random target labels. Run your entire learning process again on this new pair (training set, perturbed test set) and output your final hypothesis g'. If g' ≠ g, then there has been data snooping: the test set is in some way influencing your choice of g.
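This check can be sketched in a few lines. The pipeline `learn` below is hypothetical (a toy least-squares fit, not the book's method); its snooping variant pools the test inputs into the normalization step, which is exactly what the perturbation test catches:

```python
import numpy as np

rng = np.random.default_rng(2)
X_train, y_train = rng.normal(size=(50, 2)), rng.integers(0, 2, 50)
X_test = rng.normal(size=(20, 2))

def learn(X_tr, y_tr, X_te, snoop):
    # Normalization parameters: the snooping variant peeks at the test inputs.
    pool = np.vstack([X_tr, X_te]) if snoop else X_tr
    mu = pool.mean(axis=0)
    # The "hypothesis" is just the least-squares weights on the shifted inputs.
    w, *_ = np.linalg.lstsq(X_tr - mu, y_tr, rcond=None)
    return w

def snooping_detected(snoop):
    g = learn(X_train, y_train, X_test, snoop)
    X_weird = np.zeros_like(X_test)        # set test inputs to strange values
    g_prime = learn(X_train, y_train, X_weird, snoop)
    return not np.allclose(g, g_prime)     # True => test set influenced g
```

With `snoop=False` the hypothesis is bit-for-bit identical under the perturbation; with `snoop=True` it shifts, flagging the leak.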

In learning from data, you must pay a price for any choices made using the data. Sometimes the price is small or even zero, and sometimes it is high. With snooping through input preprocessing, the price is not easy to quantify; however, it is non-zero.


Quote:
Originally Posted by daniel0
I just got home, so I was able to read through some of Chapter 5 on data snooping. The exchange-rate prediction example seems particularly vulnerable to this problem. I can't express it formally at the moment, but it seems like labels from the test set are making their way into the training set, since the input data contains values that perfectly match labels (that is, the label from observation i is part of the input of observation i+1, given the way the data set is constructed). I would be interested in the results if the same experiment were run on a much sparser dataset, such that any given rate change shows up in only one row of data.

So I suppose there may be cases where incorporating test input data (not labels, just the raw unsupervised inputs) is benign (like the example I gave in earlier posts), but it could have consequences in non-obvious ways.

Regarding dimensionality reduction, I've seen references to negative consequences and benign consequences. I have not run any experiments myself. It sounds like it could have non-obvious consequences (similar to the consequences of using test data for getting normalization parameters from Lecture 17).

Here's an example where someone reports a problem:
"I detected only half of the generalization error rate when not redoing the PCA for every surrogate model"
http://stats.stackexchange.com/quest...ain-test-split

Here's an example where someone had no problem:
http://mikelove.wordpress.com/2012/0...and-test-data/

As before, any insight would be greatly appreciated, especially if any of these ideas have been formalized elsewhere.

As I mentioned earlier, it's more obvious to me why validation performance may be inflated if labels from the test data were known (snooped) at the time of training. The following video provides an example and explanation:
https://www.youtube.com/watch?v=S06JpVoNaA0

-Dan
__________________
Have faith in probability