Your questions are indeed subtle.
It is very important to heed the warning at the bottom of page 95.
I highly recommend problem 9.10 as a concrete example of what can go wrong.
The problem that occurs can be illustrated with PCA, which performs a form of dimensionality reduction. PCA identifies an 'optimal' lower-dimensional manifold on which the data sit. If you identify this manifold using the test inputs, then you will (in some sense) throw away the least possible amount of the test inputs' information, retaining only the part of each test input that lies in the optimal lower dimension. If, instead, you do the PCA using only the training data, you will construct your lower-dimensional manifold to throw away the least amount of information in your training set. When you then apply this manifold to the test data (for which it is not optimal), you may find that you have thrown away important information in the test inputs, which will hurt your test error.
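Here is a minimal sketch of the two procedures using scikit-learn's PCA (the data here is synthetic, just to show the mechanics): the correct version fits the manifold on the training inputs only and reuses that same projection on the test inputs, while the snooped version lets the test inputs influence the manifold.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))  # 100 training inputs, 10 features
X_test = rng.normal(size=(20, 10))    # 20 test inputs

# Correct: identify the lower-dimensional manifold from the training
# inputs only, then apply the same projection to the test inputs.
pca = PCA(n_components=3).fit(X_train)
X_train_low = pca.transform(X_train)
X_test_low = pca.transform(X_test)

# Snooped: fitting PCA on all the inputs lets the test set influence
# the manifold, so the reported test error becomes optimistically biased.
pca_snooped = PCA(n_components=3).fit(np.vstack([X_train, X_test]))
```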
The golden rule is that to make predictions on your test set, you can *only* use information from your training set. That is the way it is in practice, and that is the way you should evaluate yourself during the learning phase.
Here is a very simple way to check if you have data snooped.
Before you do any learning, assume the data has been split into a training set and a test set for you. Run your entire learning process and output your final hypothesis g. Now go and set all the data in your test set to strange values, e.g. 0 for all the inputs and random target labels. Run your entire learning process again on this new pair of (unchanged) training set and perturbed test set, and output your final hypothesis g'. If g is not equal to g', then there has been data snooping: the test set is in some way influencing your choice of g.
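The check above could be sketched like this. Here `run_pipeline` is a hypothetical stand-in for your entire learning process; for illustration it deliberately snoops by computing normalization statistics from all the inputs, so the check fires.

```python
import numpy as np

def run_pipeline(X_train, y_train, X_test, y_test):
    """Hypothetical learning process. It (wrongly) centers the data
    using the mean of ALL inputs, including the test inputs -- a form
    of data snooping that this check is designed to expose."""
    mu = np.vstack([X_train, X_test]).mean(axis=0)  # snooping!
    X = X_train - mu
    # Final hypothesis: ordinary least-squares weights.
    w, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    return w

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = rng.normal(size=50)
X_test = rng.normal(size=(10, 3))
y_test = rng.normal(size=10)

g = run_pipeline(X_train, y_train, X_test, y_test)

# Perturb the test set: zero inputs, random target labels.
X_bad = np.zeros_like(X_test)
y_bad = rng.normal(size=10)
g_tilde = run_pipeline(X_train, y_train, X_bad, y_bad)

# If the final hypothesis changed, the test set was influencing it.
snooped = not np.allclose(g, g_tilde)
print("data snooping detected:", snooped)
```

If `run_pipeline` had computed `mu` from `X_train` alone, `g` and `g_tilde` would be identical and the check would pass.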
In learning from data, you must pay a price for any choices made using the data. Sometimes the price is small or even zero, and sometimes it is high. With snooping through input preprocessing, the price is not easy to quantify; however, it is nonzero.
Quote:
Originally Posted by daniel0
I just got home, so I was able to read through some of chapter 5 on data snooping. It seems that the exchange rate prediction example is particularly vulnerable to the problem. I can't express it formally at the moment, but it seems like labels from the test set are making their way into the training set, since the input data consists of data that perfectly matches labels (that is, a label from observation i will be part of the input data of observation i+1, given the way the data set is constructed). I would be interested in the results if the same experiment were run with a much sparser dataset, such that any given rate change shows up in only one row of data.
So I suppose that there may be cases where incorporating test input data (not the labels, just the raw unsupervised inputs) is benign (like the example I gave in earlier posts), but it could have consequences in non-obvious ways.
Regarding dimensionality reduction, I've seen references to both negative and benign consequences. I have not run any experiments myself. It sounds like it could have non-obvious consequences (similar to the consequences of using test data to estimate the normalization parameters, from Lecture 17).
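For the normalization case, the non-snooping discipline is the same as for PCA: estimate the parameters from the training set only, then reuse them on the test set. A minimal sketch (synthetic data, NumPy only):

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
X_test = rng.normal(loc=5.0, scale=2.0, size=(30, 4))

# Correct: normalization parameters come from the training set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_n = (X_train - mu) / sigma
X_test_n = (X_test - mu) / sigma  # same parameters reused; no snooping

# The snooped version would instead compute mu and sigma from
# np.vstack([X_train, X_test]), letting the test inputs leak into
# the preprocessing.
```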
Here's the example where someone references a problem:
"I detected only half of the generalization error rate when not redoing the PCA for every surrogate model"
http://stats.stackexchange.com/quest...aintestsplit
Here's an example where someone had no problem:
http://mikelove.wordpress.com/2012/0...andtestdata/
As before, any insight would be greatly appreciated, especially if any of these ideas have been formalized elsewhere.
Like I mentioned earlier, it's more obvious to me why validation performance may be inflated if labels from the test data were known (snooped) at the time of training. The following video provides an example and explanation:
https://www.youtube.com/watch?v=S06JpVoNaA0
Dan
